The Part of the Build Nobody Talks About
Introduction
At the start of the week, the plan sounded simple: test the remaining options, compare the outputs, and figure out which tools were worth building around. What actually happened was less about finding the most impressive demo and more about figuring out which tools could survive real agency work — client brand rules, logo constraints, tone, layout requirements, and output that someone actually has to edit and sign off on. That is a harder test than most demos are designed for.
The biggest shift wasn’t about which tool won. It was realising that intake isn’t a setup step before the work begins. For AI workflows, intake is part of the actual system.
The tools aren’t doing the same thing — and one of them was mostly hype
The website tool evaluation started with three options and ended with one. Stitch was the most disappointing. The demos looked promising: quick site concepts, clean visual output, useful as a starting point. In practice it kept inventing brand names, dropped logos or replaced them with unrelated icons, and produced designs with no real hierarchy. A law firm and a streetwear brand came out looking nearly identical, just in different colours. Rather than saving time, it created correction work before the output could even be properly evaluated. We learned quickly that backing from a major company doesn’t automatically make a tool production-ready. Google Labs is an environment for experimentation, and Stitch felt exactly like that.
The Claude pipeline we were already building held up. Clean HTML, responsive layouts, consistent design systems across pages. But the most important thing wasn’t the model — it was how directly the output reflected the brief. Detailed intake produced sites that felt intentional. Vague intake produced something generic. Claude wasn’t guessing. It was responding to exactly what it was given.
On the image side, the same lesson arrived faster and louder.
The image pipeline has been running long enough to fail in interesting ways, which has been more useful than looking at the clean wins. Some models were ruled out quickly. Others were clearly stronger at following visual references and preserving brand direction. But the bigger finding was that brand consistency isn’t only a model problem — it’s an intake problem first.
When we tested with just a logo, a colour name, and a short prompt, results looked polished but felt generic. Right colours, wrong personality. When we included detailed brand guidelines, image style preferences, logo rules, tone, and examples of past creative work, the outputs changed noticeably. We saw this most clearly when recreating LinkedIn graphics from a real client brand kit: the results landed far closer to the actual brand than anything the general-purpose tools had produced. Because Image AI stores and passes the full brand context into every generation request, the model isn’t starting from scratch each time.
The weaker outputs came from rules that were never written down. Logo exclusion zones, spacing habits, how much negative space the brand tends to use — things a designer just knows. The system can’t enforce rules it doesn’t have. That’s not a model failure. It’s an incomplete brief. Those are very different problems.
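One way to make that distinction visible is to check the brief before anything is generated. The sketch below is a minimal illustration, not our actual intake code: the `BrandIntake` class, its field names, and the `missing_rules` helper are all hypothetical, chosen only to show how unwritten rules could be flagged as gaps in the brief rather than discovered later as bad output.

```python
from dataclasses import dataclass, field


# Hypothetical intake record; the field names are illustrative assumptions,
# not the schema of any real tool.
@dataclass
class BrandIntake:
    brand_name: str
    colours: list[str] = field(default_factory=list)
    tone: str = ""
    logo_exclusion_zone: str = ""
    spacing_rules: str = ""
    negative_space: str = ""


def missing_rules(intake: BrandIntake) -> list[str]:
    """Return the names of intake fields that were left empty."""
    return [name for name, value in vars(intake).items() if not value]


# A brief with only a name, a colour, and a tone leaves three rules unwritten.
intake = BrandIntake(brand_name="Example Co", colours=["navy"], tone="warm")
print(missing_rules(intake))
# prints ['logo_exclusion_zone', 'spacing_rules', 'negative_space']
```

A check like this doesn’t make the output better on its own. It just tells you whether a weak result is worth blaming on the model yet.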
What the research phase actually produced
The most useful output from this week wasn’t a website or a graphic. It was a clearer picture of what the intake framework needs to be. Both projects pointed to the same conclusion: brand identity, design preferences, content goals, logo rules, tone, platform requirements, and examples of past work all need to become part of the system — collected properly once and reused everywhere, not reconstructed from a logo file at the start of every campaign.
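As a rough sketch of what “collected once and reused everywhere” could look like, the example below stores one brand record and attaches it to every request, whether the task is a web page or a graphic. Everything in it is an assumption made for illustration: the `BRAND_CONTEXT` keys, the `build_request` helper, and the prompt layout are hypothetical and don’t describe how Image AI or the Claude pipeline actually format their requests.

```python
import json

# Hypothetical brand record; every key and value here is illustrative.
BRAND_CONTEXT = {
    "brand_name": "Example Co",
    "tone": "warm, plain-spoken",
    "logo_rules": "never recolour; keep clear space around the mark",
    "platforms": ["website", "linkedin"],
}


def build_request(task: str, context: dict) -> str:
    """Prefix a task description with the full stored brand context."""
    context_block = json.dumps(context, indent=2, sort_keys=True)
    return f"BRAND CONTEXT:\n{context_block}\n\nTASK:\n{task}"


# The same stored record feeds both pipelines; only the task changes.
site_prompt = build_request("Draft a homepage hero section.", BRAND_CONTEXT)
image_prompt = build_request("Create a LinkedIn announcement graphic.", BRAND_CONTEXT)
```

The point of the design is that the context lives in one place, so a correction to the logo rules reaches every later request without anyone rebuilding the brief.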
A lot of AI tools make the same promise: give us a prompt and we’ll give you an output. But real client work doesn’t live in one prompt. It lives in the context around the prompt — the brand kit, the old website, the things the designer knows not to do, the review comments that never made it into a formal guideline.
The question we’re asking has shifted. It’s no longer “which AI tool produces the best output?” It’s “which workflow gives the AI enough context to produce something useful, editable, and on-brand?” The model is one part of that. The intake, the pipeline, and the review steps are the rest.
Week three was less flashy than week two, but probably more important. The gap between a promising demo and a production workflow is still real. For us, it’s starting to look less like a model problem and more like a context problem, which is less magical but a lot more solvable.