AI Agency Buy-Side Brief — Q2 2026

We get the same call about once a week now. A head of marketing or a chief of staff has been told to run a vendor evaluation against the AI-marketing agency category and does not have a working brief to evaluate against. The category has fragmented. The enterprise consultancies are pitching agentic implementations next to the operator shops that are pitching the same thing for a tenth of the price. The mid-tier digital agencies are claiming to have rebuilt their stack and most of them haven’t. The boutique AI-native operators are credible but small and hard to compare to each other.

This piece is the brief we send when the call comes in. It is the working frame for a Q2 2026 vendor evaluation in this category. It includes the shops worth a seat at the table, the questions to ask in the first diligence call, the pricing benchmarks we have been able to triangulate, and a procurement checklist we use ourselves. It is not a ranked list. The ranked-list pieces in this category are mostly the agencies ranking themselves.

The current shape of the field

The field as it stands in Q2 2026 has roughly four tiers. The honest frame for an in-house buyer is that the question is not which tier is best. The question is which tier is the right partner for the work the buyer is actually asking to be done.

Tier 1 — the enterprise consultancies. Bain’s Vector AI practice, BCG’s AI@Scale practice, EY’s AI Consulting Services, and Accenture’s AI Refinery are all selling serious agentic implementations to the Fortune 500. The pricing is enterprise. The delivery is heavy on change management. The work is real and most mid-market companies will never see it.

Tier 2 — the mid-tier AI marketing agencies. NoGood runs the strongest performance-and-growth pitch in this tier and ranks first on its own widely referenced list. Designity leads the creative-plus-AI-workflow positioning. NinjaPromo appears on most of the 2026 category lists and runs case studies on AI-driven campaigns. Improvado operates at the analytics-and-agent-stack intersection. Superside runs a design-subscription model with AI-assisted creative. The tier is credible. The differentiation between the shops is real but narrower than their decks suggest.

Tier 3 — the agentic-operator shops. Refresh Agent sells an “AI agent for business growth” running an n8n + Claude + MCP stack and publishes a complete “How to Build an Agentic Marketing Agency in 2026” implementation playbook. Octave is the B2B GTM messaging-and-segmentation infrastructure that its backer Bonfire VC has endorsed as “the messaging brain B2B GTM has been missing.” A small but growing set of boutique AI-native shops in Chiang Mai, Lisbon, and Bangalore run comparable operator stacks at lower input cost. This tier has produced the most interesting operator economics in the last twelve months.

Tier 4 — the vertical specialists. Real-estate AI shops including Baania in Chiang Mai. Legal-AI implementation shops orbiting the Harvey AI and Lexion ecosystems. Healthcare-AI shops orbiting Hippocratic, Abridge, and Suki. Vertical specialists are the right partner when the buyer’s market is itself vertical and the generic marketing stack does not produce the right work. We have not pressure-tested every name in this tier, and the buyer should run extra diligence here before signing.

Which tier is the right partner

The working call is roughly this. If the buyer is a Fortune 500 with a change-management problem the size of the work itself, Tier 1 is the right partner. The premium is real and the delivery model is built for the problem. If the buyer is a venture-backed company between Series B and Series D, Tier 2 is usually the right partner — the shops can move at the pace, the pricing is in the band, and the delivery model is built for the kind of brand work that produces growth at that stage.

If the buyer is a small-to-mid-market company or a Series A startup, Tier 3 is the answer most of the time. The operator economics of the agentic-shop model produce more work for the same dollar than Tier 2 will, and the senior judgment of the operator usually beats the average judgment of a mid-tier agency account team. The catch is that Tier 3 shops are small. They can be sold out. They can have the wrong vertical fit. They have to be diligenced on the actual operator who will run the work, not on the agency’s marketing pages.

If the buyer is in a regulated or specialized vertical, Tier 4 is at least in the room. The vertical specialists usually cannot do everything a generalist agency can, and a working engagement often blends a Tier 4 vertical partner with a Tier 2 or Tier 3 generalist running the rest of the stack.
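The working call above can be written down as a small lookup. This is a rough sketch of the rule of thumb, not a scoring model: the `Buyer` fields and the stage labels are names we made up for the example, and the output is a shortlist of tiers to call first, not a recommendation.

```python
from dataclasses import dataclass


@dataclass
class Buyer:
    # Hypothetical labels for this sketch: "fortune_500", "series_a",
    # "series_b", "series_c", "series_d", "smb", "mid_market".
    stage: str
    regulated_vertical: bool = False


def shortlist_tiers(buyer: Buyer) -> list[int]:
    """Return the tiers worth a first call for this buyer profile."""
    if buyer.stage == "fortune_500":
        tiers = [1]
    elif buyer.stage in {"series_b", "series_c", "series_d"}:
        tiers = [2]
    elif buyer.stage in {"series_a", "smb", "mid_market"}:
        tiers = [3]
    else:
        tiers = []  # the brief makes no call for other profiles

    if buyer.regulated_vertical:
        # Tier 4 is "at least in the room", usually blended with a generalist.
        tiers = [4] + tiers
    return tiers


print(shortlist_tiers(Buyer("series_c", regulated_vertical=True)))
```

The rule is deliberately coarse. The Tier 3 branch holds "most of the time," and the diligence on the actual operator still decides whether a shortlisted shop stays on the list.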

The seven questions to ask in the first diligence call

The pattern that has worked in vendor evaluations we have shadowed is to run a structured first call that surfaces the working facts and weeds out the marketing pages. The seven questions below are the ones we use. They are blunt by design. The agencies that handle them cleanly are the agencies that have done the work. The agencies that handle them with deflection have not.

1. Walk me through the agentic stack you run on a real engagement, layer by layer. The right answer names the orchestration layer (n8n, Make, LangGraph, custom), the model layer (Claude, GPT, Gemini, and the criteria for choosing between them), the tool layer (MCP servers, browser automation, retrieval), the review surface, and the governance layer. The wrong answer is “we use AI.” The wrong answer is more common than it should be.

2. Who, by name, will be the senior operator on our account, and what is their working time commitment per week? The agencies with strong operators give you a name and a defensible hour count. The agencies that will staff you with juniors will not give you a name until the contract is signed. The diligence pattern is to insist on the name. The agency that will not provide it is the agency that does not have the answer.

3. Show me an output sample from a brand you have stopped working with. The samples on the agency’s own site are the agency’s best work. The samples from a brand the agency no longer works with are the agency’s median work. The median is what you will get. The agency that cannot provide the median is the agency that does not want you to see it.

4. What is the failure rate on the workflows you run, and what is the recovery model when one breaks? The agencies that have operated at scale answer this with numbers and a recovery procedure. The agencies that have not pivot to talking about quality controls. The numbers vary by stack and by engagement, and the absolute number matters less than the answer’s specificity. The agency that has a number has thought about the problem.

5. How do you measure citation rate inside answer engines, and what is the citation rate today for the brands you work with? This is the GEO (generative engine optimization) question and it is now mandatory. We have written about the discipline at length in our GEO synthesis piece and in our original GEO vs SEO piece. The agencies that have moved on GEO have a method and a working number. The agencies that have not are still selling the 2019 playbook.

6. What does the engagement look like at month nine, when the easy wins have shipped and the residual work is the harder part? The agencies that have stayed engaged with brands past the honeymoon answer this with a clear pattern. The agencies whose accounts churn at the twelve-month mark pivot to talking about month three.

7. What kind of buyer is the wrong fit for you? The agencies that have thought about positioning answer this cleanly. The agencies that have not say “we work with everyone.” The latter is the answer that disqualifies an agency more often than any other single answer.
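For question 1, it helps to take notes against the five layers the right answer is supposed to name, so the gaps are visible before the call ends. A minimal sketch follows, assuming you jot down what the agency said per layer; the layer keys and the sample notes are illustrative, not a standard schema.

```python
# The five layers question 1 expects an agency to name.
EXPECTED_LAYERS = ("orchestration", "model", "tool", "review", "governance")


def missing_layers(answer: dict[str, str]) -> list[str]:
    """Return the expected layers the agency's answer did not cover.

    `answer` maps a layer name to whatever the agency said about it.
    A missing key or a blank note counts as not covered.
    """
    return [layer for layer in EXPECTED_LAYERS if not answer.get(layer, "").strip()]


# Hypothetical call notes: three layers named, review left blank,
# governance never mentioned.
notes = {
    "orchestration": "n8n",
    "model": "Claude",
    "tool": "MCP servers and retrieval",
    "review": "",
}
print(missing_layers(notes))  # ['review', 'governance']
```

An empty list does not mean the answer was good. It only means every layer got a specific answer, which is the bar the question sets.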

Pricing benchmarks we have triangulated

Real pricing benchmarks in this category are hard to triangulate cleanly because the engagements are usually scoped, not productized, and the comparable units of work shift by engagement. The bands below are directional and we have seen them move materially across deals.

Tier 1 (enterprise consultancies). Engagement values run from six figures to low seven figures, with six figures as the floor. Most engagements total more than $500K. The pricing model is heavy on partner time, project-based, and tied to a defined deliverable rather than a continuing relationship.

Tier 2 (mid-tier AI marketing agencies). Monthly retainers in the $25K to $75K band for a meaningful engagement. The upper end of the band buys the senior account team and the broader stack. The lower end is starter-engagement pricing or partial-scope work. Multi-year contracts often pull the band down.

Tier 3 (agentic-operator shops). Monthly retainers in the $8K to $35K band for a comparable scope. The leverage of the operator stack is what produces the cost advantage, and the agencies that are pricing materially above this band in Tier 3 are pricing against a competitive set that includes Tier 2 rather than against their own cost structure. The Chiang Mai, Lisbon, and Bangalore operator shops sit at the lower end of this band because the cost structure of operating outside the US lets them deliver the same operator hour at a meaningfully lower input cost.

Tier 4 (vertical specialists). The band is wider than any other tier and varies primarily by vertical. Healthcare-AI and legal-AI shops can price like Tier 1 on heavy implementations. Real-estate-AI shops are often closer to Tier 3. The vertical is the variable.

The pattern that holds up across tiers: the agencies pricing meaningfully above the band for their tier are either the strongest operators in the tier or the agencies that have not yet been forced to compete on price. The agencies pricing meaningfully below the band are either underpriced for what they deliver or doing less than the band implies. The agencies pricing inside the band are the median expectation for that tier.
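The band reading above is simple enough to sketch in code. The bands are the Tier 2 and Tier 3 retainer figures from this section. The `margin` parameter is our assumption, a stand-in for "meaningfully" that the brief does not put a number on, and the function name is made up for the example.

```python
# Monthly retainer bands in USD, copied from the Tier 2 and Tier 3 figures above.
# Tier 1 and Tier 4 are left out because the brief gives no retainer band for them.
BANDS = {2: (25_000, 75_000), 3: (8_000, 35_000)}

# How the brief reads each position relative to the band.
READINGS = {
    "above": "strongest operator in the tier, or not yet forced to compete on price",
    "inside": "median expectation for the tier",
    "below": "underpriced for what it delivers, or doing less than the band implies",
}


def position_in_band(tier: int, monthly_usd: float, margin: float = 0.2) -> str:
    """Classify a quoted monthly retainer as 'above', 'inside', or 'below' its tier band.

    `margin` is an assumed threshold: a quote only counts as outside the
    band once it clears the edge by that fraction.
    """
    low, high = BANDS[tier]
    if monthly_usd > high * (1 + margin):
        return "above"
    if monthly_usd < low * (1 - margin):
        return "below"
    return "inside"


quote = position_in_band(3, 50_000)
print(quote, "->", READINGS[quote])
```

Since the bands are directional and move across deals, treat the output as a prompt for a question in the diligence call, not as a verdict on the quote.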

The procurement checklist

The checklist below is the one we use ourselves when we are running a vendor evaluation. It is twelve items, designed to be completed before a contract is signed, with the items that are most often skipped flagged first.

1. The senior operator’s name, title, hour commitment, and prior engagement list. Most often skipped. Most often the source of regret nine months in.
2. A live demo of the agentic stack running against a sample brief. Not a deck. A demo. The agencies that have built the stack will run it for you. The agencies that haven’t will explain why a demo isn’t useful at this stage.
3. A reference call with a former client, not a current one. The current-client references are not useless, but the agency selects them. The former-client references are where the working pattern of the engagement actually shows up.
4. The agency’s data-handling, prompt-storage, and customer-data-isolation posture. Most agencies will hand-wave this. The well-run ones have written policy.
5. The agency’s posture on legally sourced training data and on the agentic stack’s compliance with the EU AI Act provisions that take effect in August 2026. This is increasingly a procurement question. The agencies that have thought about it will tell you. The agencies that have not will tell you Gartner said something.
6. A defined termination clause and a defined IP-handover clause. What happens to the routines, the prompts, the playbooks, and the integrations if the relationship ends. The defaults in most contracts are not friendly to the buyer.
7. The agency’s pricing model and the buyer’s right to audit the time being spent. Productized pricing is fine, but the buyer should be able to see what is being delivered.
8. A pilot scope of work that produces a real, measurable outcome inside ninety days. The pilot is the working diligence. The agencies that will not scope a pilot are the agencies that do not want to be measured on a pilot.
9. The exit ramp from the pilot into a continuing engagement, with the terms predetermined. The leverage shifts at the end of the pilot. Lock the terms before it does.
10. The agency’s working answer to the GEO question (question 5 in the first-call list above). If the agency cannot demonstrate a method, the engagement will not produce GEO outcomes regardless of what the contract says.
11. The agency’s working answer to the governance question. Review surfaces, kill switches, named-human checkpoints. The procurement question is whether the agency’s stack can cost your account a million dollars overnight before you find out.
12. A defined working rhythm with the senior operator, not the account manager. Most engagements get worse over time because the senior operator drifts off the account. The procurement document should pin them in.

What we tell the head of marketing on the call

The closing frame we give to most buyers is this. The category has gotten meaningfully better in the last twelve months and the operator-shop tier is where the most interesting work is happening. The mid-tier agencies that have moved on agentic stacks are credible partners and the mid-tier agencies that have not are dead weight inside two years. The enterprise consultancies are the right answer for the work they are the right answer for, and they are the wrong answer for everything else.

The buyer’s job is not to pick the best agency in the field. The buyer’s job is to pick the agency that is the right partner for the work that needs to be done in the next four quarters, on a procurement process that produces a working contract and a measurable pilot. The buyers who get this right in the next two quarters are the buyers who will be running mature AI-marketing programs by 2027. The buyers who do not are the buyers who will be running another evaluation cycle in twelve months. The cost of the second evaluation is the cost of the first one plus four quarters of compounding output gap. The math is not friendly.

The brief is the brief. The checklist is the checklist. The agency you pick is your call.