Every premium report we ship leads with a section called "Why Now?" — the part that explains why a trend is surfacing right now: the cultural drivers, the market forces, the key players, the quantified signals. It's the most-read part of the product, and for a long time it was powered by a single vendor: Perplexity.
This is the story of why we tested a replacement, what a fair head-to-head actually showed (including the part where our first conclusion was wrong), and how a billing outage turned a careful experiment into a full migration — without us ever having to scramble.
Why touch something that works?
The honest answer: not because Perplexity was visibly broken, but because we'd spent months making sure we'd never be stuck with it.
Early on we made an architectural bet that every external dependency — search, scraping, LLM calls — should sit behind a port: a small, vendor-neutral interface. The function that writes "Why Now?" doesn't call Perplexity. It calls a research port. Which vendor answers is a configuration detail, swappable per call site from a database row, with a deterministic sample-rate gate so we can ramp traffic from 0% to 100% and watch the metrics move.
Building that machinery is only worth it if you actually use it. So when we wanted to know whether a different stack could do "Why Now?" better, the question wasn't "can we afford to find out?" — it was a config flip and an eval run.
The challenger: Google Search (via SerpAPI) + Claude. Instead of asking one model to both search and synthesize, you split the job — pull the top organic results for the query, hand the snippets to Claude, and have it write the structured analysis, citing results by index. We called it research-context-serpclaude. Same request shape, same response contract, drop-in behind the same port.
The test: same 25 queries, two engines, blind judges
We assembled a golden set of 25 representative queries — fast-moving and slow, consumer and B2B, mainstream and niche — and ran both engines over the identical inputs. Then we graded on two axes:
- Structure — did the engine return the analysis as the structured fields the product needs (drivers, forces, players, signals, each with evidence and a citation), or as one undifferentiated blob of prose?
- Quality — which output was better, judged blind by two independent models from different families (Claude Opus and GPT‑4.1), neither told which engine produced which answer.
The cross-family judging matters: if you let one vendor's model grade its own family's output, you've measured loyalty, not quality.
The plot twist: our first read was wrong
Here's the part worth being honest about. The first time we ran this compare, the SerpAPI/Claude arm looked like the loser. Claude kept returning its analysis as markdown prose, our JSON parser choked, and the structured arrays came back empty. The early report basically said: nice idea, not ready, Perplexity is cleaner — wait.
That conclusion was real, and it was wrong — not about what we measured, but about why. The problem wasn't the model's analysis; it was that we were politely asking for JSON and hoping. The fix was to stop asking and start forcing: a forced tool call that pins Claude to the exact schema, so it physically cannot return free-form prose. We shipped that, re-ran the identical eval — and the result didn't just improve, it inverted.
The lesson we keep relearning: when an LLM "can't do" something structured, check whether you're constraining it or trusting it. Forced tool use turned a failing arm into the winning one overnight.
The head-to-head, post-fix
Same 25 queries, re-run live against both deployed engines:
| Perplexity (incumbent) | SerpAPI + Claude (challenger) | |
|---|---|---|
| Complete structured analysis | 5 / 25 | 25 / 25 |
| Returned an unstructured blob (defect) | 20 / 25 (80%) | 0 |
| Median structured items per report | 0 | 27 |
| Blind quality verdict (who's better) | won 1 / 25 | won 24 / 25 |
| Judge agreement | — | 100%, unanimous |
| Latency per report | ~31s | |
| Cost per report | ~$0.051 |
Two things jumped out.
First — the incumbent was quietly degraded in production. On 80% of queries, Perplexity returned the whole analysis as one long block of text with every structured field empty. Because Perplexity was the live engine, that meant a large share of real premium reports were shipping degraded "Why Now?" sections right then. The eval didn't just compare a challenger; it surfaced a production defect we'd have otherwise kept missing.
Second — the challenger won on the merits, not on a technicality. Two models from different vendors, judging blind, picked SerpAPI/Claude on 24 of 25 queries and agreed on every single one. Zero splits, zero ties. The one query Perplexity won was a case where it happened to return well-formed output. When a Claude judge and a GPT judge agree unanimously against the Claude-powered arm's competitor… well, they agreed for it, unanimously, which is about as un-rigged as a verdict gets.
We didn't hide the trade-offs:
- Speed. The challenger is ~40 seconds slower per report (two steps instead of one). "Why Now?" runs in the background, so this is acceptable — but it's real.
- Thin sub-sections. Three forward-looking sections (timing, risk signals, opportunity windows) sometimes came back sparse, because the model refused to assert what the search snippets didn't support. We'd rather it omit than fabricate — but we made that omit-or-infer behavior a tunable knob rather than a hardcoded rule.
- Cost. About +2¢ per report (~$0.051 → ~$0.072): a bigger structured output plus a search fee. At 10,000 reports/month that's roughly +$200 — immaterial against the quality jump.
- Citations. The two engines cited almost entirely different sources (overlap near zero). Expected — they search different indexes — and not a defect.
The decision — and then the forcing function
The data was decisive, so we recorded the call as an architecture decision: adopt SerpAPI + Claude for "Why Now?", rolled out as a gradual sample-rate ramp with daily monitoring, not a hard cutover. Ship it behind the gate at 0%, ramp as quality holds, keep the old arm as the control.
Then reality intervened. Before we'd finished ramping, the Perplexity account hit its billing quota — every call started returning 401 — you exceeded your current quota. Overnight, the incumbent wasn't degraded; it was down. And it wasn't just "Why Now?": Perplexity also quietly backed trend enrichment, contradiction analysis, and an admin research batch job. All of them started erroring.
This is the moment the architecture earned its keep. Because every one of those consumers already sat behind the same swappable port, the fix wasn't a frantic rewrite — it was the migration we'd already validated, applied everywhere:
- Flip the "Why Now?" gate to 100% SerpAPI/Claude (a config row, no deploy).
- Repoint the standalone research panel to the new engine.
- Extract the search → forced-Claude → derive-citations pipeline into one reusable helper, then point trend enrichment, contradiction analysis, and the batch processor at it — each keeping its own response shape, each a small, tested change.
A quota outage that could have been a multi-day incident became a day of methodical, test-backed migrations, because we'd done the hard part — proving the replacement and building the seams — before we needed it.
Where citations come from now
One principle fell out of all this worth stating plainly. We deliberately did not chase a second LLM that returns its own citations. In the new design, citations aren't something the model hands us — they're derived from the search results we already grounded it on: the model tags each claim with the index of the supporting result, and we map those indices back to URLs. The "second citation source" isn't another vendor's bundled web search; it's the swappable search port itself. That keeps the citation trail honest (it traces to a real result we retrieved) and vendor-independent.
The last decision: keep the loser on the bench
The tidy ending would be "we deleted Perplexity." We didn't — on purpose.
Our final call was to remove Perplexity from every production code path — not just gated off by a config row, but defaulted off in the code itself, so production never touches it regardless of configuration — while keeping the adapter and the old engine in the tree for testing. We still want to be able to run Perplexity head-to-head in the future: as the control arm in our compare harness, as a button in our internal QA console, or by setting a single config row to route test traffic back to it.
A challenger you can't re-run against the incumbent isn't a fair fight anymore. So the incumbent stays on the bench — benched, not buried.
Takeaways
- Build the swap before you need the swap. Ports and a sample-rate gate turned "evaluate a vendor" and "survive a vendor outage" into the same cheap operation.
- Force structure; don't request it. A forced tool call flipped a "failing" model into the winning one. Our first conclusion was wrong because of how we asked, not what we asked.
- Judge blind, and cross-family. Unanimous agreement between models from different vendors is the closest thing to an unbiased quality signal we've found.
- Your eval will find production bugs. Ours surfaced that 80% of live reports were degraded — something no dashboard had flagged.
- Derive citations from what you retrieved. Grounding the model on a search you control beats trusting a model's self-reported sources.
- Don't burn the boats. Removing a vendor from production and deleting it are different decisions. Keep the bench deep enough to re-test.
The numbers in this post are from a fixed 25-query evaluation run, scored blind by Claude Opus and GPT‑4.1, plus live production telemetry. Internal references: the compare findings, ADR 006 (the provider flip), and ADR 007 (the citations decision).
Appendix — How it works under the hood
For the engineers: a breakdown of how the two engines are actually implemented.
The one-line difference
Perplexity is one opaque call — you send a prompt, it searches the web internally and returns prose you hope is JSON. SerpAPI/Claude is a two-step pipeline you control — you run the search yourself, hand the results to Claude, and force it to emit a structured schema.
Call flow
PERPLEXITY (1 call, search hidden inside the vendor)
────────────────────────────────────────────────────
prompt ──► llm.call(perplexityAdapter)
└─ POST api.perplexity.ai/chat/completions (model: sonar-pro)
• Perplexity does its own web search
• returns: content (text) + citations[] (native URLs)
└─ extractJSON(content) ◄── brittle: parse the prose, hope it's JSON
└─ on failure: dump everything into one field
SERPAPI + CLAUDE (2 steps, search is yours)
────────────────────────────────────────────────────
query ──► search.search() [step 1: retrieval]
└─ SerpAPI Google → top 10 organic results
└─ SearchHit[] { url, title, description }
──► llm.call(anthropicAdapter) [step 2: synthesis]
• prompt = base + numbered "SEARCH RESULTS [0]…[9]" block
• tools: [RESEARCH_TOOL] + tool_choice FORCED
• model: claude-sonnet-4-6
└─ returns a schema-conformant tool input (guaranteed structure)
──► deriveCitations(toolOutput, hits) [step 3: citations]
• walk output for source_index ints → searchHits[i].url
Component-by-component
| Perplexity | SerpAPI + Claude | |
|---|---|---|
| Network calls | 1 (bundled) | 2 (search, then synthesize) |
| Who searches | Perplexity, internally, opaque | You — explicit SerpAPI call behind a search port |
| LLM adapter | llm-perplexity.ts, sonar-pro |
llm-anthropic.ts, claude-sonnet-4-6 |
| Structure enforcement | Prompt asks for JSON → extractJSON/repairJSON |
Forced tool (tool_choice pinned) → schema-guaranteed |
| Citations | Native citations: string[] from the vendor |
Derived from source_index → search URLs |
| Recency control | providerHints: { search_recency_filter: "month" } |
Encoded in the search query / grounding strategy |
| Tunables | none beyond the prompt | grounding strategy: strict / relax_inference / more_results / second_search |
| Capabilities declared | ["system_prompt", "citations"] |
["system_prompt", "tool_use"] |
The 4 differences that actually matter
1. Search: hidden vs. owned
Perplexity's search is a black box inside the model — you can't see, swap, or inspect the results. SerpAPI/Claude makes retrieval a first-class step behind its own port:
// SerpAPI/Claude — retrieval is explicit and swappable
const search = buildSearchProvider({ defaultAdapter: makeSearchSerpapiAdapter(...) });
const hits = (await search.search({ query, limit: 10, callSite })).results;
// hits: SearchHit[] { url, title, description }
That's why the search vendor is itself swappable (SerpAPI today, something else tomorrow) — independent of the LLM.
2. Structure: "ask and parse" vs. "force"
This is the difference that flipped the eval. Perplexity was asked, in the prompt, to "format your response as JSON," and then we parsed the prose:
// Perplexity path — hope-it's-JSON
const result = await llm.call({ messages, options:{ providerHints:{ search_recency_filter:"month" } } });
const parsed = extractJSON(result.content); // 3 fallbacks + repairJSON; ~80% failed in prod
When the model's JSON escaping broke, extractJSON threw and the catch-fallback dumped the entire analysis into one field. SerpAPI/Claude replaces "ask and parse" with a forced tool call — the schema is the contract:
// SerpAPI/Claude — schema-guaranteed, no parsing
const result = await llm.call({
messages,
required_capabilities: ["tool_use"],
tools: [RESEARCH_TOOL], // input_schema = the exact output shape
options: { providerHints: { tool_choice: { type:"tool", name: RESEARCH_TOOL_NAME } } },
});
const parsed = result.tool_calls.find(t => t.name === RESEARCH_TOOL_NAME).input; // already structured
The model cannot return free-form prose. No regex, no repair, no fallback-dump.
3. Citations: vendor-reported vs. derived from your search
Perplexity hands you URLs it claims it used (result.citations). SerpAPI/Claude has no native citations — instead, Claude tags each claim with the index of the supporting search result, and we map those indices back to URLs:
// generic, shape-agnostic — walks the whole tool output for source_index ints
function deriveCitations(parsed, hits) {
const seen = new Set();
collectSourceIndices(parsed, seen); // recursive
return [...seen].filter(i => i >= 0 && i < hits.length).map(i => hits[i].url);
}
The citation trail is now honest by construction — every URL traces to a result you actually retrieved, not to a model's self-report. (This is the ADR 007 decision: the "second citation source" is the search port, not another vendor's bundled search.)
4. The call site: same shape, swapped guts
Because both sit behind the same port and return the same contract, a consumer's call site barely changes. The old Perplexity site:
const llm = buildLLMProvider({ functionName, functionalArea, defaultAdapter: perplexityAdapter });
const r = await llm.call({ messages, options:{ providerHints:{ search_recency_filter:"month" } } });
const data = parseEnrichmentResponse(r.content, r.citations ?? []); // brittle regex
The new one delegates to the shared helper with its own schema + mapper:
const data = await runSearchGroundedClaude<EnrichmentData>({
searchQuery, functionName, functionalArea, callSiteSearch, callSiteSynth,
toolSpec: ENRICHMENT_TOOL, // shape-specific forced tool
systemPrompt,
buildUserPrompt: (hits) => appendSearchResults(basePrompt, hits),
mapResponse: (parsed, citations) => mapEnrichmentToolOutput(parsed, citations),
mapFallback: (cleaned, hits) => ({ ...cleaned..., citations: hits.map(h => h.url) }),
});
Failure modes, side by side
| Failure | Perplexity | SerpAPI + Claude |
|---|---|---|
| Model returns malformed JSON | Common → empty arrays, prose dumped in one field (the 80% prod defect) | Impossible — forced tool can't emit prose |
| A section has no supporting evidence | Model may hallucinate to fill it | Omits it (strict) or marks inference (relax) — never fabricates a citation |
| Vendor outage | Hard failure (the quota 401) | Two independent failure points, but each swappable per port |
| Wrong/stale sources | Invisible — can't inspect what it searched | Visible — the SearchHit[] is right there to audit |
What made them interchangeable
The reason the outage migration was a day and not a week: a single generic helper, runSearchGroundedClaude<T>(params, opts), holds the whole SerpAPI→forced-Claude→deriveCitations skeleton. Each consumer supplies only three things — a toolSpec (its output schema), a prompt builder, and a mapResponse/mapFallback pair — and gets a structured, citation-grounded result in its own shape. The original runSerpClaudeResearch is now just one thin wrapper over it; enrichment, contradiction analysis, and the batch job are three more.