What we measured
Factuality has four parts. For a given question about a real account, a good answer:
- picks the right metric (link CPC, not all-clicks CPC, for a traffic campaign),
- reports the right number (matching the source data),
- uses the right scope (your campaigns in the window, not lifetime or a stale subset),
- and invents nothing — no fabricated figures, no manufactured problems.
How we tested
Pick a real account
A fast-growing advocacy organization with a month of delivery across seven campaigns — no synthetic data.
Establish ground truth
We pulled the true figures from the Meta Marketing API for the exact date window, before running any surface.
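A ground-truth pull of this kind could be built roughly as follows. This is a minimal sketch, not the query the benchmark ran: the API version, ad account ID, and dates are placeholders, and the access token is deliberately left out. The field list requests both click definitions from the Insights endpoint so that all-clicks and link-click figures can be compared side by side.

```python
import json
from urllib.parse import urlencode

# Assumptions for illustration only: version, account ID, and dates are placeholders.
API_VERSION = "v19.0"
AD_ACCOUNT_ID = "act_0000000000"

# Insights fields covering spend, reach, and both click definitions.
FIELDS = [
    "campaign_name", "spend", "reach",
    "clicks", "cpc", "ctr",  # all-clicks metrics
    "inline_link_clicks", "cost_per_inline_link_click", "inline_link_click_ctr",  # link-click metrics
]

def build_insights_url(since: str, until: str) -> str:
    """Build a campaign-level Insights request URL for one fixed date window."""
    params = {
        "level": "campaign",
        "time_range": json.dumps({"since": since, "until": until}),
        "fields": ",".join(FIELDS),
    }
    base = f"https://graph.facebook.com/{API_VERSION}/{AD_ACCOUNT_ID}/insights"
    return f"{base}?{urlencode(params)}"

# Hypothetical one-month window; the real benchmark window is not stated here.
url = build_insights_url("2026-01-01", "2026-01-31")
```

Pinning `time_range` explicitly is what makes the reference figures comparable to each surface's answer for the same window.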
Ask 12 questions, two ways
The questions range from “what was my total spend?” to “how did my campaigns do?”. We asked each one twice: once naming the exact metric, and once the way a marketer actually phrases it.
Asking each question both ways is the point. Naming the metric (“what was the link CTR?”) lets a surface fetch a field. Phrasing it like a user (“which campaign had the best CPC?”) forces it to choose and correct the metric. The gap between those two scores is how we measure robustness.
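The robustness gap described above is simple arithmetic: average the per-question scores for each phrasing, then subtract. The sketch below uses made-up per-question scores (not benchmark data), and assumes each question is scored between 0 and 1.

```python
from statistics import mean

def robustness_gap(named_scores, natural_scores):
    """Percentage points lost when the question stops naming the metric."""
    return (mean(named_scores) - mean(natural_scores)) * 100

# Hypothetical per-question scores for one surface, for illustration only.
named = [1.0, 1.0, 0.75, 1.0]     # metric named explicitly
natural = [1.0, 0.5, 0.75, 0.75]  # phrased the way a marketer would ask

gap = robustness_gap(named, natural)
```

A gap near zero means the surface answers equally well either way; a large positive gap means it depends on the user naming the right field.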
Results
We compared three ways to get an answer from the same Meta data: Foxley (the harness), Claude (Opus 4.8) with Meta’s official Ads MCP (the same model on Meta’s own, thinner tool layer), and Meta AI (Meta’s built-in assistant). Scored against Meta’s own API data across the 12 questions:

| Getting the answer via | Factuality score |
|---|---|
| Foxley | 99.4% |
| Claude (Opus 4.8) with Meta’s official Ads MCP | 91% |
| Meta AI | 59.6% |
June 2026 update: we replicated the results with Fable 5, Anthropic’s newest flagship model — Foxley scored 100% vs. Meta’s official Ads MCP at 94%. The small shifts from the Opus run are run-to-run variance, not a model effect; the tool harness is what determines factuality.
Where the gap comes from
All three read the same data. The difference is the tool layer between the model and the numbers.
- Without metric guidance. Meta’s official Ads MCP has no native link-click cost or rate field, so the same model (Claude Opus 4.8) returned all-clicks numbers for “best CPC” and named the wrong winner: the right number for the wrong metric.
- Without guardrails. Meta AI invented metric values — including campaign costs off by more than 10× — and held them steady across follow-up questions. It also flagged healthy campaigns as problems and pulled in an ad from years outside the window.
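To see how the metric choice alone can change the winner, consider two campaigns with equal spend. The figures below are hypothetical, chosen only to illustrate the effect; they are not from the benchmark account.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Campaign:
    name: str
    spend: float
    clicks: int       # all clicks (includes reactions, profile taps, etc.)
    link_clicks: int  # clicks on the link to the destination

def cpc_all(c: Campaign) -> float:
    """Cost per click, counting every click."""
    return c.spend / c.clicks

def cpc_link(c: Campaign) -> float:
    """Cost per link click, the relevant cost for a traffic campaign."""
    return c.spend / c.link_clicks

def best_by(campaigns, metric) -> str:
    """Name of the campaign with the lowest value of the given cost metric."""
    return min(campaigns, key=metric).name

# Hypothetical campaigns for illustration only.
campaigns = [
    Campaign("A", spend=100.0, clicks=500, link_clicks=100),
    Campaign("B", spend=100.0, clicks=250, link_clicks=200),
]

winner_all_clicks = best_by(campaigns, cpc_all)  # "A" (0.20 vs 0.40)
winner_link = best_by(campaigns, cpc_link)       # "B" (0.50 vs 1.00)
```

Both answers are arithmetically correct. Only the second answers the question a marketer running a traffic campaign is asking.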
What keeps Foxley steady
When we stopped naming the metric and asked the way a marketer would, Foxley stayed effectively flat (99.2% → 99.6% across the two phrasings). The other surfaces dropped: Claude with Meta’s MCP by about 18 points, Meta AI by about 11. The reason is the same one behind the CPC miss above: Foxley chooses and corrects the metric for the question. For a traffic campaign it reports link CPC and says so, instead of returning whichever field was named.
Beyond the scored questions
Outside the 12 scored questions, while using Claude (Opus 4.8) with Meta’s official Ads MCP, we ran into four behaviors worth flagging, the kind of defaults a harness is built to prevent. They varied a lot with how carefully the tool was driven:
- No recency default. Asked “how did my campaigns do,” it expanded to three years of lifetime history (~50 campaigns) and surfaced unrelated older campaigns instead of the recent flight a marketer means.
- Time lost to unproductive calls. It cycled through context and trend tools that returned no performance data before finding the numbers — at one point noting it was “going in circles.”
- All-clicks CPC, not link CPC. Asked for the best CPC, it reported cost per all clicks (~4–5) instead of cost per link click, the metric that matters for a traffic campaign.
- Summed reach instead of deduplicated reach. Asked for total reach, it added per-campaign reach — double-counting people reached by more than one campaign — instead of the deduplicated account total.
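The reach double-count in the last item can be illustrated with sets. This is a toy model with hypothetical user IDs; reach is reported as an aggregate rather than as a list of people, so in practice the deduplicated figure would have to come from an account-level query rather than from combining campaign rows.

```python
# Hypothetical audiences for illustration only: who each campaign reached.
campaign_audiences = {
    "A": {"u1", "u2", "u3"},
    "B": {"u2", "u3", "u4"},
}

# Summing per-campaign reach counts u2 and u3 twice.
summed_reach = sum(len(people) for people in campaign_audiences.values())

# Deduplicated reach counts each person once across the account.
deduplicated_reach = len(set().union(*campaign_audiences.values()))
```

Here the summed figure is 6 while only 4 distinct people were reached, and the overstatement grows with audience overlap between campaigns.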
This is a snapshot: one account, one month, 12 questions. All three surfaces read the same Meta Marketing API — the benchmark measures the harness, not the underlying data. Surfaces change over time, so we re-run it and publish updates. MktF1 covers factual accuracy only; diagnosis and strategy are separate benchmarks we report on as they mature.
Run it on your own account
Ask Foxley the same questions about your campaigns inside the Outfox app, or connect an MCP client to the MCP endpoint at https://app.outfox.ai/api/mcp and ask from there. The Meta analytics tools are the ones under test.