Foxley factuality benchmark

When you ask Foxley how your ads are doing, you act on the answer. Foxley reads the same Meta Marketing API that powers your Ads Manager — the difference is the harness on top of it: choosing the right metric, scoping to the right campaigns, and checking the numbers before they reach you. MktF1 measures how much that harness matters for getting a factual answer. This page covers what it measures, how we ran it, and the results.

What we measured

Factuality has four parts. For a given question about a real account, a good answer:

picks the right metric (link CPC, not all-clicks CPC, for a traffic campaign),
reports the right number (matching the source data),
uses the right scope (your campaigns in the window, not lifetime or a stale subset),
and invents nothing — no fabricated figures, no manufactured problems.
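The scope requirement can be sketched in a few lines. This is an illustration only: the `DailyRow` record, the campaign names, the dates, and the spend figures are all hypothetical and are not taken from the benchmark account. It shows how the same rows give different totals depending on whether the answer is scoped to the window or to lifetime.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DailyRow:
    """One day of delivery for one campaign (hypothetical record shape)."""
    campaign: str
    day: date
    spend: float


def spend_in_window(rows: list[DailyRow], start: date, end: date) -> float:
    """Total spend for rows whose day falls inside [start, end], inclusive."""
    return sum(r.spend for r in rows if start <= r.day <= end)


def campaigns_in_window(rows: list[DailyRow], start: date, end: date) -> set[str]:
    """Names of campaigns with at least one day of delivery in the window."""
    return {r.campaign for r in rows if start <= r.day <= end}


# Hypothetical data: one recent flight plus an unrelated older campaign.
rows = [
    DailyRow("Spring flight", date(2026, 4, 20), 120.0),
    DailyRow("Spring flight", date(2026, 5, 3), 80.0),
    DailyRow("Spring flight", date(2026, 5, 28), 100.0),
    DailyRow("Old promo", date(2023, 9, 1), 400.0),
]

window_start, window_end = date(2026, 5, 1), date(2026, 5, 31)
window_spend = spend_in_window(rows, window_start, window_end)
lifetime_spend = sum(r.spend for r in rows)
in_scope = campaigns_in_window(rows, window_start, window_end)
```

Here the window total and the lifetime total differ by several hundred dollars, and the older campaign drops out of scope entirely. An answer built on the lifetime rows would report a real number for the wrong scope.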

How we tested

Pick a real account

A fast-growing advocacy organization with a month of delivery across seven campaigns — no synthetic data.

Establish ground truth

We pulled the true figures from the Meta Marketing API for the exact date window, before running any surface.

Ask 12 questions, two ways

The questions ranged from “what was my total spend?” to “how did my campaigns do?” We asked each question once naming the exact metric, and once the way a marketer actually phrases it.

Score every answer

Each answer was graded 1–5 on accuracy, completeness, actionability, and trust, against the ground truth.

Asking each question both ways is the point. Naming the metric (“what was the link CTR?”) lets a surface fetch a field. Phrasing it like a user (“which campaign had the best CPC?”) forces it to choose and correct the metric. The gap between those two scores is how we measure robustness.
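The gap calculation can be sketched as follows. This page does not state how the four 1–5 grades are combined into a percentage, so the aggregation here (an unweighted mean of the four grades, divided by 5) is an assumption, and the grades are hypothetical. Only the final subtraction reflects what the text describes.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "actionability", "trust")


def answer_score(grades: dict[str, int]) -> float:
    """Convert one answer's four 1-5 grades to a percentage.

    ASSUMPTION: unweighted mean divided by the maximum grade. The
    benchmark's actual formula is not given on this page.
    """
    for dim in DIMENSIONS:
        if not 1 <= grades[dim] <= 5:
            raise ValueError(f"{dim} grade must be between 1 and 5")
    return 100 * mean(grades[dim] for dim in DIMENSIONS) / 5


def surface_score(answers: list[dict[str, int]]) -> float:
    """Average the per-answer percentages for one surface and one phrasing."""
    return mean(answer_score(g) for g in answers)


def robustness_gap(named: list[dict[str, int]], natural: list[dict[str, int]]) -> float:
    """Points lost when the metric is no longer named in the question."""
    return surface_score(named) - surface_score(natural)


# Hypothetical grades for two questions, asked both ways.
perfect = {"accuracy": 5, "completeness": 5, "actionability": 5, "trust": 5}
wrong_metric = {"accuracy": 3, "completeness": 4, "actionability": 4, "trust": 3}

named_phrasing = [perfect, perfect]
natural_phrasing = [perfect, wrong_metric]

gap = robustness_gap(named_phrasing, natural_phrasing)
```

A gap near zero means the surface answers the marketer's phrasing as well as the metric-named one. A large positive gap means it depends on being told which field to fetch.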

Results

We compared three ways to get an answer from the same Meta data: Foxley (the harness), Claude (Opus 4.8) with Meta’s official Ads MCP (the same model on Meta’s own, thinner tool layer), and Meta AI (Meta’s built-in assistant). Scored against Meta’s own API data across the 12 questions:

Getting the answer via	Factuality score
Foxley	99.4%
Claude (Opus 4.8) with Meta’s official Ads MCP	91%
Meta AI	59.6%

Foxley got every number right across all 12 questions, with no fabrications. That holds whether you use the in-app chat or connect to the MCP suite — same tools, same result.

June 2026 update: we replicated the results with Fable 5, Anthropic’s newest flagship model — Foxley scored 100% vs. Meta’s official Ads MCP at 94%. The small shifts from the Opus run are run-to-run variance, not a model effect; the tool harness is what determines factuality.

Where the gap comes from

All three read the same data. The difference is the tool layer between the model and the numbers.

Without metric guidance. Meta’s official Ads MCP has no native link-click cost or rate field, so the same model (Claude Opus 4.8) returned all-clicks numbers for “best CPC” and named the wrong winner — the right number for the wrong metric.
Without guardrails. Meta AI invented metric values — including campaign costs off by more than 10× — and held them steady across follow-up questions. It also flagged healthy campaigns as problems and pulled in an ad from years outside the window.

These are the errors a harness exists to catch.
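The metric mix-up is easy to reproduce with arithmetic. The sketch below uses two hypothetical campaigns (names, spend, and click counts are invented for illustration) to show how ranking by all-clicks CPC and ranking by link CPC can name different winners from the same data.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class CampaignStats:
    """Hypothetical per-campaign totals for one date window."""
    name: str
    spend: float
    all_clicks: int   # every click on the ad, including reactions and profile clicks
    link_clicks: int  # only clicks on the link to the destination


def all_clicks_cpc(c: CampaignStats) -> float:
    """Cost per click over all clicks; infinite if there were none."""
    return c.spend / c.all_clicks if c.all_clicks else float("inf")


def link_cpc(c: CampaignStats) -> float:
    """Cost per link click; infinite if there were none."""
    return c.spend / c.link_clicks if c.link_clicks else float("inf")


def best_by(campaigns: list[CampaignStats],
            metric: Callable[[CampaignStats], float]) -> str:
    """Name of the campaign with the lowest value of the given cost metric."""
    return min(campaigns, key=metric).name


# Hypothetical figures, not from the benchmark account.
campaigns = [
    CampaignStats("Campaign A", spend=500.0, all_clicks=300, link_clicks=100),
    CampaignStats("Campaign B", spend=500.0, all_clicks=250, link_clicks=125),
]

winner_all_clicks = best_by(campaigns, all_clicks_cpc)  # A: about $1.67 vs $2.00
winner_link = best_by(campaigns, link_cpc)              # B: $4.00 vs $5.00
```

Campaign A wins on all-clicks CPC and Campaign B wins on link CPC. Both calculations are arithmetically correct; for a traffic campaign, only the second answers the question that was asked.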

What keeps Foxley steady

When we stopped naming the metric and asked the way a marketer would, Foxley stayed effectively flat (99.2% → 99.6% across the two phrasings). The other surfaces dropped — Claude with Meta’s MCP by about 18 points, Meta AI by about 11. The reason is the same one behind the CPC miss above: Foxley chooses and corrects the metric for the question. For a traffic campaign it reports link CPC and says so, instead of returning whichever field was named.

Beyond the scored questions

Outside the 12 scored questions, while using Claude (Opus 4.8) with Meta’s official Ads MCP, we ran into four behaviors worth flagging — the kind of defaults a harness is built to prevent. They varied a lot with how carefully the tool was driven:

No recency default. Asked “how did my campaigns do,” it expanded to three years of lifetime history (~50 campaigns) and surfaced unrelated older campaigns — instead of the recent flight a marketer means.
Time lost to unproductive calls. It cycled through context and trend tools that returned no performance data before finding the numbers — at one point noting it was “going in circles.”
All-clicks CPC, not link CPC. Asked for the best CPC, it reported cost per all clicks (~$1.60–2.00) instead of cost per link click (~$4–5), the metric that matters for a traffic campaign.
Summed reach instead of deduplicated reach. Asked for total reach, it added per-campaign reach — double-counting people reached by more than one campaign — instead of the deduplicated account total.
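The reach double-count in the last item can be shown with a toy example. The person IDs and campaign names below are hypothetical. Real reporting exposes only counts, not person-level IDs, so in practice the deduplicated total would have to come from an account-level query rather than from adding campaign rows (an assumption about how the figure is obtained, not a description of Foxley's implementation).

```python
def summed_reach(reached: dict[str, set[str]]) -> int:
    """Add per-campaign reach; counts a person once per campaign that reached them."""
    return sum(len(people) for people in reached.values())


def deduplicated_reach(reached: dict[str, set[str]]) -> int:
    """Count each person once, however many campaigns reached them."""
    return len(set().union(*reached.values()))


# Hypothetical audiences: u2 and u3 saw ads from both campaigns.
reached_by_campaign = {
    "campaign_a": {"u1", "u2", "u3"},
    "campaign_b": {"u2", "u3", "u4"},
}

naive_total = summed_reach(reached_by_campaign)
true_total = deduplicated_reach(reached_by_campaign)
```

With this data the summed figure is 6 and the deduplicated figure is 4. The overstatement grows with audience overlap, which is why per-campaign reach cannot simply be added up.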

This is a snapshot: one account, one month, 12 questions. All three surfaces read the same Meta Marketing API — the benchmark measures the harness, not the underlying data. Surfaces change over time, so we re-run it and publish updates. MktF1 covers factual accuracy only; diagnosis and strategy are separate benchmarks we report on as they mature.

Run it on your own account

Ask Foxley the same questions about your campaigns inside the Outfox app, or connect an MCP client to the MCP endpoint at https://app.outfox.ai/api/mcp and ask from there. The Meta analytics tools are the ones under test.
