Technical Validation for AI Research Teams

Benchmarks and A/B evaluation showing how CragData improves RAG ingestion and agent research—numbers, honest coverage limits, and reproduction steps.

  • validation
  • benchmarks
  • rag
  • ai-agents

Technical validation for AI research & search teams

This article summarizes a hands-on technical validation of CragData as a building block for AI research, RAG ingestion, and domain-focused search pipelines.

Goal: show evidence (numbers + A/B evaluation) that CragData can improve research outcomes for an AI team, without claiming to “index the whole web”.

Positioning — what CragData is / isn’t

What it is

CragData is useful as a domain + niche grounding layer:

1. Build a niche/domain graph from a seed (GET /graph/domain-context).

2. Prioritize what to read (GET /graph/top-pages).

3. Extract structured text for RAG (POST /scrape).

What it isn’t

It is not a global web search engine. Some domains are:

  • behind login (302),
  • blocked (403),
  • heavily JS-rendered (low extracted text),

…and any scrape-based system needs fallbacks or alternate sources for those.
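
To make the fallback decision explicit, the three failure modes above can be turned into labels. The following is a minimal sketch, not CragData's own logic: the function name `classify_scrape` and its labels are hypothetical, and it assumes the caller already knows the target page's HTTP status and the extracted word count. The 150-word cutoff borrows the "useful scrape" threshold used in Bench A.

```python
def classify_scrape(status_code: int, word_count: int, min_words: int = 150) -> str:
    """Label one scrape attempt so the caller can choose a fallback.

    Hypothetical helper: the labels and the idea of passing the target's
    status code directly are assumptions, not part of the CragData API.
    """
    if status_code in (301, 302, 303, 307, 308):
        return "redirect"   # often a login wall (e.g. a dashboard URL)
    if status_code in (401, 403):
        return "blocked"    # auth required or anti-bot protection
    if status_code >= 400:
        return "error"      # 404 and other client/server errors
    if word_count < min_words:
        return "low_text"   # page loaded, but little text (likely JS-rendered)
    return "ok"
```

A caller could then route "blocked" and "redirect" results to an alternate source and send "low_text" results to a rendering step.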

Integration pattern (Python)

The core loop we recommend:

1. domain_context(seed) → context_for_ai + inbound/outbound domain lists

2. top_pages(domain) → most central internal pages (when available)

3. scrape(url) → structured JSON (title, content[], links[], og, word_count, …)

Example with httpx:

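import os

import httpx

API = "https://api.cragdata.com/v1"

def crag_headers():
    # Read the key (format: ck_live_...) from the environment, matching the
    # $CRAGDATA_API_KEY variable used in the curl examples, rather than hard-coding it.
    return {"Authorization": f"Bearer {os.environ['CRAGDATA_API_KEY']}"}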

def domain_context(client: httpx.Client, seed: str, auto_acquire: bool = True) -> dict:
    r = client.get(
        f"{API}/graph/domain-context",
        params={"seed": seed, "auto_acquire": str(auto_acquire).lower()},
        headers=crag_headers(),
    )
    r.raise_for_status()
    return r.json()

def top_pages(client: httpx.Client, domain: str, limit: int = 10) -> dict:
    r = client.get(
        f"{API}/graph/top-pages",
        params={"domain": domain, "limit": limit},
        headers=crag_headers(),
    )
    r.raise_for_status()
    return r.json()

def scrape_url(client: httpx.Client, url: str) -> dict:
    r = client.post(
        f"{API}/scrape",
        json={"url": url},
        headers=crag_headers(),
    )
    r.raise_for_status()
    return r.json()

Use context_for_ai in your agent system prompt, then scrape the top URLs returned by top_pages.
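
One way to wire the three calls together is sketched below. The glue function `gather_context` is hypothetical, and it assumes that the top-pages response looks like `{"pages": [{"url": ...}, ...]}` and that `context_for_ai` is a top-level key of the domain-context response; check both against the real responses before relying on this. The three API calls are passed in as callables so the sketch does not depend on a live connection.

```python
from typing import Callable

def gather_context(
    seed: str,
    get_context: Callable[[str], dict],
    get_top_pages: Callable[[str], dict],
    scrape: Callable[[str], dict],
    max_pages: int = 5,
) -> dict:
    """Run domain-context, then top-pages, then scrape for a single seed."""
    ctx = get_context(seed)
    pages = get_top_pages(seed)
    # ASSUMPTION: response shape {"pages": [{"url": "..."}]}; adjust to the real schema.
    urls = [p["url"] for p in pages.get("pages", []) if "url" in p][:max_pages]
    docs = []
    for url in urls:
        try:
            docs.append(scrape(url))
        except Exception as exc:
            # Keep going: one blocked or failing page should not abort the run.
            docs.append({"url": url, "error": str(exc)})
    return {
        "context_for_ai": ctx.get("context_for_ai", ""),
        "urls": urls,
        "docs": docs,
    }
```

With the helpers defined earlier and an open `httpx.Client` named `client`, a call could look like `gather_context("example.com", lambda s: domain_context(client, s), lambda d: top_pages(client, d), lambda u: scrape_url(client, u))`.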

Bench testing — throughput + usefulness

We ran a bench mode that:

  • executes many seeds/rounds,
  • calls domain-context and top-pages,
  • scrapes top candidates,
  • stores raw JSON artifacts,
  • computes latency percentiles, rate-limit events, useful scrape rate (text-size proxy), per-seed breakdown, and worst URLs (empty/blocked/redirect).
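
For readers who want to compute comparable metrics on their own runs, here is a small sketch of two of them. It is not the bench's actual code: it assumes a nearest-rank percentile definition (other definitions interpolate and give slightly different values), and it treats a scrape as useful when its word count reaches the threshold.

```python
import math
from typing import Sequence

def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def useful_rate(word_counts: Sequence[int], min_words: int = 150) -> float:
    """Fraction of scrapes whose extracted text has at least min_words words."""
    if not word_counts:
        return 0.0
    useful = sum(1 for words in word_counts if words >= min_words)
    return useful / len(word_counts)
```

Given a list of per-request latencies in milliseconds, `percentile(latencies, 50)`, `percentile(latencies, 90)`, and `percentile(latencies, 99)` give the p50/p90/p99 figures.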

Bench A — “happy path” RAG ingestion

Controlled run on scrape-friendly domains (startup plan).

Headline results:

  • Requests: 95
  • API status: 95/95 HTTP 200
  • Latency: p50 301 ms, p90 918 ms, p99 1482 ms, max 2227 ms
  • Scrapes attempted: 55 — rate limit events: 0
  • Useful scrape threshold: ≥ 150 words
  • Useful scrapes: 55/55 (100%)
  • Extracted text: average ~917.5 words/scrape (min 543, max 1594)

Interpretation: when a site is scrape-friendly and top-pages returns good candidates, CragData delivers high-density text fast enough for RAG pipelines and research agents.

Bench B — coverage across harder seeds

Same bench with a fallback URL strategy when top-pages is empty (/docs, /blog, /pricing, /solutions, /customers, /sitemap.xml, …).
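
The fallback strategy can be sketched roughly as follows. This is an illustration, not the bench's code: `candidate_urls` is a hypothetical helper, the `https://` scheme and the limit of 15 are assumptions, and the path list contains only the paths named above (the original list is longer).

```python
from typing import List, Sequence

# Only the fallback paths named in the text; the bench used additional ones.
FALLBACK_PATHS = ["/docs", "/blog", "/pricing", "/solutions", "/customers", "/sitemap.xml"]

def candidate_urls(domain: str, top_page_urls: Sequence[str], limit: int = 15) -> List[str]:
    """Prefer URLs from top-pages; fall back to common paths when that list is empty."""
    if top_page_urls:
        return list(top_page_urls)[:limit]
    return [f"https://{domain}{path}" for path in FALLBACK_PATHS][:limit]
```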

Headline results:

  • Useful scrapes: 27/60 (45%)
  • Useful seed coverage: 3/4 (75%)

Per-seed highlights:

  • cragsoftware.com — 15/15 useful scrapes
  • stripe.com — 9/15 useful (some URLs redirect to login, e.g. dashboard)
  • anthropic.com — 3/15 useful (mix of 302/404 + some content-rich pages)
  • openai.com — 0/15 useful (403 blocking / anti-bot)
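
The two headline figures follow directly from the per-seed counts. The sketch below recomputes them, assuming that "useful seed coverage" means the share of seeds with at least one useful scrape (an interpretation that matches the reported 3/4).

```python
# (useful scrapes, scrapes attempted) per seed, copied from the list above.
PER_SEED = {
    "cragsoftware.com": (15, 15),
    "stripe.com": (9, 15),
    "anthropic.com": (3, 15),
    "openai.com": (0, 15),
}

def summarize(per_seed: dict) -> dict:
    useful = sum(u for u, _ in per_seed.values())
    attempted = sum(a for _, a in per_seed.values())
    # ASSUMPTION: a seed counts as covered if it produced at least one useful scrape.
    covered = sum(1 for u, _ in per_seed.values() if u > 0)
    return {
        "useful": useful,
        "attempted": attempted,
        "useful_rate": useful / attempted,
        "covered_seeds": covered,
        "seed_coverage": covered / len(per_seed),
    }

summary = summarize(PER_SEED)
print(f"{summary['useful']}/{summary['attempted']} useful ({summary['useful_rate']:.0%})")
# prints: 27/60 useful (45%)
print(f"{summary['covered_seeds']}/{len(PER_SEED)} seeds ({summary['seed_coverage']:.0%})")
# prints: 3/4 seeds (75%)
```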

Interpretation: CragData is operationally stable, but coverage depends on the target domain. For blocked sites the right product behavior is detect + classify and route to alternate strategies (other domains, cached sources, sanctioned APIs, or rendering).

A/B evaluation — does it improve agent research?

We ran an A/B eval:

  • A (baseline): answer the research question with no CragData context.
  • B (with CragData): answer with context_for_ai + inbound/outbound lists + scraped snippets.
  • A judge model scores both answers (0–10) and picks a winner.

Configuration: seed cragsoftware.com, answer model gpt-4o-mini, judge model gpt-4o-mini, 3 questions.
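
The structure of such an eval can be sketched as a small harness. This is not the code used for the run above: `run_ab_eval` and its callable signatures are hypothetical, and the answer and judge models are passed in as plain functions so the sketch stays independent of any particular LLM client.

```python
from statistics import mean
from typing import Callable, Optional, Sequence, Tuple

# answer(question, context_or_None) -> answer text
AnswerFn = Callable[[str, Optional[str]], str]
# judge(question, answer_a, answer_b) -> (score_a, score_b), each on a 0-10 scale
JudgeFn = Callable[[str, str, str], Tuple[float, float]]

def run_ab_eval(questions: Sequence[str], context: str,
                answer: AnswerFn, judge: JudgeFn) -> dict:
    """Answer each question without (A) and with (B) context, then score both."""
    if not questions:
        raise ValueError("need at least one question")
    rows = []
    for question in questions:
        answer_a = answer(question, None)      # A: baseline, no context
        answer_b = answer(question, context)   # B: grounded with context
        score_a, score_b = judge(question, answer_a, answer_b)
        if score_b > score_a:
            winner = "B"
        elif score_a > score_b:
            winner = "A"
        else:
            winner = "tie"
        rows.append({"question": question, "score_a": score_a,
                     "score_b": score_b, "winner": winner})
    return {
        "b_wins": sum(r["winner"] == "B" for r in rows),
        "avg_a": mean(r["score_a"] for r in rows),
        "avg_b": mean(r["score_b"] for r in rows),
        "rows": rows,
    }
```

The per-question rows are kept alongside the averages so that individual wins and losses can be inspected, which matters with a sample as small as three questions.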

Results:

  • Winners: B won 3/3
  • Average judge score: A 6.67 vs B 9.00

Interpretation: when the agent receives domain-grounded context plus relevant scraped pages, answers become more specific, more actionable, and better grounded—less generic filler.

Example (question 1)

Question: “What are the top 3 capabilities offered, and how would I evaluate quality/risk?”

  • Baseline (A) stayed generic (“innovation / delivery / customer support”).
  • With CragData (B) the agent listed concrete capabilities from the site: machine learning solutions, data analytics & dashboards, web scraping.

What to say in a sales conversation

1. RAG ingestion quality: “~918 words/page on average; 55/55 useful pages in a controlled bench.”

2. Operational stability: “95/95 API calls returned 200; p90 latency under 1s.”

3. Research quality: “A/B eval: CragData-grounded answers won 3/3 (9.0 vs 6.7 average score).”

4. Honest boundary: “Some sites return 403—we detect that. CragData is domain grounding, not crawl-the-entire-web.”

Reproduce on your stack

Use the same three endpoints from the API docs or playground:

curl "https://api.cragdata.com/v1/graph/domain-context?seed=YOUR_DOMAIN" \
  -H "Authorization: Bearer $CRAGDATA_API_KEY"

curl "https://api.cragdata.com/v1/graph/top-pages?domain=YOUR_DOMAIN&limit=8" \
  -H "Authorization: Bearer $CRAGDATA_API_KEY"

curl -X POST https://api.cragdata.com/v1/scrape \
  -H "Authorization: Bearer $CRAGDATA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://YOUR_DOMAIN/"}'

Run your own A/B eval by injecting context_for_ai + 3–5 scraped snippets into the system prompt, then score answers with your judge model.
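
As a starting point, the prompt for the "with CragData" arm might be assembled like this. The helper `build_system_prompt`, its wording, and the per-snippet character cap are all assumptions for illustration; it expects `context_for_ai` as a string and the scraped snippets as plain text.

```python
from typing import Sequence

def build_system_prompt(context_for_ai: str, snippets: Sequence[str],
                        max_snippets: int = 5, max_chars: int = 1500) -> str:
    """Combine the domain context and a few scraped excerpts into one system prompt."""
    parts = [
        "Use the domain context and page excerpts below to ground your answer.",
        "",
        "Domain context:",
        context_for_ai.strip(),
    ]
    for i, snippet in enumerate(snippets[:max_snippets], start=1):
        # Truncate each excerpt so a single long page cannot crowd out the others.
        parts += ["", f"Excerpt {i}:", snippet.strip()[:max_chars]]
    return "\n".join(parts)
```

The baseline arm uses the same question with no system-prompt context, so the only difference between the two answers is the injected material.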

Next steps