The Discovery Layer for AI

Give your agents access to knowledge they didn't know existed. Discover hidden documents, subdomains, APIs, and relationships across the web — before you crawl.

1.2M+ pages crawled · 120k+ domains discovered

  • Discovery: not just crawling
  • Live: web data for agents
  • JSON: AI-ready extraction
  • $10/mo: Developer tier
Discover API · Crawl API · Extract API · Graph & Domains API · Analytics API · Export API · Always-on Crawl · Realtime Stream
— Why it matters

Static data kills production AI.

LLMs hallucinate on stale corpora. RAG breaks when embeddings are weeks old. You need a live structured web layer—not another scraper.

Without freshness

You export a dataset once, embed it, and ship. Two weeks later pricing, partners, and docs changed—but your agent still cites the old world.

With CragData

Plan sources with a niche graph, crawl on demand or on a schedule, extract AI-ready JSON, and deliver via API or webhooks—fresh structured web data for every answer.

— Proven in production tests

Benchmarks & A/B eval, not marketing fluff.

We ran controlled API benches and an A/B study (same LLM, with vs without CragData context). Full methodology, numbers, and honest coverage limits are published for your team to verify.

What CragData is

  • Niche/domain graph from a seed (`/graph/domain-context`)
  • Prioritized reading list (`/graph/top-pages`)
  • Structured text for RAG (`/scrape`)

What it isn't

  • Not a global web search engine
  • Some domains block scraping (403), redirect to login (302), or are JS-heavy
  • Scrape-based systems need fallbacks for hard targets
Bench A — RAG ingestion (scrape-friendly domains)
  • HTTP 200: 95/95
  • p90 latency: 918 ms
  • Useful scrapes (≥150 words): 55/55 (100%)
  • Avg words / scrape: ~918
A/B eval — same model, with vs without CragData context
  • Judge verdict: B (with CragData context) won 3/3
  • Avg score (B vs A): 9.00 vs 6.67

Seed: cragsoftware.com · 3 research questions · gpt-4o-mini (answer + judge)

Bench B — harder seeds + URL fallbacks

Useful scrapes 27/60 (45%) · useful seeds 3/4 (75%)

  • cragsoftware.com: 15/15 useful scrapes
  • stripe.com: 9/15 (some login redirects)
  • anthropic.com: 3/15 (302/404 mix)
  • openai.com: 0/15 (403 anti-bot)
  • RAG ingestion quality: ~918 words/page on average; 55/55 “useful” pages in a controlled bench run.
  • Operational stability: 95/95 API calls returned 200; p90 latency under 1s on the Startup plan.
  • Research output quality: in the A/B eval, CragData-grounded answers won 3/3 (avg score 9.0 vs 6.7).
  • Honest boundary: 403-blocked sites are detected. CragData is domain grounding, not “index the whole web.”
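The Bench B totals can be re-derived from the per-seed counts listed above. The sketch below is illustrative only: the per-seed numbers are copied from this page, the 150-word threshold is the one quoted for Bench A, and treating a "useful seed" as any seed with at least one useful scrape is an assumption, not a published definition.

```python
# Per-seed Bench B results copied from the list above: (useful scrapes, attempts).
BENCH_B = {
    "cragsoftware.com": (15, 15),
    "stripe.com": (9, 15),
    "anthropic.com": (3, 15),
    "openai.com": (0, 15),
}

USEFUL_MIN_WORDS = 150  # threshold quoted for Bench A; assumed to apply here too


def is_useful(text: str, min_words: int = USEFUL_MIN_WORDS) -> bool:
    """A scrape counts as useful when it contains at least `min_words` words."""
    return len(text.split()) >= min_words


def summarize(results: dict) -> dict:
    """Aggregate per-seed counts into overall totals."""
    useful = sum(u for u, _ in results.values())
    attempts = sum(a for _, a in results.values())
    # Assumption: a seed is "useful" if at least one of its scrapes was useful.
    useful_seeds = sum(1 for u, _ in results.values() if u > 0)
    return {
        "useful": useful,
        "attempts": attempts,
        "rate": useful / attempts,
        "useful_seeds": useful_seeds,
        "seeds": len(results),
    }


summary = summarize(BENCH_B)
print(f"{summary['useful']}/{summary['attempts']} useful scrapes, "
      f"{summary['useful_seeds']}/{summary['seeds']} useful seeds")
```

Under those assumptions the totals match the headline figures: 27/60 useful scrapes and 3/4 useful seeds.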

Full write-up with integration code, bench design, and reproduction steps →

— The real product

Fresh structured web data for AI systems.

Scraping is a commodity. The layer that matters is live, structured, citeable web intelligence—so agents and RAG stop hallucinating on stale corpora.

  • Datasets go stale — pricing, policies, and partners change daily.
  • LLMs need grounding — JSON + graphs + timestamps, not mystery text dumps.
  • RAG needs freshness — plan sources, crawl on demand, embed what changed.

Static dump

  • Exported once
  • No link graph
  • Unknown scraped_at
  • RAG drifts in weeks

CragData live layer

  • On-demand + scheduled crawl
  • Niche graph per seed
  • context_for_ai + scraped_at
  • Webhooks on change
— Architecture

Discover → Crawl → Extract → Structure → Deliver

Managed web intelligence pipeline—not a single-URL scraper script. Built for agents, RAG, and production data products.

  1. Discover: Expand domains and URLs from seeds.
  2. Crawl: Concurrent fetch with retries and rate limits.
  3. Extract: AI-ready JSON with content blocks, links, and metadata.
  4. Structure: Graph tables, niche scores, top pages.
  5. Deliver: REST API, webhooks, JSONL/Parquet export.

Infrastructure underneath

  • Job queues & orchestration
  • Automatic retries
  • Anti-bot resilience
  • Durable storage
  • Signed webhooks
  • Graph + scrape APIs
— Under the hood

Distributed crawling, visualized.

Try a seed domain below—interactive demo graph (no API key). Live crawl, export, and webhooks on your workspace require signup.

  • Pages indexed: 1.2M+
  • Domains discovered: 120k+
  • Graph edges: 1.4M+
  • Peak throughput: ~2.4k pages/min
  • Queue workers: auto-scale
  • Retry success: 94%
— How it works

Three steps. Seeds to JSON.

Add seeds, run the pipeline, consume structured data where your team already works.

01 · SEEDS

Add seeds

URLs or domains you care about.

02 · PIPELINE

Run Discover + Crawl + Extract

We map the web slice you defined.

03 · AI CONTEXT

Plan RAG with the graph

Call /graph/domain-context and /graph/top-pages so agents know which domains and URLs to read first.

— What you get

Nine APIs, one pipeline.

Discover, crawl, extract, query, export, and monitor—bundled into plans that scale with your volume.

02

Discover API

Expand your universe of sites from a handful of seeds. Discover mode surfaces new registrable domains and queues them for crawl—ideal for market maps and competitive landscapes.

03

Crawl API

Crawl at scale with guardrails. Every page becomes a node; every internal link becomes an edge—ready for graph analysis or downstream extraction.

04

Extract API

No raw HTML dumps. Get normalized JSON designed for search, LLM pipelines, and analytics—with noise stripped (nav, scripts, footers).

05

Graph & Domains API

Raw page/domain graph plus stats for dashboards. Use AI Context Graph when you need ranked, agent-ready context—not just nodes and edges.

06

Analytics API

Monitor pipeline health and coverage from one overview—see what's discovered, crawled, and extracted.

07

Export API

Download your graph or scrapes as JSONL or Parquet (Developer+). Wire exports into your warehouse or notebooks.

08

Always-on Crawl

Set seeds once; we keep discovering, deepening, and extracting on a schedule—built for monitoring, not one-off exports.

09

Realtime stream

Watch crawls live: node events, logs, and progress over a secure WebSocket—ideal for dashboards and internal tools.

— Why CragData

Built for AI systems at scale

Web intelligence infrastructure—not a hobby scraping tool.

Anti-bot resistant

Retries, rate limits, and scrapable flags so agents skip dead pages.

Distributed crawling

Queued jobs scale across workers—not one browser on a laptop.

Fresh web data

On-demand and scheduled crawls keep RAG off stale snapshots.

AI-ready output

Structured JSON and context_for_ai strings for system prompts.

Scalable infra

From 500 free calls/month to enterprise volume and SLAs.

Link graph exploration

Inbound, outbound, and related domains ranked before you embed.

— Use cases

Built for data & GTM teams.

Specific outcomes—not abstract “web data” promises.

GTM / Sales Intelligence

Map 200 competitors in 20 minutes

Seed your top 5 competitors. CragData discovers their ecosystem, extracts messaging, and builds the market map your sales team has been asking for.

SEO & Content

Bulk analyze any niche

Extract titles, headings, word counts, and internal links across 500 domains—without writing a single scraper or managing proxies.

Data / AI Teams

Clean corpora for LLM training

Structured JSON with provenance (URL, status, scraped_at). Feed embeddings and RAG pipelines directly. No HTML parsing, no noise.

Research & Journalism

Reproducible datasets with audit trail

Every page tracked with status code and timestamp. Crawls are reproducible. Pull datasets via API or export—cite it, share it, version it.

Competitive Intelligence

Always-on market monitoring

Continuous loops: discover → crawl → extract on schedule. Get alerted when new sites appear in your vertical.

— Better together

We don't replace your stack. We amplify it.

CragData is the discovery layer that sits upstream of everything else. Add it before you search, before you crawl, before you extract — and every downstream tool gets dramatically richer input.

Exa + CragData = 10× web coverage

Exa searches what's indexed. CragData discovers what isn't.

Firecrawl + CragData = Deeper discovery

Firecrawl extracts pages you give it. CragData finds the pages first.

Diffbot + CragData = Richer entities

Diffbot structures known entities. CragData surfaces the hidden ones.

Bright Data + CragData = Full intelligence

Bright Data collects at scale. CragData maps what to collect.

The value of CragData is not the data it returns — it's the data your agent didn't know existed.
— See the data

Clean JSON, not HTML dumps.

One call returns the niche topology around your seed: inbound/outbound domains ranked by link strength, scores, and a context_for_ai string ready for your system prompt.

// GET /v1/graph/domain-context?seed=ycombinator.com
{
  "seed_domain": "ycombinator.com",
  "context_for_ai": "Niche graph for ycombinator.com (depth 2 hops). 15 destinations, use top_outbound to plan RAG sources.",
  "seed": { "domain": "ycombinator.com", "pages_indexed": 12, "scrapable_pages": 12 },
  "top_outbound_domains": [
    { "domain": "startupschool.org",    "link_count": 16, "niche_score": 1.0,   "scrapable": true },
    { "domain": "news.ycombinator.com", "link_count": 9,  "niche_score": 0.56,  "scrapable": true },
    { "domain": "paulgraham.com",       "link_count": 6,  "niche_score": 0.375, "scrapable": true }
  ],
  "top_inbound_domains": [...],
  "related_domains": [...],
  "cached": true
}
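As a rough illustration of how an agent might consume this response, the sketch below filters the sample's outbound domains down to scrapable ones above a score cutoff and returns them best first. The field names come from the sample above; `plan_sources` and its `min_score` parameter are hypothetical helpers, not part of any CragData SDK.

```python
import json

# Trimmed copy of the sample response above (elided arrays are omitted).
SAMPLE = json.loads("""
{
  "seed_domain": "ycombinator.com",
  "top_outbound_domains": [
    {"domain": "startupschool.org", "link_count": 16, "niche_score": 1.0, "scrapable": true},
    {"domain": "paulgraham.com", "link_count": 6, "niche_score": 0.375, "scrapable": true},
    {"domain": "news.ycombinator.com", "link_count": 9, "niche_score": 0.56, "scrapable": true}
  ]
}
""")


def plan_sources(response: dict, min_score: float = 0.5) -> list:
    """Return scrapable outbound domains at or above `min_score`, best first."""
    candidates = [
        d for d in response.get("top_outbound_domains", [])
        if d.get("scrapable") and d.get("niche_score", 0.0) >= min_score
    ]
    candidates.sort(key=lambda d: d["niche_score"], reverse=True)
    return [d["domain"] for d in candidates]


print(plan_sources(SAMPLE))  # ['startupschool.org', 'news.ycombinator.com']
```

In this sample the scores happen to equal link_count divided by the largest link_count (6/16 = 0.375, 9/16 ≈ 0.56). The page does not state that this is the general formula, so treat it as an observation about the example only.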
— Pricing

Simple plans, bundled credits.

Free for PoCs, Developer from $10/mo, Startup for production pilots, Enterprise for volume and compliance.

Free
$0

Try the full pipeline on a small slice of the web. Perfect for evaluating data quality before you wire up production.

  • 500 API calls / month
  • 1 API key
  • Discover (up to 25 domains per run)
  • Dashboard access
  • 7-day data retention
  • Community support
Create free account
Startup
$99/mo

Production volume, webhooks, schedules, and higher limits. For teams shipping a feature, not a science project.

  • 50k API calls / month
  • 5 API keys
  • Webhooks & scheduled jobs
  • JSONL / Parquet export
  • 90-day retention
  • Email support
Get Startup
Enterprise
Custom

Dedicated infrastructure, custom limits, security review, SLAs, and solution design.

  • Custom volume & dedicated VPS
  • Custom export & integrations
  • SSO & audit logs (roadmap)
  • Custom robots/compliance policies
  • 99.9% SLA option
  • Named customer success
Book a demo
— FAQ

Questions, answered.

What is CragData?

CragData is web intelligence infrastructure: discover domains, crawl pages, extract structured JSON, and explore link graphs—built for AI agents, RAG, and data products.

What is web intelligence?

Web intelligence is the practice of turning the live web into structured, queryable data—graphs, entities, and documents—rather than one-off HTML snapshots.

What is AI-ready data?

JSON designed for models: clean content blocks, metadata, link graphs, and plain-English summaries like context_for_ai that drop into system prompts.

How should AI agents use CragData?

Call GET /graph/domain-context before broad search to pick domains, use GET /graph/top-pages to choose which URLs to read first, scrape those URLs, then embed. Patterns are in /docs and /llms.txt.
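A minimal sketch of the first two calls in that flow, using only the Python standard library. The host, the /v1 prefix, the `seed` query parameter, and the Bearer key format all appear elsewhere on this page; the `domain` parameter on /graph/top-pages and the `build_get` helper are assumptions made for illustration. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.cragdata.com/v1"  # host and /v1 prefix as shown on this page


def build_get(path: str, api_key: str, **params: str) -> Request:
    """Build (but do not send) an authenticated GET request."""
    query = f"?{urlencode(params)}" if params else ""
    return Request(
        f"{BASE_URL}{path}{query}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


# Step 1: niche graph for the seed.
context_req = build_get("/graph/domain-context", "ck_live_example", seed="ycombinator.com")

# Step 2: reading list for a chosen domain. The `domain` parameter name is a guess.
pages_req = build_get("/graph/top-pages", "ck_live_example", domain="startupschool.org")

print(context_req.full_url)
```

Passing either request to `urllib.request.urlopen` would send it; check /docs for the real parameter names before relying on the second call.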

What is live retrieval for AI agents?

Fetching fresh web data at query time instead of relying on frozen training corpora—so answers reflect current pages, pricing, and ecosystem links.

What is RAG web crawling?

Crawling only the URLs that matter for a user question, after planning sources with a graph—so embeddings stay relevant and token spend stays low.

How does distributed crawling work?

Jobs are queued, workers run concurrent fetches with retries and rate limits, and results land in graph tables and JSON exports you pull via API.

How do anti-bot systems work?

Sites use fingerprints, rate limits, and challenges. We rotate strategies, respect robots.txt, back off on failures, and surface scrapable flags in graph responses.

Why do AI agents need live web crawling?

Models trained on static snapshots cannot see today's pricing, partners, or news. Live crawl + extract keeps agent answers grounded in current pages.

Are static datasets dying for RAG?

For production agents, yes—freshness wins. Snapshots are fine for eval; customer-facing RAG needs scheduled or on-demand live ingestion.

How is CragData different from a scraper library?

We run discovery, queues, crawling, extraction, storage, and APIs as managed infrastructure—not a single-URL fetch you host yourself.

Do you crawl the entire internet?

No. You provide seeds; we discover and crawl within your plan limits and configuration.

What format is the data?

JSON per page (title, content[], links[], metadata) plus graph endpoints for domains, top pages, and hops.
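To show how that per-page shape might feed an embedding pipeline, here is a sketch that splits a page into word-bounded chunks and keeps provenance on each one. The top-level keys (`content`, `metadata`) follow the answer above; the shape of each content block and the location of `url` and `scraped_at` inside `metadata` are assumptions, and `page_to_chunks` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    url: str
    scraped_at: str


def page_to_chunks(page: dict, max_words: int = 200) -> list:
    """Split a page's content blocks into word-bounded chunks with provenance."""
    words = []
    for block in page.get("content", []):
        # Assumption: a block is either a plain string or a dict with a "text" key.
        text = block if isinstance(block, str) else block.get("text", "")
        words.extend(text.split())
    meta = page.get("metadata", {})
    return [
        Chunk(
            text=" ".join(words[i:i + max_words]),
            url=meta.get("url", ""),          # assumed location of the page URL
            scraped_at=meta.get("scraped_at", ""),  # assumed location of the timestamp
        )
        for i in range(0, len(words), max_words)
    ]
```

Carrying `url` and `scraped_at` on every chunk lets a retriever cite the source and filter out stale content at query time.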

Can I use my own database?

Yes. Pull via REST, export JSONL or Parquet on Developer+, or push events with webhooks into your warehouse.

Do you offer webhooks?

Yes — crawl.completed, discover.completed, page.extracted. Configure HTTPS URLs in Dashboard → Webhooks; payloads are HMAC-signed.
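On the receiving side, an HMAC-signed payload is typically checked by recomputing the signature over the raw request body and comparing in constant time. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the actual algorithm, encoding, and header name are not stated on this page, so confirm them in /docs.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature (algorithm and encoding are assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Return True when `signature` matches the raw body, using a constant-time compare."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))


# Example with a made-up secret and payload.
secret = b"whsec_example"
body = b'{"event": "crawl.completed", "job_id": "job_123"}'
received = sign(secret, body)  # stands in for the value sent in a signature header

print(verify(secret, body, received))         # True
print(verify(secret, body + b" ", received))  # False
```

Verify against the raw bytes of the request body, before any JSON parsing, since re-serializing can change whitespace and break the match.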

Is there an official SDK?

Python and Node clients live under packages/cragdata-python and packages/cragdata-js in our GitHub repo.

What is a niche graph?

A ranked view of who links to a seed domain, where the seed links out, and related clusters—with niche_score and scrapable flags for agents.

How fast can I integrate?

Most teams send a first /scrape or /graph/domain-context call within 15 minutes using the quickstart at /docs.

What are your rate limits?

Plan-based requests per second; see /docs#errors or GET /me for live caps. Headers include X-Credits-Remaining and X-RateLimit-Limit.
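A client can read those two headers after each response and slow down before it runs out of credits. This is a sketch under assumptions: the header names are taken from the answer above, but integer values and the `reserve` cutoff are illustrative choices, and `read_limits` and `should_pause` are hypothetical helpers.

```python
from typing import Mapping, Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse a header value as an int, returning None if missing or malformed."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def read_limits(headers: Mapping[str, str]) -> dict:
    """Extract credit and rate-limit values; header names are matched case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "credits_remaining": _to_int(lowered.get("x-credits-remaining")),
        "rate_limit": _to_int(lowered.get("x-ratelimit-limit")),
    }


def should_pause(headers: Mapping[str, str], reserve: int = 50) -> bool:
    """True when remaining credits are known and at or below the reserve."""
    remaining = read_limits(headers)["credits_remaining"]
    return remaining is not None and remaining <= reserve


print(read_limits({"X-Credits-Remaining": "42", "X-RateLimit-Limit": "5"}))
```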

Do you respect robots.txt?

Yes. You are responsible for lawful use; we encourage robots.txt compliance and reasonable crawl rates.

Is this legal?

You are responsible for how you use data. Enterprise plans include compliance discussions for sensitive use cases.

Can I monitor competitors?

Yes—seed competitor domains, schedule crawls, and diff structured JSON over time. See /use-cases/competitor-monitoring.

How does crawl orchestration with queues work?

POST /crawl or /discover returns a job_id; workers process the queue; you poll GET /crawl/{job_id} or listen on webhooks.
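The polling half of that flow can be sketched as a small loop with a timeout. The status values ("completed", "failed") and the `status` field name are assumptions, and `fetch_status` is a caller-supplied function (for example, a wrapper around GET /crawl/{job_id}), so the loop itself makes no network calls.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed"}  # assumed status values


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still running after {timeout}s")
        sleep(interval)
```

For long crawls, webhooks avoid the repeated requests entirely; polling is mainly useful in scripts and notebooks where no public HTTPS endpoint is available.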

What is structured extraction?

HTML is normalized to JSON blocks with nav/scripts stripped—ready for search indexes, analytics, or embedding pipelines.

CragData vs Firecrawl?

Firecrawl is strong for page-to-markdown. CragData adds niche graphs and agent-first source planning. See /compare/cragdata-vs-firecrawl.

CragData vs Apify?

Apify is an actor marketplace. CragData is an opinionated intelligence API with graphs and managed pipelines. See /compare/cragdata-vs-apify.

What uptime do you target?

The production API runs at api.cragdata.com with a health check at /v1/health. Enterprise plans offer a 99.9% SLA option, and status practices are documented for ops teams.

Do you support always-on crawls?

Yes—Dashboard → Schedules runs recurring discover/crawl jobs for monitoring use cases.

What is context_for_ai?

A plain-English summary in graph responses describing the niche topology—designed to paste into an agent system prompt before retrieval.

Who is CragData built for?

Teams shipping AI agents, RAG products, GTM enrichment, SEO research, and market intelligence who need live structured web data at scale.

Coming soon

  • Interactive API playground
  • Public status page with incident history
  • Zapier / Make connectors
— Get started

Turn websites into structured data.

Start free, upgrade to Developer for $10/mo, or talk to us about Enterprise volume and compliance.

No credit card for Free tier
Developer tier from $10/mo
Export JSONL / graph on paid plans
No scraper maintenance, ever
— Get in touch

Let's build your data pipeline.

Questions about plans, volume, or Enterprise? We respond within 24 hours.

  • Privacy: You control seeds. We process URLs you submit. You control retention and deletion.
  • Export: JSONL / API. Developer+ export and full REST access.
  • Free tier: 500 API calls / mo. Evaluate quality before production.
  • Enterprise: Custom SLA. Dedicated VPS & compliance review.

About CragData

CragData is a web crawl API and link graph service for AI agents and RAG pipelines. It crawls a seed domain, maps inbound/outbound links, and returns a niche topology graph with a context_for_ai string ready for system prompts—enabling agents to plan research sources before broad web search. Outputs: structured JSON, domain graphs, and scrapable page lists. Free tier: 500 calls/month, no credit card.

  • Product: CragData (crawl, extract, and graph APIs for structured web intelligence)
  • Primary endpoint: GET /graph/domain-context (niche topology + context_for_ai string)
  • Machine-readable spec: https://cragdata.com/llms.txt
  • Auth: Authorization: Bearer ck_live_* API key from dashboard
  • Free tier: 500 API calls/month, no credit card
  • Developer tier: $10/month, 10,000 calls/month
  • Parent company: Crag Group