The Discovery Layer for AI

Give your agents access to knowledge they didn't know existed. Discover hidden documents, subdomains, APIs, and relationships across the web — before you crawl.

Start free, no credit card · View API docs · Talk to sales

1.2M+ pages crawled · 120k+ domains discovered

Discovery

Not just crawling

Live

Web data for agents

JSON

AI-ready extraction

$10/mo

Developer tier

Discover API · Crawl API · Extract API · Graph & Domains API · Analytics API · Export API · Always-on Crawl · Realtime Stream

— Why it matters

Static data kills production AI.

LLMs hallucinate on stale corpora. RAG breaks when embeddings are weeks old. You need a live structured web layer—not another scraper.

Without freshness

You export a dataset once, embed it, and ship. Two weeks later pricing, partners, and docs changed—but your agent still cites the old world.

With CragData

Plan sources with a niche graph, crawl on demand or on a schedule, extract AI-ready JSON, and deliver via API or webhooks—fresh structured web data for every answer.

— Proven in production tests

Benchmarks & A/B eval, not marketing fluff.

We ran controlled API benches and an A/B study (same LLM, with vs without CragData context). Full methodology, numbers, and honest coverage limits are published for your team to verify.

Read the full validation report · Try the API playground

What CragData is

Niche/domain graph from a seed (`/graph/domain-context`)
Prioritized reading list (`/graph/top-pages`)
Structured text for RAG (`/scrape`)

What it isn't

Not a global web search engine
Some domains block scraping (403), redirect to login (302), or are JS-heavy
Scrape-based systems need fallbacks for hard targets

Bench A — RAG ingestion (scrape-friendly domains)

95/95 HTTP 200

918 ms p90 latency

55/55 (100%) useful scrapes (≥150 words)

~918 avg words / scrape

A/B eval — same model, with vs without CragData context

B (with CragData context) won 3/3 by judge verdict

9.00 vs 6.67 avg score (B vs A)

Seed: cragsoftware.com · 3 research questions · gpt-4o-mini (answer + judge)

Bench B — harder seeds + URL fallbacks

Useful scrapes 27/60 (45%) · useful seeds 3/4 (75%)

cragsoftware.com: 15/15 useful scrapes
stripe.com: 9/15 (some login redirects)
anthropic.com: 3/15 (302/404 mix)
openai.com: 0/15 (403 anti-bot)
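The headline Bench B figures can be recomputed from the per-seed rows. A minimal sketch, assuming each seed was scraped 15 times (as the rows suggest) and that a seed counts as "useful" when at least one of its scrapes is useful; that second rule is an assumption, chosen because it matches the reported 3/4:

```python
# Useful scrapes and attempts per seed, copied from the Bench B rows.
bench_b = {
    "cragsoftware.com": (15, 15),
    "stripe.com": (9, 15),
    "anthropic.com": (3, 15),
    "openai.com": (0, 15),
}

useful = sum(u for u, _ in bench_b.values())
total = sum(t for _, t in bench_b.values())

# Assumption: a seed is "useful" if at least one scrape of it was useful.
useful_seeds = sum(1 for u, _ in bench_b.values() if u > 0)

print(f"useful scrapes: {useful}/{total} ({useful / total:.0%})")
print(f"useful seeds: {useful_seeds}/{len(bench_b)} ({useful_seeds / len(bench_b):.0%})")
```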

RAG ingestion quality: ~918 words/page on average; 55/55 “useful” pages in a controlled bench run.
Operational stability: 95/95 API calls returned 200; p90 latency under 1s on the Startup plan.
Research output quality: A/B eval, CragData-grounded answers won 3/3 (avg score 9.0 vs 6.7).
Honest boundary: 403-blocked sites are detected; CragData is domain grounding, not “index the whole web.”

Full write-up with integration code, bench design, and reproduction steps →

— The real product

Fresh structured web data for AI systems.

Scraping is a commodity. The layer that matters is live, structured, citeable web intelligence—so agents and RAG stop hallucinating on stale corpora.

Datasets go stale — pricing, policies, and partners change daily.
LLMs need grounding — JSON + graphs + timestamps, not mystery text dumps.
RAG needs freshness — plan sources, crawl on demand, embed what changed.

Read the docs · Try the playground

Static dump

Exported once
No link graph
Unknown scraped_at
RAG drifts in weeks

CragData live layer

On-demand + scheduled crawl
Niche graph per seed
context_for_ai + scraped_at
Webhooks on change

— Architecture

Discover → Crawl → Extract → Structure → Deliver

Managed web intelligence pipeline—not a single-URL scraper script. Built for agents, RAG, and production data products.

01 · Discover: Expand domains and URLs from seeds.
02 · Crawl: Concurrent fetch with retries and rate limits.
03 · Extract: AI-ready JSON with content blocks, links, and metadata.
04 · Structure: Graph tables, niche scores, top pages.
05 · Deliver: REST API, webhooks, JSONL/Parquet export.

Infrastructure underneath

Job queues & orchestration
Automatic retries
Anti-bot resilience
Durable storage
Signed webhooks
Graph + scrape APIs

— Under the hood

Distributed crawling, visualized.

Try a seed domain below—interactive demo graph (no API key). Live crawl, export, and webhooks on your workspace require signup.

1.2M+ pages indexed

120k+ domains discovered

1.4M+ graph edges

~2.4k pages/min peak throughput

auto-scale queue workers

94% retry success

— How it works

Three steps. Seeds to JSON.

Add seeds, run the pipeline, consume structured data where your team already works.

01 · SEEDS

Add seeds

URLs or domains you care about.

02 · PIPELINE

Run Discover + Crawl + Extract

We map the web slice you defined.

03 · AI CONTEXT

Plan RAG with the graph

Call /graph/domain-context and /graph/top-pages so agents know which domains and URLs to read first.
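As a rough illustration of step 03, the two planning calls could be prepared like this. The host and `/v1` prefix come from elsewhere on this page; the `Bearer` auth header and the `seed` parameter on `/graph/top-pages` are assumptions, so check /docs before relying on them. The sketch only builds the requests and does not send them:

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.cragdata.com/v1"  # host and /v1 prefix as shown on this page


def graph_request(endpoint: str, seed: str, api_key: str) -> Request:
    """Build (but do not send) a GET request for a graph endpoint."""
    query = urlencode({"seed": seed})  # "seed" on top-pages is an assumption
    return Request(
        f"{BASE_URL}/graph/{endpoint}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="GET",
    )


plan = [
    graph_request(name, "ycombinator.com", "YOUR_API_KEY")
    for name in ("domain-context", "top-pages")
]
for req in plan:
    print(req.get_method(), req.full_url)
```

Passing each request to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs without a key.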

— What you get

Nine APIs, one pipeline.

Discover, crawl, extract, query, export, and monitor—bundled into plans that scale with your volume.

01 / CORE

AI Context Graph API

GET /graph/domain-context returns your niche subgraph: who links to the seed, who the seed links to, related domains with scores, and summaries agents can consume directly. Pair with /graph/top-pages and /graph/hops.

Discover API

Expand your universe of sites from a handful of seeds. Discover mode surfaces new registrable domains and queues them for crawl—ideal for market maps and competitive landscapes.

Crawl API

Crawl at scale with guardrails. Every page becomes a node; every internal link becomes an edge—ready for graph analysis or downstream extraction.

Extract API

No raw HTML dumps. Get normalized JSON designed for search, LLM pipelines, and analytics—with noise stripped (nav, scripts, footers).

Graph & Domains API

Raw page/domain graph plus stats for dashboards. Use AI Context Graph when you need ranked, agent-ready context—not just nodes and edges.

Analytics API

Monitor pipeline health and coverage from one overview—see what's discovered, crawled, and extracted.

Export API

Download your graph or scrapes as JSONL or Parquet (Developer+). Wire exports into your warehouse or notebooks.

Always-on Crawl

Set seeds once; we keep discovering, deepening, and extracting on a schedule—built for monitoring, not one-off exports.

Realtime Stream

Watch crawls live: node events, logs, and progress over a secure WebSocket—ideal for dashboards and internal tools.

— Why CragData

Built for AI systems at scale

Web intelligence infrastructure—not a hobby scraping tool.

Anti-bot resistant

Retries, rate limits, and scrapable flags so agents skip dead pages.

Distributed crawling

Queued jobs scale across workers—not one browser on a laptop.

Fresh web data

On-demand and scheduled crawls keep RAG off stale snapshots.

AI-ready output

Structured JSON and context_for_ai strings for system prompts.

Scalable infra

From 500 free calls/month to enterprise volume and SLAs.

Link graph exploration

Inbound, outbound, and related domains ranked before you embed.

— Use cases

Built for data & GTM teams.

Specific outcomes—not abstract “web data” promises.

GTM / Sales Intelligence

Map 200 competitors in 20 minutes

Seed your top 5 competitors. CragData discovers their ecosystem, extracts messaging, and builds the market map your sales team has been asking for.

SEO & Content

Bulk analyze any niche

Extract titles, headings, word counts, and internal links across 500 domains—without writing a single scraper or managing proxies.

Data / AI Teams

Clean corpora for LLM training

Structured JSON with provenance (URL, status, scraped_at). Feed embeddings and RAG pipelines directly. No HTML parsing, no noise.

Research & Journalism

Reproducible datasets with audit trail

Every page tracked with status code and timestamp. Crawls are reproducible. Pull datasets via API or export—cite it, share it, version it.

Competitive Intelligence

Always-on market monitoring

Continuous loops: discover → crawl → extract on schedule. Get alerted when new sites appear in your vertical.

— Better together

We don't replace your stack. We amplify it.

CragData is the discovery layer that sits upstream of everything else. Add it before you search, before you crawl, before you extract — and every downstream tool gets dramatically richer input.

Exa + CragData = 10× web coverage

Exa searches what's indexed. CragData discovers what isn't.

Firecrawl + CragData = Deeper discovery

Firecrawl extracts pages you give it. CragData finds the pages first.

Diffbot + CragData = Richer entities

Diffbot structures known entities. CragData surfaces the hidden ones.

Bright Data + CragData = Full intelligence

Bright Data collects at scale. CragData maps what to collect.

→ The value of CragData is not the data it returns; it's the data your agent didn't know existed.

— See the data

Clean JSON, not HTML dumps.

One call returns the niche topology around your seed: inbound/outbound domains ranked by link strength, scores, and a context_for_ai string ready for your system prompt.

// GET /v1/graph/domain-context?seed=ycombinator.com
{
  "seed_domain": "ycombinator.com",
  "context_for_ai": "Niche graph for ycombinator.com (depth 2 hops). 15 destinations, use top_outbound to plan RAG sources.",
  "seed": { "domain": "ycombinator.com", "pages_indexed": 12, "scrapable_pages": 12 },
  "top_outbound_domains": [
    { "domain": "startupschool.org", "link_count": 16, "niche_score": 1.0,   "scrapable": true  },
    { "domain": "paulgraham.com",    "link_count": 6,  "niche_score": 0.375, "scrapable": true  },
    { "domain": "news.ycombinator.com", "link_count": 9, "niche_score": 0.56, "scrapable": true }
  ],
  "top_inbound_domains": [...],
  "related_domains": [...],
  "cached": true
}
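One way a client might turn this response into a source plan is to keep only scrapable outbound domains above a score threshold and sort them. This is a sketch over a trimmed copy of the sample; the `min_score` cutoff and the `rag_sources` helper are hypothetical, not part of the API:

```python
import json

# Trimmed copy of the sample response (elided arrays left out).
sample = json.loads("""
{
  "seed_domain": "ycombinator.com",
  "top_outbound_domains": [
    {"domain": "startupschool.org", "link_count": 16, "niche_score": 1.0, "scrapable": true},
    {"domain": "paulgraham.com", "link_count": 6, "niche_score": 0.375, "scrapable": true},
    {"domain": "news.ycombinator.com", "link_count": 9, "niche_score": 0.56, "scrapable": true}
  ]
}
""")


def rag_sources(response: dict, min_score: float = 0.5) -> list[str]:
    """Return scrapable outbound domains at or above min_score, best first."""
    candidates = [
        d for d in response.get("top_outbound_domains", [])
        if d.get("scrapable") and d.get("niche_score", 0.0) >= min_score
    ]
    # Sort client-side rather than relying on the order of the array.
    candidates.sort(key=lambda d: d["niche_score"], reverse=True)
    return [d["domain"] for d in candidates]


print(rag_sources(sample))
```

With the default threshold, paulgraham.com (0.375) is dropped and the other two domains are returned in score order.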

— Pricing

Simple plans, bundled credits.

Free for PoCs, Developer from $10/mo, Startup for production pilots, Enterprise for volume and compliance.

Free

Try the full pipeline on a small slice of the web. Perfect for evaluating data quality before you wire up production.

500 API calls / month
1 API key
Discover (up to 25 domains per run)
Dashboard access
7-day data retention
Community support

Create free account

Developer

$10/mo

For serious solo devs and small projects that need real volume—not a $99 commitment.

10,000 API calls / month
3 API keys — dev · staging · prod
Full pipeline: crawl, scrape, discover
Discover up to 200 domains / run
30-day data retention
Email support

Get Developer

Startup

$99/mo

Production volume, webhooks, schedules, and higher limits. For teams shipping a feature, not a science project.

50k API calls / month
5 API keys
Webhooks & scheduled jobs
JSONL / Parquet export
90-day retention
Email support

Get Startup

Enterprise

Custom

Dedicated infrastructure, custom limits, security review, SLAs, and solution design.

Custom volume & dedicated VPS
Custom export & integrations
SSO & audit logs (roadmap)
Custom robots/compliance policies
99.9% SLA option
Named customer success

Book a demo

— FAQ

Questions, answered.

What is CragData?

CragData is web intelligence infrastructure: discover domains, crawl pages, extract structured JSON, and explore link graphs—built for AI agents, RAG, and data products.

What is web intelligence?

Web intelligence is the practice of turning the live web into structured, queryable data—graphs, entities, and documents—rather than one-off HTML snapshots.

What is AI-ready data?

JSON designed for models: clean content blocks, metadata, link graphs, and plain-English summaries like context_for_ai that drop into system prompts.

How should AI agents use CragData?

Call GET /graph/domain-context before broad search, pick the pages to read with GET /graph/top-pages, scrape those URLs, then embed. Patterns are in /docs and /llms.txt.
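The order of calls can be sketched with the transport injected, so the flow is visible without a network connection. The response shape of `/graph/top-pages` (`{"pages": [{"url": ...}]}`), the `url` parameter on `/scrape`, and the helper names are all assumptions for illustration:

```python
from typing import Callable


def plan_and_collect(
    seed: str,
    fetch_json: Callable[[str, dict], dict],
    max_pages: int = 3,
) -> list[dict]:
    """Graph first, then top pages, then scrape; returns documents ready to embed."""
    context = fetch_json("/graph/domain-context", {"seed": seed})
    top = fetch_json("/graph/top-pages", {"seed": seed})
    # Assumed response shape: {"pages": [{"url": ...}, ...]}
    urls = [p["url"] for p in top.get("pages", [])][:max_pages]
    docs = []
    for url in urls:
        page = fetch_json("/scrape", {"url": url})
        docs.append({
            "url": url,
            "context": context.get("context_for_ai", ""),
            "page": page,
        })
    return docs


# Stand-in transport so the sketch runs offline; swap in a real HTTP client.
def fake_fetch(path: str, params: dict) -> dict:
    if path == "/graph/domain-context":
        return {"context_for_ai": f"Niche graph for {params['seed']}"}
    if path == "/graph/top-pages":
        return {"pages": [{"url": f"https://{params['seed']}/{n}"} for n in ("a", "b")]}
    return {"title": params["url"], "content": []}


docs = plan_and_collect("example.com", fake_fetch)
print([d["url"] for d in docs])
```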

What is live retrieval for AI agents?

Fetching fresh web data at query time instead of relying on frozen training corpora—so answers reflect current pages, pricing, and ecosystem links.

What is RAG web crawling?

Crawling only the URLs that matter for a user question, after planning sources with a graph—so embeddings stay relevant and token spend stays low.

How does distributed crawling work?

Jobs are queued, workers run concurrent fetches with retries and rate limits, and results land in graph tables and JSON exports you pull via API.

How do anti-bot systems work?

Sites use fingerprints, rate limits, and challenges. We rotate strategies, respect robots.txt, back off on failures, and surface scrapable flags in graph responses.

Why do AI agents need live web crawling?

Models trained on static snapshots cannot see today's pricing, partners, or news. Live crawl + extract keeps agent answers grounded in current pages.

Are static datasets dying for RAG?

For production agents, yes—freshness wins. Snapshots are fine for eval; customer-facing RAG needs scheduled or on-demand live ingestion.

How is CragData different from a scraper library?

We run discovery, queues, crawling, extraction, storage, and APIs as managed infrastructure—not a single-URL fetch you host yourself.

Do you crawl the entire internet?

No. You provide seeds; we discover and crawl within your plan limits and configuration.

What format is the data?

JSON per page (title, content[], links[], metadata) plus graph endpoints for domains, top pages, and hops.
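A per-page record with those four top-level fields might be modeled and chunked for embedding as below. Only the field names `title`, `content`, `links`, and `metadata` come from the answer above; the inner shape of a content block (a dict with a `text` key) and the `to_chunks` helper are assumptions:

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    content: list[dict] = field(default_factory=list)  # assumed: blocks like {"type": ..., "text": ...}
    links: list[str] = field(default_factory=list)     # assumed: plain URL strings
    metadata: dict = field(default_factory=dict)


def to_chunks(page: Page, max_words: int = 50) -> list[str]:
    """Pack consecutive blocks into chunks of roughly max_words words.

    A single block longer than max_words is kept whole rather than split.
    """
    chunks, current, count = [], [], 0
    for block in page.content:
        words = block.get("text", "").split()
        if not words:
            continue  # skip empty blocks
        if current and count + len(words) > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.extend(words)
        count += len(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


page = Page(
    title="Pricing",
    content=[
        {"type": "heading", "text": "Pricing plans"},
        {"type": "paragraph", "text": "Developer tier costs ten dollars per month"},
        {"type": "paragraph", "text": ""},
    ],
    metadata={"url": "https://example.com/pricing"},
)
chunks = to_chunks(page, max_words=5)
print(chunks)
```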

Can I use my own database?

Yes. Pull via REST, export JSONL or Parquet on Developer+, or push events with webhooks into your warehouse.
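Loading a JSONL export into your own store usually starts with a line-by-line reader. A minimal sketch, assuming each line is one page record carrying the provenance fields mentioned on this page (`url`, `status`, `scraped_at`); the sample rows are made up:

```python
import io
import json
from typing import Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of a JSONL export."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for an exported file; open("export.jsonl") works the same way.
export = io.StringIO(
    '{"url": "https://example.com/", "status": 200, "scraped_at": "2025-01-01T00:00:00Z"}\n'
    '{"url": "https://example.com/login", "status": 302, "scraped_at": "2025-01-01T00:00:05Z"}\n'
)
ok_rows = [row["url"] for row in read_jsonl(export) if row["status"] == 200]
print(ok_rows)
```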

Do you offer webhooks?

Yes — crawl.completed, discover.completed, page.extracted. Configure HTTPS URLs in Dashboard → Webhooks; payloads are HMAC-signed.
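On the receiving side, an HMAC-signed payload is typically checked by recomputing the digest over the raw request body. The sketch below assumes HMAC-SHA256 with a hex-encoded signature; the hash algorithm, the encoding, the header that carries the signature, and the payload fields are not stated here, so confirm them in the docs:

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Compare the received signature with an HMAC-SHA256 of the raw body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_hex)


secret = b"whsec_example"  # hypothetical webhook secret
body = b'{"event": "crawl.completed", "job_id": "job_123"}'  # hypothetical payload
good = hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, good))
print(verify_signature(secret, body + b" ", good))
```

Verify against the exact bytes received, before any JSON parsing, since re-serializing the payload can change whitespace and break the digest.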

Is there an official SDK?

Python and Node clients live under packages/cragdata-python and packages/cragdata-js in our GitHub repo.

What is a niche graph?

A ranked view of who links to a seed domain, where the seed links out, and related clusters—with niche_score and scrapable flags for agents.

How fast can I integrate?

Most teams send a first /scrape or /graph/domain-context call within 15 minutes using the quickstart at /docs.

What are your rate limits?

Plan-based requests per second; see /docs#errors or GET /me for live caps. Headers include X-Credits-Remaining and X-RateLimit-Limit.
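A client can read the two headers named above after each response to decide when to slow down. This sketch assumes both carry plain integers, which is not confirmed here; `read_limits` is a hypothetical helper:

```python
from typing import Mapping, Optional


def read_limits(headers: Mapping[str, str]) -> dict:
    """Pull the two documented headers out of a response, case-insensitively."""
    lowered = {k.lower(): v for k, v in headers.items()}

    def as_int(name: str) -> Optional[int]:
        value = lowered.get(name.lower())
        # Assumption: both headers carry plain non-negative integers.
        return int(value) if value is not None and value.strip().isdigit() else None

    return {
        "credits_remaining": as_int("X-Credits-Remaining"),
        "rate_limit": as_int("X-RateLimit-Limit"),
    }


limits = read_limits({"x-credits-remaining": "9875", "X-RateLimit-Limit": "5"})
print(limits)
```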

Do you respect robots.txt?

Yes. You are responsible for lawful use; we encourage robots compliance and reasonable crawl rates.

Is this legal?

You are responsible for how you use data. Enterprise plans include compliance discussions for sensitive use cases.

Can I monitor competitors?

Yes—seed competitor domains, schedule crawls, and diff structured JSON over time. See /use-cases/competitor-monitoring.
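Diffing two crawl runs can be as simple as comparing page records keyed by URL. A minimal sketch, assuming you have stored each run as a `{url: page_json}` mapping; the compared fields and the sample data are illustrative only:

```python
def diff_pages(old: dict, new: dict, fields=("title", "content")) -> dict:
    """Compare two {url: page_json} snapshots from different crawl runs."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(
        url for url in old.keys() & new.keys()
        if any(old[url].get(f) != new[url].get(f) for f in fields)
    )
    return {"added": added, "removed": removed, "changed": changed}


last_week = {
    "https://example.com/pricing": {"title": "Pricing", "content": ["Pro: $49/mo"]},
    "https://example.com/partners": {"title": "Partners", "content": ["Acme"]},
}
today = {
    "https://example.com/pricing": {"title": "Pricing", "content": ["Pro: $59/mo"]},
    "https://example.com/changelog": {"title": "Changelog", "content": ["v2 released"]},
}
changes = diff_pages(last_week, today)
print(changes)
```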

How does crawl orchestration with queues work?

POST /crawl or /discover returns a job_id; workers process the queue; you poll GET /crawl/{job_id} or listen on webhooks.
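The polling half of that flow might look like the loop below. The `status` field and its `completed` / `failed` values are assumptions about the job response, and `get_status` stands in for whatever performs `GET /crawl/{job_id}` in your client:

```python
import time
from typing import Callable


def wait_for_job(
    job_id: str,
    get_status: Callable[[str], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        job = get_status(job_id)  # stands in for GET /crawl/{job_id}
        # Assumed field and values; check the real response schema.
        if job.get("status") in {"completed", "failed"}:
            return job
        if waited >= timeout:
            raise TimeoutError(f"job {job_id} still {job.get('status')!r} after {waited:.0f}s")
        sleep(interval)
        waited += interval


# Offline stand-in that finishes on the third poll.
responses = iter([
    {"status": "queued"},
    {"status": "running"},
    {"status": "completed", "pages": 12},
])
result = wait_for_job("job_123", lambda _id: next(responses), sleep=lambda _s: None)
print(result)
```

For long-running crawls, a webhook listener avoids the polling loop entirely.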

What is structured extraction?

HTML is normalized to JSON blocks with nav/scripts stripped—ready for search indexes, analytics, or embedding pipelines.

CragData vs Firecrawl?

Firecrawl is strong for page-to-markdown. CragData adds niche graphs and agent-first source planning. See /compare/cragdata-vs-firecrawl.

CragData vs Apify?

Apify is an actor marketplace. CragData is an opinionated intelligence API with graphs and managed pipelines. See /compare/cragdata-vs-apify.

What uptime do you target?

Production API at api.cragdata.com with health at /v1/health. Enterprise SLAs available; status practices documented for ops teams.

Do you support always-on crawls?

Yes—Dashboard → Schedules runs recurring discover/crawl jobs for monitoring use cases.

What is context_for_ai?

A plain-English summary in graph responses describing the niche topology—designed to paste into an agent system prompt before retrieval.

Who is CragData built for?

Teams shipping AI agents, RAG products, GTM enrichment, SEO research, and market intelligence who need live structured web data at scale.

Coming soon

Interactive API playground
Public status page with incident history
Zapier / Make connectors

— Get started

Turn websites into structured data.

Start free, upgrade to Developer for $10/mo, or talk to us about Enterprise volume and compliance.

No credit card for Free tier

Developer tier from $10/mo

Export JSONL / graph on paid plans

No scraper maintenance, ever

Start free, no credit card · Get Developer, $10/mo · Talk to sales (Enterprise)

— Get in touch

Let's build your data pipeline.

Questions about plans, volume, or Enterprise? We respond within 24 hours.

Privacy

You control seeds

We process URLs you submit. You control retention and deletion.

Export

JSONL / API

Developer+ export and full REST access

Free tier

500 API calls / mo

Evaluate quality before production