Building RAG With Live Web Data
A concrete pipeline for RAG with niche graphs, top pages, structured scrape JSON, and embeddings.
Building RAG with live web data
This is the pipeline we recommend for teams shipping retrieval that respects token budgets and freshness.
Step 1 — Plan sources with a graph
curl "https://api.cragdata.com/v1/graph/domain-context?seed=indiehackers.com" \
-H "Authorization: Bearer $CRAGDATA_API_KEY"
Read context_for_ai and ranked top_inbound_domains / top_outbound_domains.
Step 2 — Pick pages inside the best domain
curl "https://api.cragdata.com/v1/graph/top-pages?domain=indiehackers.com&limit=10" \
-H "Authorization: Bearer $CRAGDATA_API_KEY"
Step 3 — Scrape structured JSON
curl -X POST https://api.cragdata.com/v1/scrape \
-H "Authorization: Bearer $CRAGDATA_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://indiehackers.com/post/example"}'
Step 4 — Chunk and embed
Use content[] blocks as chunks. Store URL, title, and fetch time for citations in the final answer.
Operations
- Schedule recurring crawls for monitoring
- Use webhooks instead of polling when possible
- Export JSONL to your warehouse on Developer+
Full reference: documentation and llms.txt.