2025-10-06T15:30:00.000Z

How to use Common Crawl to find your website

Common Crawl archives billions of web pages and makes them freely available. Here's how to check if your site is indexed and extract the content.

Check if your site is indexed

curl -s "http://index.commoncrawl.org/CC-MAIN-2024-42-index?url=kevinsimper.dk&output=json" | head -1

This returns JSON with your page details. Count how many pages are indexed:

curl -s "http://index.commoncrawl.org/CC-MAIN-2024-42-index?url=kevinsimper.dk/*&output=json" | wc -l
# 82 pages from my site

Two ways to get content

Method 1: WARC files (precise access)

WARC files contain full HTML. The index tells you exactly where your content is:

# Get location info
curl -s "http://index.commoncrawl.org/CC-MAIN-2024-42-index?url=kevinsimper.dk&output=json" | \
  jq -r 'select(.status=="200") | "offset: \(.offset), length: \(.length), file: \(.filename)"'
# offset: 287741401, length: 16652, file: crawl-data/.../warc/CC-MAIN-...warc.gz

# Fetch exact bytes
curl -H "Range: bytes=287741401-287758052" \
  "https://data.commoncrawl.org/crawl-data/CC-MAIN-2024-42/segments/1727944253525.17/warc/CC-MAIN-20241008044807-20241008074807-00461.warc.gz" | \
  gunzip

Method 2: WET files (text only, but no index)

WET files contain plain text from ALL websites in that segment. There's no index for WET files, so you download the entire 82MB file:

# Convert WARC path to WET path (warc→wet, .warc.gz→.warc.wet.gz)
WET_FILE="crawl-data/CC-MAIN-2024-42/segments/1727944253525.17/wet/CC-MAIN-20241008044807-20241008074807-00461.warc.wet.gz"

# Download entire file and search
curl -s "https://data.commoncrawl.org/$WET_FILE" | \
  gunzip | grep -A50 "WARC-Target-URI: https://kevinsimper.dk"

The key difference

WARC: Has an index with byte offsets → download only what you need (17KB)
WET: No index → download entire segment file containing 27,000+ sites (82MB)

Use WARC when you need specific pages. Use WET when you're analyzing many sites at once.

Finding the latest crawl

New crawls monthly at https://data.commoncrawl.org/

CC-MAIN-2025-38 (September 2025)
CC-MAIN-2025-33 (August 2025)

My site had 82 pages in the October crawl. Pretty good coverage!

KevinSimper