Crawl submits a URL to Firecrawl and recursively discovers and scrapes every reachable subpage. It handles sitemaps, JavaScript rendering, and rate limits automatically, returning clean markdown or structured data for each page.
  • Discovers pages via sitemap and recursive link traversal
  • Supports path filtering, depth limits, and subdomain/external link control
  • Returns results via polling, WebSocket, or webhook

Try it in the Playground

Test crawling in the interactive playground — no code required.

Installation

# pip install firecrawl-py

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

Basic usage

Submit a crawl job by calling POST /v2/crawl with a starting URL. The endpoint returns a job ID that you use to poll for results; the SDK's crawl method handles both the submission and the polling for you.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

docs = firecrawl.crawl(url="https://docs.firecrawl.dev", limit=10)
print(docs)
Each page crawled consumes 1 credit. The default crawl limit is 10,000 pages. Set a lower limit to control credit usage (e.g. limit: 100). Additional credits apply for certain options: JSON mode costs 4 additional credits per page, enhanced proxy costs 4 additional credits per page, and PDF parsing costs 1 credit per PDF page.
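The pricing above is simple arithmetic, so it can be estimated up front. The helper below is ours (not part of the SDK) and just encodes the rates stated here:

```python
def estimate_credits(pages, json_mode=False, enhanced_proxy=False, pdf_pages=0):
    """Estimate credit usage for a crawl, per the pricing above.

    1 credit per page, +4 per page for JSON mode, +4 per page for
    enhanced proxy, and 1 credit per PDF page parsed.
    """
    per_page = 1 + (4 if json_mode else 0) + (4 if enhanced_proxy else 0)
    return pages * per_page + pdf_pages

# A 100-page crawl with JSON mode enabled: 100 * (1 + 4) = 500 credits
print(estimate_credits(100, json_mode=True))  # 500
```

Running the numbers before setting limit makes it easy to cap a crawl at a known credit budget.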

Scrape options

All options from the Scrape endpoint are available in crawl via scrapeOptions (JS) / scrape_options (Python). These apply to every page the crawler scrapes, including formats, proxy, caching, actions, location, and tags.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key='fc-YOUR_API_KEY')

# Crawl with scrape options
response = firecrawl.crawl('https://example.com',
    limit=100,
    scrape_options={
        'formats': [
            'markdown',
            { 'type': 'json', 'schema': { 'type': 'object', 'properties': { 'title': { 'type': 'string' } } } }
        ],
        'proxy': 'auto',
        'max_age': 600000,
        'only_main_content': True
    }
)

Checking crawl status

Use the job ID to poll for the crawl status and retrieve results.
status = firecrawl.get_crawl_status("<crawl-id>")
print(status)
Job results are available via the API for 24 hours after completion. After this period, you can still view your crawl history and results in the activity logs.
Pages in the crawl results data array are pages that Firecrawl successfully scraped, even if the target site returned an HTTP error like 404. The metadata.statusCode field shows the HTTP status code from the target site. To retrieve pages that Firecrawl itself failed to scrape (e.g. network errors, timeouts, or robots.txt blocks), use the dedicated Get Crawl Errors endpoint (GET /crawl/{id}/errors).
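If you hit the errors endpoint directly rather than through the SDK, the request is a plain authenticated GET. This sketch only builds the request with the standard library (the job ID and key are placeholders, and the /v2 prefix is assumed to mirror the crawl routes shown elsewhere on this page):

```python
import urllib.request

# Placeholder values -- substitute your own job ID and API key
crawl_id = "123-456-789"
api_key = "fc-YOUR-API-KEY"

# Assumed to follow the v2 route shape used by the other crawl endpoints
req = urllib.request.Request(
    f"https://api.firecrawl.dev/v2/crawl/{crawl_id}/errors",
    headers={"Authorization": f"Bearer {api_key}"},
)
# errors = json.load(urllib.request.urlopen(req))  # sends the GET
print(req.full_url)
```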

Response handling

The response varies based on the crawl’s status. While the crawl is in progress, or when the results exceed 10MB, the response includes a next URL. Request that URL to retrieve the next chunk of data; when next is absent, you have reached the end of the crawl data.
The skip and next parameters are only relevant when hitting the API directly. If you’re using the SDK, pagination is handled automatically and all results are returned at once.
{
  "status": "scraping",
  "total": 36,
  "completed": 10,
  "creditsUsed": 10,
  "expiresAt": "2024-01-01T00:00:00.000Z",
  "next": "https://api.firecrawl.dev/v2/crawl/123-456-789?skip=10",
  "data": [
    {
      "markdown": "[Firecrawl Docs home page![light logo](https://mintlify.s3-us-west-1.amazonaws.com/firecrawl/logo/light.svg)!...",
      "html": "<!DOCTYPE html><html lang=\"en\" class=\"js-focus-visible lg:[--scroll-mt:9.5rem]\" data-js-focus-visible=\"\">...",
      "metadata": {
        "title": "Build a 'Chat with website' using Groq Llama 3 | Firecrawl",
        "language": "en",
        "sourceURL": "https://docs.firecrawl.dev/learn/rag-llama3",
        "description": "Learn how to use Firecrawl, Groq Llama 3, and Langchain to build a 'Chat with your website' bot.",
        "ogLocaleAlternate": [],
        "statusCode": 200
      }
    },
    ...
  ]
}
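When calling the API directly, following the next links is a simple loop. In this sketch a fetch callable stands in for the real HTTP GET (in practice, a request carrying your Authorization header), so the pagination logic itself can be seen in isolation:

```python
def collect_all_pages(first_url, fetch):
    """Follow `next` links until the crawl data is exhausted.

    `fetch` is any callable that GETs a URL and returns the parsed
    JSON body; here a dict lookup stands in for the network call.
    """
    documents = []
    url = first_url
    while url:
        body = fetch(url)
        documents.extend(body.get("data", []))
        url = body.get("next")  # absent on the final chunk
    return documents

# Stand-in responses mimicking two chunks of crawl data
pages = {
    "https://api.firecrawl.dev/v2/crawl/123?skip=0": {
        "data": [{"id": 1}],
        "next": "https://api.firecrawl.dev/v2/crawl/123?skip=10",
    },
    "https://api.firecrawl.dev/v2/crawl/123?skip=10": {"data": [{"id": 2}]},
}
docs = collect_all_pages("https://api.firecrawl.dev/v2/crawl/123?skip=0", pages.get)
print(len(docs))  # 2
```

Again, the SDKs do this for you; the loop is only needed for raw API integrations.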

SDK methods

There are two ways to use crawl with the SDK.

Crawl and wait

The crawl method waits for the crawl to complete and returns the full response. It handles pagination automatically. This is recommended for most use cases.
from firecrawl import Firecrawl
from firecrawl.types import ScrapeOptions

firecrawl = Firecrawl(api_key="fc-YOUR_API_KEY")

# Crawl a website:
crawl_status = firecrawl.crawl(
  'https://firecrawl.dev', 
  limit=100, 
  scrape_options=ScrapeOptions(formats=['markdown', 'html']),
  poll_interval=30
)
print(crawl_status)
The response includes the crawl status and all scraped data:
success=True
status='completed'
completed=100
total=100
creditsUsed=100
expiresAt=datetime.datetime(2025, 4, 23, 19, 21, 17, tzinfo=TzInfo(UTC))
next=None
data=[
  Document(
    markdown='[Day 7 - Launch Week III.Integrations DayApril 14th to 20th](...',
    metadata={
      'title': '15 Python Web Scraping Projects: From Beginner to Advanced',
      ...
      'scrapeId': '97dcf796-c09b-43c9-b4f7-868a7a5af722',
      'sourceURL': 'https://www.firecrawl.dev/blog/python-web-scraping-projects',
      'url': 'https://www.firecrawl.dev/blog/python-web-scraping-projects',
      'statusCode': 200
    }
  ),
  ...
]

Start and check later

The startCrawl / start_crawl method returns immediately with a crawl ID. You then poll for status manually. This is useful for long-running crawls or custom polling logic.
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

job = firecrawl.start_crawl(url="https://docs.firecrawl.dev", limit=10)
print(job)

# Check the status of the crawl
status = firecrawl.get_crawl_status(job.id)
print(status)
The initial response returns the job ID:
{
  "success": true,
  "id": "123-456-789",
  "url": "https://api.firecrawl.dev/v2/crawl/123-456-789"
}
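A manual polling loop on top of start_crawl can be as small as the sketch below. The interval, retry cap, and set of terminal statuses are our assumptions, not SDK requirements; adjust them to your workload:

```python
import time

def wait_for_crawl(get_status, interval=5, max_polls=120):
    """Poll until the crawl reaches a terminal state.

    `get_status` is any zero-argument callable returning an object with
    a `.status` attribute, e.g.:
        wait_for_crawl(lambda: firecrawl.get_crawl_status(job.id))
    """
    for _ in range(max_polls):
        status = get_status()
        # Terminal statuses assumed here; "completed" matches the docs above
        if status.status in ("completed", "failed", "cancelled"):
            return status
        time.sleep(interval)
    raise TimeoutError("crawl did not finish in time")
```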

Real-time results with WebSocket

The watcher method provides real-time updates as pages are crawled. Start a crawl, then subscribe to events for immediate data processing.
import asyncio
from firecrawl import AsyncFirecrawl

async def main():
    firecrawl = AsyncFirecrawl(api_key="fc-YOUR-API-KEY")

    # Start a crawl first
    started = await firecrawl.start_crawl("https://firecrawl.dev", limit=5)

    # Watch updates (snapshots) until terminal status
    async for snapshot in firecrawl.watcher(started.id, kind="crawl", poll_interval=2, timeout=120):
        if snapshot.status == "completed":
            print("DONE", snapshot.status)
            for doc in snapshot.data:
                print("DOC", doc.metadata.source_url if doc.metadata else None)
        elif snapshot.status == "failed":
            print("ERR", snapshot.status)
        else:
            print("STATUS", snapshot.status, snapshot.completed, "/", snapshot.total)

asyncio.run(main())

Webhooks

You can configure webhooks to receive real-time notifications as your crawl progresses. This allows you to process pages as they are scraped instead of waiting for the entire crawl to complete.
cURL
curl -X POST https://api.firecrawl.dev/v2/crawl \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev",
      "limit": 100,
      "webhook": {
        "url": "https://your-domain.com/webhook",
        "metadata": {
          "any_key": "any_value"
        },
        "events": ["started", "page", "completed"]
      }
    }'

Event types

| Event | Description |
| --- | --- |
| crawl.started | Fires when the crawl begins |
| crawl.page | Fires for each page successfully scraped |
| crawl.completed | Fires when the crawl finishes |
| crawl.failed | Fires if the crawl encounters an error |

Payload

{
  "success": true,
  "type": "crawl.page",
  "id": "crawl-job-id",
  "data": [...], // Page data for 'page' events
  "metadata": {}, // Your custom metadata
  "error": null
}
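A webhook receiver typically branches on the type field of this payload. The handler below is a minimal dispatch sketch over the shape shown above (the return values are our choices for illustration):

```python
def handle_webhook(payload):
    """Route a Firecrawl webhook payload by its event type."""
    event = payload["type"]
    if event == "crawl.page":
        # `data` carries the scraped page(s) for page events
        return [doc.get("metadata", {}).get("sourceURL") for doc in payload["data"]]
    if event == "crawl.completed":
        return "done"
    if event == "crawl.failed":
        return payload.get("error")
    return None  # crawl.started and anything unrecognized

print(handle_webhook({"type": "crawl.completed", "data": []}))  # done
```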

Verifying webhook signatures

Every webhook request from Firecrawl includes an X-Firecrawl-Signature header containing an HMAC-SHA256 signature. Always verify this signature to ensure the webhook is authentic and has not been tampered with.
  1. Get your webhook secret from the Advanced tab of your account settings
  2. Extract the signature from the X-Firecrawl-Signature header
  3. Compute HMAC-SHA256 of the raw request body using your secret
  4. Compare with the signature header using a timing-safe function
Never process a webhook without verifying its signature first. The X-Firecrawl-Signature header contains the signature in the format: sha256=abc123def456...
For complete implementation examples in JavaScript and Python, see the Webhook Security documentation. For comprehensive webhook documentation including detailed event payloads, payload structure, advanced configuration, and troubleshooting, see the Webhooks documentation.
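The four verification steps map directly onto Python's standard library. The secret and body below are placeholders; in a real handler the body must be the raw request bytes, before any JSON parsing:

```python
import hashlib
import hmac

def verify_signature(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Check an X-Firecrawl-Signature header against the raw request body.

    The header format is 'sha256=<hex digest>'; comparison uses
    hmac.compare_digest to stay timing-safe.
    """
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature_header, f"sha256={expected}")

# Placeholder secret and body for illustration
body = b'{"type": "crawl.page"}'
good = "sha256=" + hmac.new(b"my-secret", body, hashlib.sha256).hexdigest()
print(verify_signature(body, good, "my-secret"))   # True
print(verify_signature(body, "sha256=deadbeef", "my-secret"))  # False
```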

Configuration reference

The full set of parameters available when submitting a crawl job:
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | (required) | The starting URL to crawl from |
| limit | integer | 10000 | Maximum number of pages to crawl |
| maxDiscoveryDepth | integer | (none) | Maximum depth from the root URL based on link-discovery hops, not the number of / segments in the URL. Each time a new URL is found on a page, it is assigned a depth one higher than the page it was discovered on. The root site and sitemapped pages have a discovery depth of 0. Pages at the max depth are still scraped, but links on them are not followed. |
| includePaths | string[] | (none) | URL pathname regex patterns to include. Only matching paths are crawled. |
| excludePaths | string[] | (none) | URL pathname regex patterns to exclude from the crawl |
| regexOnFullURL | boolean | false | Match includePaths/excludePaths against the full URL (including query parameters) instead of just the pathname |
| crawlEntireDomain | boolean | false | Follow internal links to sibling or parent URLs, not just child paths |
| allowSubdomains | boolean | false | Follow links to subdomains of the main domain |
| allowExternalLinks | boolean | false | Follow links to external websites |
| sitemap | string | "include" | Sitemap handling: "include" (default), "skip", or "only" |
| ignoreQueryParameters | boolean | false | Avoid re-scraping the same path with different query parameters |
| delay | number | (none) | Delay in seconds between scrapes to respect rate limits |
| maxConcurrency | integer | (none) | Maximum concurrent scrapes. Defaults to your team’s concurrency limit. |
| scrapeOptions | object | (none) | Options applied to every scraped page (formats, proxy, caching, actions, etc.) |
| webhook | object | (none) | Webhook configuration for real-time notifications |
| prompt | string | (none) | Natural language prompt to generate crawl options. Explicitly set parameters override generated equivalents. |
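The includePaths/excludePaths patterns apply to the URL pathname by default (and to the full URL only with regexOnFullURL). The sketch below mimics that filtering locally so you can preview which URLs a pattern set would keep; the helper is ours, and the exact matching semantics (e.g. regex anchoring) may differ from the crawler's internals:

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include=None, exclude=None):
    """Mimic includePaths/excludePaths filtering on a URL's pathname."""
    path = urlparse(url).path
    if include and not any(re.search(p, path) for p in include):
        return False
    if exclude and any(re.search(p, path) for p in exclude):
        return False
    return True

urls = [
    "https://website.com/blog/post-1",
    "https://website.com/blog/drafts/wip",
    "https://website.com/about",
]
kept = [u for u in urls if path_allowed(u, include=[r"^/blog"], exclude=[r"/drafts/"])]
print(kept)  # ['https://website.com/blog/post-1']
```

Previewing patterns this way is cheaper than burning credits on a misconfigured crawl.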

Important details

By default, crawl ignores sublinks that are not children of the URL you provide. For example, website.com/other-parent/blog-1 would not be returned if you crawled website.com/blogs/. Use the crawlEntireDomain parameter to include sibling and parent paths. To crawl subdomains like blog.website.com when crawling website.com, use the allowSubdomains parameter.
  • Sitemap discovery: By default, the crawler includes the website’s sitemap to discover URLs (sitemap: "include"). If you set sitemap: "skip", only pages reachable through HTML links from the root URL are found. Assets like PDFs or deeply nested pages listed in the sitemap but not directly linked from HTML will be missed. For maximum coverage, keep the default setting.
  • Credit usage: Each page crawled costs 1 credit. JSON mode adds 4 credits per page, enhanced proxy adds 4 credits per page, and PDF parsing costs 1 credit per PDF page.
  • Result expiration: Job results are available via the API for 24 hours after completion. After that, view results in the activity logs.
  • Crawl errors: The data array contains pages Firecrawl successfully scraped. Use the Get Crawl Errors endpoint to retrieve pages that failed due to network errors, timeouts, or robots.txt blocks.
  • Non-deterministic results: Crawl results may vary between runs of the same configuration. Pages are scraped concurrently, so the order in which links are discovered depends on network timing and which pages finish loading first. This means different branches of a site may be explored to different extents near the depth boundary, especially at higher maxDiscoveryDepth values. To get more deterministic results, set maxConcurrency to 1 or use sitemap: "only" if the site has a comprehensive sitemap.