- Discovers pages via sitemap and recursive link traversal
- Supports path filtering, depth limits, and subdomain/external link control
- Returns results via polling, WebSocket, or webhook
Try it in the Playground
Test crawling in the interactive playground — no code required.
Installation
Basic usage
Submit a crawl job by calling POST /v2/crawl with a starting URL. The endpoint returns a job ID that you use to poll for results.
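A minimal submission sketch using the standard library plus an HTTP client. The base URL, the placeholder API key, and the response field names are assumptions for illustration — check the API reference for your account:

```python
import json

def build_crawl_request(start_url, limit=100):
    """Return (url, headers, body) for a crawl submission.

    The endpoint URL and auth scheme below are assumed from this page;
    the API key is a placeholder."""
    api_url = "https://api.firecrawl.dev/v2/crawl"
    headers = {
        "Authorization": "Bearer fc-YOUR-API-KEY",
        "Content-Type": "application/json",
    }
    body = json.dumps({"url": start_url, "limit": limit})
    return api_url, headers, body

# Submitting with requests (network call, not executed here):
# import requests
# url, headers, body = build_crawl_request("https://example.com", limit=100)
# resp = requests.post(url, headers=headers, data=body)
# job_id = resp.json()["id"]  # assumed response field
```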
Each page crawled consumes 1 credit. The default crawl limit is 10,000 pages. Set a lower limit to control credit usage (e.g. limit: 100). Additional credits apply for certain options: JSON mode costs 4 additional credits per page, enhanced proxy costs 4 additional credits per page, and PDF parsing costs 1 credit per PDF page.

Scrape options
All options from the Scrape endpoint are available in crawl via scrapeOptions (JS) / scrape_options (Python). These apply to every page the crawler scrapes, including formats, proxy, caching, actions, location, and tags.
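The credit pricing described above is easy to estimate up front. A small sketch — the helper function is ours, not part of any SDK:

```python
def crawl_credit_estimate(pages, json_mode=False, enhanced_proxy=False, pdf_pages=0):
    """Estimate credits using the pricing stated above: 1 credit per page,
    +4 per page for JSON mode, +4 per page for enhanced proxy, and
    1 credit per parsed PDF page."""
    per_page = 1 + (4 if json_mode else 0) + (4 if enhanced_proxy else 0)
    return pages * per_page + pdf_pages

# A 100-page crawl with JSON mode: 100 * (1 + 4) = 500 credits
print(crawl_credit_estimate(100, json_mode=True))  # → 500
```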
Checking crawl status
Use the job ID to poll for the crawl status and retrieve results. Job results are available via the API for 24 hours after completion. After this period, you can still view your crawl history and results in the activity logs.
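A polling sketch. The terminal status values are assumptions, and the fetch function is injected so the example stays transport-agnostic — in practice it would GET the crawl status endpoint with your API key:

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}  # assumed values

def is_done(status):
    """True once the job has reached a terminal state."""
    return status in TERMINAL_STATUSES

def poll_crawl(fetch_status, job_id, interval=2.0, max_wait=600.0):
    """Poll fetch_status(job_id) -> dict until the job finishes or
    max_wait seconds have elapsed."""
    waited = 0.0
    while waited < max_wait:
        state = fetch_status(job_id)
        if is_done(state.get("status")):
            return state
        time.sleep(interval)
        waited += interval
    raise TimeoutError(f"crawl {job_id} still running after {max_wait}s")
```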
Pages in the crawl results data array are pages that Firecrawl successfully scraped, even if the target site returned an HTTP error like 404. The metadata.statusCode field shows the HTTP status code from the target site. To retrieve pages that Firecrawl itself failed to scrape (e.g. network errors, timeouts, or robots.txt blocks), use the dedicated Get Crawl Errors endpoint (GET /crawl/{id}/errors).

Response handling
The response varies based on the crawl’s status. For incomplete or large responses exceeding 10MB, a next URL parameter is provided. You must request this URL to retrieve the next 10MB of data. If the next parameter is absent, it indicates the end of the crawl data.
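The next-URL loop can be sketched as follows; fetch_json is a placeholder standing in for an authenticated GET against the API:

```python
def collect_all_pages(fetch_json, first_url):
    """Follow the `next` URL until it is absent, accumulating each
    response's `data` array along the way."""
    pages, url = [], first_url
    while url:
        chunk = fetch_json(url)
        pages.extend(chunk.get("data", []))
        url = chunk.get("next")  # absent/None means end of crawl data
    return pages
```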
The skip and next parameters are only relevant when hitting the API directly. If you’re using the SDK, pagination is handled automatically and all results are returned at once.

SDK methods
There are two ways to use crawl with the SDK.

Crawl and wait
The crawl method waits for the crawl to complete and returns the full response. It handles pagination automatically. This is recommended for most use cases.
Start and check later
The startCrawl / start_crawl method returns immediately with a crawl ID. You then poll for status manually. This is useful for long-running crawls or custom polling logic.
Real-time results with WebSocket
The watcher method provides real-time updates as pages are crawled. Start a crawl, then subscribe to events for immediate data processing.

Webhooks
You can configure webhooks to receive real-time notifications as your crawl progresses. This allows you to process pages as they are scraped instead of waiting for the entire crawl to complete.
Event types
| Event | Description |
|---|---|
| crawl.started | Fires when the crawl begins |
| crawl.page | Fires for each page successfully scraped |
| crawl.completed | Fires when the crawl finishes |
| crawl.failed | Fires if the crawl encounters an error |
Payload
Verifying webhook signatures
Every webhook request from Firecrawl includes an X-Firecrawl-Signature header containing an HMAC-SHA256 signature. Always verify this signature to ensure the webhook is authentic and has not been tampered with.
- Get your webhook secret from the Advanced tab of your account settings
- Extract the signature from the X-Firecrawl-Signature header
- Compute HMAC-SHA256 of the raw request body using your secret
- Compare with the signature header using a timing-safe function
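The steps above can be sketched in a few lines. This sketch assumes the header carries a hex-encoded HMAC-SHA256 digest of the raw body; confirm the exact header format against the webhook documentation:

```python
import hashlib
import hmac

def verify_signature(secret, raw_body, signature_header):
    """Verify an X-Firecrawl-Signature value against the raw request body.

    secret: your webhook secret (str); raw_body: the unparsed request
    body (bytes); signature_header: hex digest from the header."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # compare_digest performs the timing-safe comparison recommended above
    return hmac.compare_digest(expected, signature_header)
```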
Configuration reference
The full set of parameters available when submitting a crawl job:

| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | (required) | The starting URL to crawl from |
| limit | integer | 10000 | Maximum number of pages to crawl |
| maxDiscoveryDepth | integer | (none) | Maximum depth from the root URL based on link-discovery hops, not the number of / segments in the URL. Each time a new URL is found on a page, it is assigned a depth one higher than the page it was discovered on. The root site and sitemapped pages have a discovery depth of 0. Pages at the max depth are still scraped, but links on them are not followed. |
| includePaths | string[] | (none) | URL pathname regex patterns to include. Only matching paths are crawled. |
| excludePaths | string[] | (none) | URL pathname regex patterns to exclude from the crawl |
| regexOnFullURL | boolean | false | Match includePaths/excludePaths against the full URL (including query parameters) instead of just the pathname |
| crawlEntireDomain | boolean | false | Follow internal links to sibling or parent URLs, not just child paths |
| allowSubdomains | boolean | false | Follow links to subdomains of the main domain |
| allowExternalLinks | boolean | false | Follow links to external websites |
| sitemap | string | "include" | Sitemap handling: "include" (default), "skip", or "only" |
| ignoreQueryParameters | boolean | false | Avoid re-scraping the same path with different query parameters |
| delay | number | (none) | Delay in seconds between scrapes to respect rate limits |
| maxConcurrency | integer | (none) | Maximum concurrent scrapes. Defaults to your team’s concurrency limit. |
| scrapeOptions | object | (none) | Options applied to every scraped page (formats, proxy, caching, actions, etc.) |
| webhook | object | (none) | Webhook configuration for real-time notifications |
| prompt | string | (none) | Natural language prompt to generate crawl options. Explicitly set parameters override generated equivalents. |
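To illustrate how includePaths and excludePaths interact, here is a client-side mirror of the matching rules described in the table. The exact server-side regex semantics are an assumption; this shows the pathname-only matching used when regexOnFullURL is false:

```python
import re
from urllib.parse import urlparse

def allowed_by_filters(url, include=None, exclude=None):
    """Illustrative mirror of includePaths / excludePaths: patterns are
    tested against the URL pathname. An include list must match; any
    exclude match rejects the URL."""
    path = urlparse(url).path
    if include and not any(re.search(p, path) for p in include):
        return False
    if exclude and any(re.search(p, path) for p in exclude):
        return False
    return True

print(allowed_by_filters("https://example.com/blog/post-1", include=[r"^/blog/"]))  # → True
```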
Important details
- Sitemap discovery: By default, the crawler includes the website’s sitemap to discover URLs (sitemap: "include"). If you set sitemap: "skip", only pages reachable through HTML links from the root URL are found. Assets like PDFs or deeply nested pages listed in the sitemap but not directly linked from HTML will be missed. For maximum coverage, keep the default setting.
- Credit usage: Each page crawled costs 1 credit. JSON mode adds 4 credits per page, enhanced proxy adds 4 credits per page, and PDF parsing costs 1 credit per PDF page.
- Result expiration: Job results are available via the API for 24 hours after completion. After that, view results in the activity logs.
- Crawl errors: The data array contains pages Firecrawl successfully scraped. Use the Get Crawl Errors endpoint to retrieve pages that failed due to network errors, timeouts, or robots.txt blocks.
- Non-deterministic results: Crawl results may vary between runs of the same configuration. Pages are scraped concurrently, so the order in which links are discovered depends on network timing and which pages finish loading first. This means different branches of a site may be explored to different extents near the depth boundary, especially at higher maxDiscoveryDepth values. To get more deterministic results, set maxConcurrency to 1 or use sitemap: "only" if the site has a comprehensive sitemap.
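The determinism tips above combine into a single request body. A sketch — field names follow the configuration reference, values are illustrative:

```python
# Request-body sketch for a more deterministic crawl.
deterministic_crawl = {
    "url": "https://example.com",
    "limit": 500,
    "maxConcurrency": 1,           # serialize scraping for stable ordering
    "sitemap": "only",             # rely on the sitemap, not link discovery
    "ignoreQueryParameters": True, # avoid duplicate paths with query strings
}
```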

