/crawl - Crawl web content
The /crawl endpoint scrapes content from a starting URL and follows links across the site, up to a configurable depth or page limit. Responses can be returned as HTML, Markdown, or JSON.
Before you begin, make sure you create a custom API Token with the Browser Rendering - Edit permission. For more information, refer to REST API — Before you begin.
Endpoint: https://api.cloudflare.com/client/v4/accounts/<account_id>/browser-rendering/crawl
Required field: url (string)
Refer to optional parameters for additional customization options.
- Building knowledge bases or training AI systems (such as RAG applications) with up-to-date web content
- Scraping and analyzing content across multiple pages for research, summarization, or monitoring
There are two steps to using the /crawl endpoint:
- Initiate the crawl job: A POST request where you initiate the crawl and receive a response with a job id.
- Request results of the crawl job: A GET request where you request the status or results of the crawl.
Crawl jobs have a maximum run time of seven days. If a job does not finish within this time, it will be cancelled due to timeout. Job results are available for 14 days after the job completes, after which the job data is deleted.
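The two steps above can be sketched in JavaScript as a pair of request builders. This is a minimal sketch rather than an official client: the helper names `buildStartCrawlRequest` and `buildGetCrawlRequest` are hypothetical, and the commented usage assumes a runtime with a global `fetch`.

```javascript
const API_BASE = "https://api.cloudflare.com/client/v4/accounts";

// Step 1: describe the POST request that starts a crawl job.
// (Hypothetical helper; `body` must contain at least the required `url` field.)
function buildStartCrawlRequest(accountId, apiToken, body) {
  return {
    url: `${API_BASE}/${accountId}/browser-rendering/crawl`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiToken}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(body),
    },
  };
}

// Step 2: describe the GET request that reads the status or results of a job.
function buildGetCrawlRequest(accountId, apiToken, jobId) {
  return {
    url: `${API_BASE}/${accountId}/browser-rendering/crawl/${encodeURIComponent(jobId)}`,
    init: {
      method: "GET",
      headers: { Authorization: `Bearer ${apiToken}` },
    },
  };
}

// Usage sketch (performs network calls, so it is left commented out):
// const start = buildStartCrawlRequest(accountId, apiToken, {
//   url: "https://developers.cloudflare.com/workers/",
// });
// const { result: jobId } = await (await fetch(start.url, start.init)).json();
// const get = buildGetCrawlRequest(accountId, apiToken, jobId);
// const job = await (await fetch(get.url, get.init)).json();
```

Separating request construction from the network call keeps the two steps easy to inspect and test without contacting the API.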
Send a POST request with a url to start a crawl job. The API responds immediately with a job id you will use to retrieve results. Refer to optional parameters for additional customization options.
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://developers.cloudflare.com/workers/"
  }'
Example response:
{
  "success": true,
  "result": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
}
To check the status or request the results of your crawl job, use the job id you received:
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
The response includes a status field indicating the current state of the crawl job. The possible job statuses are:
- running: The crawl job is currently in progress.
- cancelled_due_to_timeout: The crawl job exceeded the maximum run time of seven days.
- cancelled_due_to_limits: The crawl job was cancelled because it hit account limits.
- cancelled_by_user: The crawl job was manually cancelled by the user.
- errored: The crawl job encountered an error.
- completed: The crawl job finished successfully.
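The status list above can be encoded as a small lookup so that calling code treats every status other than running as terminal. This is an illustrative sketch; `isTerminal` and `describeJobStatus` are hypothetical helper names, and any status string not in the list is reported as unknown rather than guessed at.

```javascript
// Terminal job statuses taken from the list above.
const TERMINAL_STATUSES = new Set([
  "cancelled_due_to_timeout",
  "cancelled_due_to_limits",
  "cancelled_by_user",
  "errored",
  "completed",
]);

// True when the job will not make further progress.
function isTerminal(status) {
  return TERMINAL_STATUSES.has(status);
}

// Map a status string to a coarse outcome for logging or branching.
function describeJobStatus(status) {
  if (status === "running") return "in progress";
  if (status === "completed") return "finished successfully";
  if (isTerminal(status)) return "stopped without completing";
  return "unknown status"; // not in the documented list
}
```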
Since crawl jobs run asynchronously, you can poll the endpoint periodically to check when the job finishes. Add ?limit=1 to the request URL so the response stays lightweight — you only need the job status, not the full set of crawled records.
async function waitForCrawl(accountId, jobId, apiToken) {
  const maxAttempts = 60; // 60 attempts x 5 seconds = 5 minutes of polling
  const delayMs = 5000;

  for (let i = 0; i < maxAttempts; i++) {
    // ?limit=1 keeps the response small; only the job status is needed here.
    const response = await fetch(
      `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}?limit=1`,
      {
        headers: {
          Authorization: `Bearer ${apiToken}`,
        },
      },
    );
    // Fail early on HTTP errors instead of reading a missing result field.
    if (!response.ok) {
      throw new Error(`Status request failed with HTTP ${response.status}`);
    }

    const data = await response.json();
    const status = data.result.status;

    // Any status other than "running" is terminal.
    if (status !== "running") {
      return data.result;
    }

    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }

  throw new Error("Crawl job did not complete within timeout");
}
Once the job reaches a terminal status, fetch the full results without the limit parameter. You can also use the following query parameters to filter and paginate results:
- cursor: Cursor for pagination. If the response exceeds 10 MB, a cursor value will be included. Pass it as a query parameter to retrieve the next page of results.
- limit: Maximum number of records to return.
- status: Filter by URL status: queued, completed, disallowed, skipped, errored, or cancelled.
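Cursor-based pagination with these parameters might look like the following sketch. The helper names `buildResultsUrl` and `fetchAllRecords` are hypothetical, the `fetchImpl` argument exists only so the loop can be exercised without network access, and the loop assumes the last page is signalled by a response with no cursor value.

```javascript
// Build the results URL with optional cursor, limit, and status parameters.
function buildResultsUrl(accountId, jobId, { cursor, limit, status } = {}) {
  const url = new URL(
    `https://api.cloudflare.com/client/v4/accounts/${accountId}/browser-rendering/crawl/${jobId}`,
  );
  if (cursor !== undefined) url.searchParams.set("cursor", String(cursor));
  if (limit !== undefined) url.searchParams.set("limit", String(limit));
  if (status !== undefined) url.searchParams.set("status", status);
  return url.toString();
}

// Follow cursors until none is returned, collecting every record.
async function fetchAllRecords(accountId, jobId, apiToken, query = {}, fetchImpl = fetch) {
  const records = [];
  let cursor;
  do {
    const response = await fetchImpl(
      buildResultsUrl(accountId, jobId, { ...query, cursor }),
      { headers: { Authorization: `Bearer ${apiToken}` } },
    );
    if (!response.ok) {
      throw new Error(`Results request failed with HTTP ${response.status}`);
    }
    const { result } = await response.json();
    records.push(...(result.records ?? []));
    cursor = result.cursor; // assumed to be absent on the last page
  } while (cursor !== undefined && cursor !== null);
  return records;
}
```

For example, `fetchAllRecords(accountId, jobId, apiToken, { status: "completed" })` would collect only the successfully crawled pages across all result pages.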
Example with query parameters:
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e?cursor=10&limit=10&status=completed' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
Example response:
{
  "result": {
    "id": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e",
    "status": "completed",
    "browserSecondsUsed": 134.7,
    "total": 50,
    "finished": 50,
    "records": [
      {
        "url": "https://developers.cloudflare.com/workers/",
        "status": "completed",
        "markdown": "# Cloudflare Workers\nBuild and deploy serverless applications...",
        "metadata": {
          "status": 200,
          "title": "Cloudflare Workers · Cloudflare Workers docs",
          "url": "https://developers.cloudflare.com/workers/"
        }
      },
      {
        "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/",
        "status": "completed",
        "markdown": "## Quickstarts\nGet up and running with a simple Hello World...",
        "metadata": {
          "status": 200,
          "title": "Quickstarts · Cloudflare Workers docs",
          "url": "https://developers.cloudflare.com/workers/get-started/quickstarts/"
        }
      }
      // ... 48 more entries omitted for brevity
    ],
    "cursor": 10
  },
  "success": true
}
To cancel a crawl job that is currently in progress, use the job id you received:
curl -X DELETE 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
A successful cancellation will return a 200 OK status code. The job status will be updated to cancelled, and all URLs that have been queued to be crawled will be cancelled.
The following optional parameters can be used in your crawl request, in addition to the required url parameter. For the full list, refer to the API docs.
| Optional parameter | Type | Description |
|---|---|---|
| limit | Number | Maximum number of pages to crawl (default is 10, maximum is 100,000). |
| depth | Number | Maximum link depth to crawl from the starting URL (default is 100,000, maximum is 100,000). |
| source | String | Source for discovering URLs. Options are all, sitemaps, or links. Default is all. |
| formats | Array of strings | Response format (default is HTML, other options are Markdown and JSON). The JSON format leverages Workers AI by default for data extraction, which incurs usage on Workers AI. Refer to the /json endpoint to learn more, including how to use a custom model and fallbacks. |
| render | Boolean | If false, does a fast HTML fetch without executing JavaScript (default is true, learn more about render). |
| jsonOptions | Object | Only required if formats includes json. Contains prompt, response_format, and custom_ai properties (same types as the /json endpoint). |
| maxAge | Number | Maximum length of time in seconds the crawler can use a cached resource before it must re-fetch it from the origin server (default is 86,400, maximum is 604,800). Cache is served from R2 only if the URL and parameters exactly match. |
| modifiedSince | Number | Unix timestamp (in seconds) indicating to only crawl pages that were modified since this time. |
| options.includeExternalLinks | Boolean | If true, follows links to external domains (default is false). |
| options.includeSubdomains | Boolean | If true, follows links to subdomains of the starting URL (default is false). |
| options.includePatterns | Array of strings | Only visits URLs that match one of these wildcard patterns. Use * to match any characters except /, or ** to match any characters including /. |
| options.excludePatterns | Array of strings | Does not visit URLs that match any of these wildcard patterns. Use * to match any characters except /, or ** to match any characters including /. |
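A request body can be checked on the client against the maximums in the table before it is sent. The sketch below is illustrative only: `buildCrawlBody` is a hypothetical helper, it checks only the documented maximums and source values (the API performs its own validation), and accepting a `Date` for modifiedSince is a convenience added here, not part of the API.

```javascript
// Documented maximums from the table above.
const MAXIMUMS = { limit: 100000, depth: 100000, maxAge: 604800 };
const SOURCES = ["all", "sitemaps", "links"];

function buildCrawlBody(url, params = {}) {
  for (const [name, max] of Object.entries(MAXIMUMS)) {
    const value = params[name];
    if (value === undefined) continue;
    if (typeof value !== "number" || Number.isNaN(value) || value > max) {
      throw new RangeError(`${name} must be a number no greater than ${max}`);
    }
  }
  if (params.source !== undefined && !SOURCES.includes(params.source)) {
    throw new RangeError(`source must be one of: ${SOURCES.join(", ")}`);
  }

  const body = { url, ...params };
  // modifiedSince is a Unix timestamp in seconds; convert a Date if one is given.
  if (params.modifiedSince instanceof Date) {
    body.modifiedSince = Math.floor(params.modifiedSince.getTime() / 1000);
  }
  return body;
}
```

Calling `buildCrawlBody("https://example.com/docs", { limit: 200, depth: 5, formats: ["markdown"] })` returns an object that can be passed to `JSON.stringify` for the POST request.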
excludePatterns has strictly higher priority. If a URL matches an exclude rule, it is skipped, regardless of whether it matches an include rule.
- No rules — Everything is indexed.
- Exclude only — Everything is indexed except items matching the exclude patterns.
- Include only — Only items matching the include patterns are indexed; everything else is ignored.
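The precedence rules above can be illustrated with a small matcher. This is a sketch of the documented rules, not the crawler's actual implementation: `wildcardToRegExp` and `shouldVisit` are hypothetical names, and anchoring each pattern against the whole URL is an assumption, so the real endpoint may match differently.

```javascript
// Convert a wildcard pattern to a RegExp.
// "*" matches any characters except "/"; "**" matches any characters including "/".
// Assumption: the pattern must match the entire URL.
function wildcardToRegExp(pattern) {
  let source = "";
  for (let i = 0; i < pattern.length; i++) {
    if (pattern[i] === "*") {
      if (pattern[i + 1] === "*") {
        source += ".*";
        i++; // consume the second "*"
      } else {
        source += "[^/]*";
      }
    } else {
      // Escape characters that are special in regular expressions.
      source += pattern[i].replace(/[.+?^${}()|[\]\\]/g, "\\$&");
    }
  }
  return new RegExp(`^${source}$`);
}

// Apply the precedence rules: exclude wins, then include (if any) must match.
function shouldVisit(url, { includePatterns = [], excludePatterns = [] } = {}) {
  const matches = (pattern) => wildcardToRegExp(pattern).test(url);
  if (excludePatterns.some(matches)) return false; // exclude has priority
  if (includePatterns.length === 0) return true; // no include rules
  return includePatterns.some(matches);
}
```

With `includePatterns: ["https://example.com/docs/**"]` and `excludePatterns: ["https://example.com/docs/changelog/**"]`, a URL under `/docs/changelog/` is rejected even though it also matches the include pattern.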
To view URLs that were discovered but skipped, query the crawl job results with status=skipped. URLs can be skipped due to includeExternalLinks, includeSubdomains, includePatterns/excludePatterns, or the modifiedSince parameter. Skipped URLs will also be visible in the dashboard in a future release.
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}?status=skipped' \
  -H 'Authorization: Bearer YOUR_API_TOKEN'
If you use render: true, which is the default, the crawl endpoint spins up a headless browser and executes page JavaScript. If you use render: false, the crawl endpoint does a fast HTML fetch without executing JavaScript.
Use render: true when the page builds content in the browser. Use render: false when the content you need is already in the initial HTML response.
Crawls that use render: true use a headless browser and are billed under typical Browser Rendering pricing. Crawls that use render: false run on Workers instead of a headless browser. During the beta, render: false crawls are not billed. After the beta, they will be billed under Workers pricing.
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://www.exampledocs.com/docs/",
    "limit": 50,
    "depth": 2,
    "formats": ["markdown"],
    "render": false,
    "maxAge": 7200,
    "modifiedSince": 1704067200,
    "source": "all",
    "options": {
      "includeExternalLinks": true,
      "includeSubdomains": true,
      "includePatterns": [
        "**/api/v1/*"
      ],
      "excludePatterns": [
        "*/learning-paths/*"
      ]
    }
  }'
Crawl only documentation pages and exclude specific sections:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/docs",
    "limit": 200,
    "depth": 5,
    "formats": ["markdown"],
    "options": {
      "includePatterns": [
        "https://example.com/docs/**"
      ],
      "excludePatterns": [
        "https://example.com/docs/changelog/**",
        "https://example.com/docs/archive/**"
      ]
    }
  }'
Extract structured product data using the json format. This leverages Workers AI by default.
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://shop.example.com/products",
    "limit": 50,
    "formats": ["json"],
    "jsonOptions": {
      "prompt": "Extract product name, price, description, and availability",
      "response_format": {
        "type": "json_schema",
        "json_schema": {
          "name": "product",
          "properties": {
            "name": "string",
            "price": "number",
            "currency": "string",
            "description": "string",
            "inStock": "boolean"
          }
        }
      }
    },
    "options": {
      "includePatterns": [
        "https://shop.example.com/products/*"
      ]
    }
  }'
Fetch static HTML without rendering for faster crawling of static sites:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "render": false,
    "formats": ["html", "markdown"]
  }'
Crawl pages behind HTTP authentication or with custom headers:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://secure.example.com",
    "limit": 50,
    "authenticate": {
      "username": "user",
      "password": "pass"
    }
  }'
You can also use cookies or custom headers for token-based authentication:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://api.example.com/docs",
    "limit": 100,
    "setExtraHTTPHeaders": {
      "X-API-Key": "your-api-key"
    }
  }'
Crawl single-page applications that load content dynamically:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://app.example.com",
    "limit": 50,
    "gotoOptions": {
      "waitUntil": "networkidle2",
      "timeout": 60000
    },
    "waitForSelector": {
      "selector": "[data-content-loaded]",
      "timeout": 30000,
      "visible": true
    }
  }'
Speed up crawling by blocking images and media:
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com",
    "limit": 100,
    "rejectResourceTypes": [
      "image",
      "media",
      "font",
      "stylesheet"
    ]
  }'
The crawler discovers and processes URLs in the following order (when using source: all, the default):
- Starting URL — The URL specified in your request.
- Sitemap links — URLs found in the site's sitemap.
- Page links — Links scraped from pages, if not already found in the sitemap.
Use the source parameter to customize which sources the crawler uses. The available options are:
- all: Uses both sitemaps and page links (default).
- sitemaps: Only crawls URLs found in the site's sitemap.
- links: Only crawls links found on pages, ignoring sitemaps.
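One way to picture the discovery order and the effect of source is the toy model below. It is not the crawler's implementation: `orderDiscoveredUrls` is a hypothetical function, the sitemap and page-link lists are supplied by the caller, and treating the starting URL as always included is an assumption.

```javascript
// Toy model of discovery order: starting URL, then sitemap URLs,
// then page links that were not already found in the sitemap.
function orderDiscoveredUrls({ startUrl, sitemapUrls = [], pageLinks = [], source = "all" }) {
  const ordered = [startUrl];
  if (source === "all" || source === "sitemaps") ordered.push(...sitemapUrls);
  if (source === "all" || source === "links") ordered.push(...pageLinks);
  // A Set keeps the first occurrence of each URL in insertion order.
  return [...new Set(ordered)];
}
```

With source set to all, a link that appears both in the sitemap and on a page is listed once, at its sitemap position.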
The /crawl endpoint respects the directives of robots.txt files, including crawl-delay. All URLs that /crawl is directed not to crawl are listed in the response with "status": "disallowed". For guidance on configuring robots.txt and sitemaps for sites you plan to crawl, refer to robots.txt and sitemaps.
You can change the user agent at the page level by passing userAgent as a top-level parameter in the JSON body. This is useful if the target website serves different content based on the user agent.
The /crawl endpoint uses CloudflareBrowserRenderingCrawler/1.0 as its default User-Agent, which is different from the other REST API endpoints. For a full list of default User-Agent strings, refer to Automatic request headers.
If your crawl job completes but returns an empty records array, or all URLs show skipped or disallowed status:
- robots.txt blocking: The crawler respects robots.txt rules. The /crawl endpoint identifies itself as CloudflareBrowserRenderingCrawler/1.0. Check the target site's robots.txt file to verify this user agent is allowed. Blocked URLs appear with "status": "disallowed".
- Pattern filters too restrictive: Your includePatterns may not match any URLs on the site. Try crawling without patterns first to confirm URLs are discoverable, then add patterns.
- No links found: The starting URL may not contain links. Try using source: "sitemaps", increasing the depth parameter, or setting includeSubdomains or includeExternalLinks to true.
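To narrow down which of these causes applies, it can help to tally the records in a job result by status. The following is a minimal sketch; `countByStatus` and `diagnoseEmptyCrawl` are hypothetical helpers that operate on the records array from a crawl job response, and the hints they return simply restate the checks listed above.

```javascript
// Count records per URL status, for example { disallowed: 2, skipped: 1 }.
function countByStatus(records) {
  const counts = {};
  for (const record of records) {
    counts[record.status] = (counts[record.status] ?? 0) + 1;
  }
  return counts;
}

// Suggest which troubleshooting item to look at first.
function diagnoseEmptyCrawl(records) {
  if (records.length === 0) {
    return "no records: check the starting URL, source, and depth";
  }
  const counts = countByStatus(records);
  if (counts.disallowed === records.length) {
    return "all disallowed: check robots.txt for the crawler user agent";
  }
  if (counts.skipped === records.length) {
    return "all skipped: check include and exclude patterns and link options";
  }
  return "mixed statuses: inspect the per-status counts";
}
```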
If a crawl job remains in running status for an extended period:
- Slow page loads: Pages with heavy JavaScript take longer to render. Use render: false if the content you need is in the initial HTML.
- Rate limiting: Sites with strict rate limits slow crawling. The crawler respects robots.txt Crawl-delay and implements backoff. Reduce limit and run multiple smaller crawls.
- Unnecessary resources: Block resources that are not needed for content extraction using rejectResourceTypes (for example, image, media, font).
A cancelled_due_to_limits status means your account hit its browser time limit. Workers Free plan accounts are capped at 10 minutes of browser use per day. To resolve this:
- Upgrade to a Workers Paid plan for higher limits.
- Use render: false for static content to avoid consuming browser time.
- Increase maxAge to use cached results where possible.
- Reduce the limit parameter.
If the json format returns null or empty results:
- Provide a clear prompt — Be specific about what data to extract and where it appears on the page (for example, "Extract the product name, price, and description from the main product section").
- Define a response schema: Use response_format with a JSON schema to enforce the expected output structure.
- Use a custom model: If the default Workers AI model does not produce the desired results, use the custom_ai parameter to specify a different model. Refer to Using a custom model (BYO API Key) for details.
If you have questions or encounter other errors, refer to the Browser Rendering FAQ and troubleshooting guide.