---
title: Website
description: Connect a domain you own as a data source so AI Search can crawl and index your website pages.
image: https://developers.cloudflare.com/dev-products-preview.png
---


# Website

You can connect a website you own as a data source for your AI Search instance. AI Search crawls and indexes the pages automatically.

You can only crawl domains that you have onboarded onto the same Cloudflare account. Refer to [Onboard a domain](https://developers.cloudflare.com/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.

Bot protection may block crawling

If you use Cloudflare products that control or restrict bot traffic such as [Bot Management](https://developers.cloudflare.com/bots/), [Web Application Firewall (WAF)](https://developers.cloudflare.com/waf/), or [Turnstile](https://developers.cloudflare.com/turnstile/), the same rules will apply to the AI Search crawler. Make sure to configure an exception or an allow-list for the AI Search crawler in your settings.

## Get started

You can connect a website when creating a new instance through the [dashboard](https://developers.cloudflare.com/ai-search/get-started/dashboard/), the [REST API](https://developers.cloudflare.com/ai-search/get-started/api/), or [Wrangler](https://developers.cloudflare.com/ai-search/get-started/wrangler/). Website is an optional data source that you can add alongside [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/).

## How website crawling works

When you connect a domain, the crawler looks for your website's sitemap to determine which pages to visit:

1. If you configure one or more custom sitemap URLs in the dashboard under **Parser options** \> **Specific sitemap**, AI Search crawls only those sitemap URLs.
2. Otherwise, the crawler checks `robots.txt` for listed sitemaps.
3. If no `robots.txt` is found, the crawler checks for a sitemap at `/sitemap.xml`.
4. If no sitemap is available, the domain cannot be crawled.
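As an illustration, the discovery chain above can be sketched in Python. This is a hypothetical sketch, not the crawler's actual implementation; the `fetch` helper stands in for an HTTP fetch that returns a page body or `None`:

```python
def discover_sitemaps(custom_sitemaps, fetch):
    """Return the list of sitemap URLs the crawler will use.

    custom_sitemaps: sitemap URLs configured under Parser options > Specific sitemap.
    fetch: callable returning a page body, or None if the URL does not exist.
    """
    # 1. Custom sitemap URLs take precedence over auto-discovery.
    if custom_sitemaps:
        return list(custom_sitemaps)

    # 2. Otherwise, use the sitemaps listed in robots.txt.
    robots = fetch("/robots.txt")
    if robots is not None:
        return [
            line.split(":", 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith("sitemap:")
        ]

    # 3. No robots.txt: fall back to the conventional sitemap location.
    if fetch("/sitemap.xml") is not None:
        return ["/sitemap.xml"]

    # 4. No sitemap at all: the domain cannot be crawled.
    return []
```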

### Indexing order

If your sitemaps include `<priority>` attributes, AI Search reads all sitemaps and indexes pages based on each page's priority value, regardless of which sitemap the page is in.

If no `<priority>` is specified, pages are indexed in the order the sitemaps appear, whether in your configured custom sitemap URLs or in `robots.txt`, read from top to bottom.

AI Search supports `.gz` compressed sitemaps. Both `robots.txt` and sitemaps can use partial URLs.
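As a rough sketch of that ordering rule (the entry shape and the `0.5` default for entries without a `<priority>` are illustrative assumptions, not documented behavior):

```python
def indexing_order(entries):
    """Order sitemap entries for indexing.

    entries: list of (url, priority) tuples, with priority None when
    the sitemap does not specify <priority>.
    """
    if any(p is not None for _, p in entries):
        # Sort by priority, highest first; assume the sitemaps.org
        # default of 0.5 for entries that omit <priority>.
        return sorted(
            entries,
            key=lambda e: e[1] if e[1] is not None else 0.5,
            reverse=True,
        )
    # No priorities anywhere: keep the order the sitemaps provided.
    return list(entries)
```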

### Sync and updates

During scheduled or manual [sync jobs](https://developers.cloudflare.com/ai-search/configuration/indexing/syncing/), the crawler checks the `<lastmod>` attribute in your sitemap for changes. If it has changed to a date after the last sync, the page is re-crawled, the updated version is stored, and the content is automatically reindexed so that your search results always reflect the latest content.

If the `<lastmod>` attribute is not defined, AI Search uses the `<changefreq>` attribute to determine how often to re-crawl the URL. If neither `<lastmod>` nor `<changefreq>` is defined, AI Search automatically crawls each link once a day.
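The re-crawl decision described above can be sketched as follows. This is illustrative only; in particular, the interval chosen for each `<changefreq>` keyword is an assumption:

```python
from datetime import datetime, timedelta

# Rough re-crawl intervals per <changefreq> keyword (assumed values).
CHANGEFREQ_INTERVALS = {
    "always": timedelta(0),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

def should_recrawl(lastmod, changefreq, last_sync, now):
    """Decide whether a URL needs re-crawling during a sync job."""
    # <lastmod> wins: re-crawl only if the page changed after the last sync.
    if lastmod is not None:
        return lastmod > last_sync
    # Fall back to <changefreq>; "never" means no re-crawl.
    if changefreq is not None:
        if changefreq == "never":
            return False
        return now - last_sync >= CHANGEFREQ_INTERVALS[changefreq]
    # Neither attribute defined: crawl once a day.
    return now - last_sync >= timedelta(days=1)
```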

## Storage

For instances with [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/), crawled pages are stored in managed storage automatically.

For older instances created before **April 16, 2026**, AI Search creates a dedicated R2 bucket in your account to store crawled pages. This bucket is automatically managed and is used only for content discovered by the crawler.

Note

For instances with a dedicated R2 bucket, do not modify the bucket directly as it may disrupt the indexing flow and cause content to not be updated properly.

## Configuration

### Path filtering

You can control which pages get indexed by defining include and exclude rules for URL paths. Use this to limit indexing to specific sections of your site or to exclude content you do not want searchable.

Note

Path filtering matches against the full URL, including the scheme, hostname, and subdomains. For example, a page at `https://www.example.com/blog/post` requires a pattern like `**/blog/**` to match. Using `/blog/**` alone will not match because it does not account for the hostname.

For example, to index only blog posts while excluding drafts:

* **Include:** `**/blog/**`
* **Exclude:** `**/blog/drafts/**`

Refer to [Path filtering](https://developers.cloudflare.com/ai-search/configuration/indexing/path-filtering/) for pattern syntax, filtering behavior, and more examples.
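To make the matching behavior concrete, here is a minimal sketch of include/exclude evaluation, assuming `*` matches within a path segment and `**` crosses segments; the real pattern semantics are defined on the Path filtering page:

```python
import re

def glob_to_regex(pattern):
    """Translate a path-filter glob to a regex:
    ** crosses path segments, * stays within one segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def is_indexed(url, include, exclude):
    """A URL is indexed if it matches an include pattern (or no include
    rules exist) and matches no exclude pattern. Patterns match the
    full URL, including scheme and hostname."""
    if include and not any(glob_to_regex(p).match(url) for p in include):
        return False
    return not any(glob_to_regex(p).match(url) for p in exclude)
```

With the blog example above, `https://www.example.com/blog/post` is indexed while `https://www.example.com/blog/drafts/x` is excluded.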

For supported file types and size limits, refer to [Data source](https://developers.cloudflare.com/ai-search/configuration/data-source/#supported-file-types).

### Parsing options

You can configure parsing options during onboarding or in your instance settings under **Parser options**.

#### Specific sitemap

By default, AI Search crawls all sitemaps listed in your `robots.txt` in the order they appear (top to bottom). If you do not want the crawler to index everything, or if your sitemap is hosted at a non-standard path, you can configure custom sitemap URLs in the dashboard under **Parser options** \> **Specific sitemap**.

When custom sitemap URLs are configured, AI Search uses those sitemap URLs instead of auto-discovering sitemaps from `robots.txt` or `/sitemap.xml`. You can add up to five sitemap URLs.

#### Rendering mode

You can choose how pages are parsed during crawling:

* **Static sites**: Downloads the raw HTML for each page.
* **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. For instances with [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/), Browser Run is included. For older instances, [Browser Run](https://developers.cloudflare.com/browser-run/pricing/) limits and billing apply.

### Extra headers

If your website has pages that are behind authentication or visible only to logged-in users, you can configure custom HTTP headers to allow the AI Search crawler to access this protected content. You can add up to five custom HTTP headers to the requests AI Search sends when crawling your site.

#### Providing access to sites protected by Cloudflare Access

To allow AI Search to crawl a site protected by [Cloudflare Access](https://developers.cloudflare.com/cloudflare-one/access-controls/), you need to create service token credentials and configure them as custom headers.

Service tokens bypass user authentication, so ensure your Access policies are configured appropriately for the content you want to index. The service token will allow the AI Search crawler to access all content covered by the Service Auth policy.

1. In the [Cloudflare dashboard ↗](https://dash.cloudflare.com/), [create a service token](https://developers.cloudflare.com/cloudflare-one/access-controls/service-credentials/service-tokens/#create-a-service-token). Once the Client ID and Client Secret are generated, save them for the next steps. For example, they may look like:  
```  
CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access  
CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  
```
2. [Create a policy](https://developers.cloudflare.com/cloudflare-one/access-controls/policies/policy-management/#create-a-policy) with the following configuration:  
   * Add an **Include** rule with **Selector** set to **Service token**.  
   * In **Value**, select the Service Token you created in step 1.
3. [Add your self-hosted application to Access](https://developers.cloudflare.com/cloudflare-one/access-controls/applications/http-apps/self-hosted-public-app/) with the following configuration:  
   * In Access policies, click **Select existing policies**.  
   * Select the policy that you have just created and select **Confirm**.
4. In the Cloudflare dashboard, go to the **AI Search** page.  
[ Go to **AI Search** ](https://dash.cloudflare.com/?to=/:account/ai/ai-search)
5. Select **Create**.
6. Select **Website** as your data source.
7. Under **Parse options**, locate **Extra headers** and add the following two headers using your saved credentials:  
   * Header 1:  
         * **Key**: `CF-Access-Client-Id`  
         * **Value**: `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access`  
   * Header 2:  
         * **Key**: `CF-Access-Client-Secret`  
         * **Value**: `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`
8. Complete the AI Search setup process to create your search instance.

## Custom metadata

You can attach custom metadata to web pages using HTML `<meta>` tags. AI Search extracts metadata from the `<head>` section of each crawled page.

Before custom metadata can be extracted, you must [define a schema](https://developers.cloudflare.com/ai-search/configuration/indexing/metadata/#define-a-schema) in your AI Search configuration.

### Add metadata to web pages

Add `<meta>` tags using either the `name` or `property` attribute:

```html
<!DOCTYPE html>
<html>
  <head>
    <meta name="title" content="Getting Started Guide" />
    <meta name="description" content="Learn how to set up the application" />
    <meta property="og:title" content="Getting Started Guide" />
    <meta property="og:image" content="https://example.com/og-image.png" />
    <meta name="category" content="documentation" />
    <meta name="version" content="2.5" />
    <meta name="is_public" content="true" />
  </head>
  <body>
    <!-- Page content -->
  </body>
</html>
```

### Recognized fields

For the following fields, AI Search knows which meta tags to extract from. You must still define these in your schema to enable extraction.

| Field       | Source                                                            |
| ----------- | ----------------------------------------------------------------- |
| title       | `<meta name="title">` or `<meta property="og:title">`             |
| description | `<meta name="description">` or `<meta property="og:description">` |
| image       | `<meta property="og:image">`                                      |

When both a standard meta tag and an Open Graph tag are present, the standard meta tag takes precedence.
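A minimal sketch of this precedence using Python's standard `html.parser`; the extractor class is illustrative, not the crawler's actual parser:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> name/property -> content pairs from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key is not None and "content" in a:
                # Matching is case-insensitive, so normalize keys.
                self.meta[key.lower()] = a["content"]

def extract_title(html):
    """Standard <meta name="title"> takes precedence over og:title."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta.get("title") or parser.meta.get("og:title")
```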

### How metadata extraction works

When the crawler fetches a page:

1. All `<meta>` tags with `name` or `property` attributes are parsed from the `<head>` section.
2. Tag names are matched against your schema (case-insensitive).
3. The `content` attribute value is cast to the configured data type.
4. Extracted metadata is stored alongside the cached HTML.
5. On subsequent processing, metadata flows into the vector index.

### Boolean value parsing

For `boolean` fields, the following values are accepted (case-insensitive):

| True values  | False values |
| ------------ | ------------ |
| true, 1, yes | false, 0, no |

Any other value is treated as invalid and the field is omitted.
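A sketch of that parsing rule (trimming surrounding whitespace is an assumption, not documented behavior):

```python
def parse_boolean(value):
    """Parse a boolean metadata value; return None for invalid input,
    in which case the field is omitted."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    return None  # invalid: omit the field
```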

## Content selectors

Content selectors let you control which parts of a crawled page are indexed. Each entry pairs a URL glob pattern with a CSS selector. When a page URL matches a glob pattern, only the elements matching the corresponding CSS selector — and their descendants — are extracted and converted to Markdown for indexing.

The list is ordered and the **first matching path wins**. If a page URL matches multiple glob patterns, only the selector from the first match is applied. Order your entries from most specific to least specific.
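The first-match rule can be sketched as follows. `fnmatchcase` is used here as a stand-in matcher; note its `*` also crosses `/`, which is a simplification of the documented glob semantics:

```python
from fnmatch import fnmatchcase

def select_content_selector(url, entries):
    """Return the CSS selector of the first entry whose glob pattern
    matches the URL, or None when nothing matches (in which case the
    default processing pipeline applies)."""
    for pattern, selector in entries:
        if fnmatchcase(url, pattern):
            return selector
    return None
```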

### Default behavior

Without content selectors, AI Search applies a default processing pipeline that removes elements such as `<header>`, `<footer>`, and `<head>` before converting the remaining content to Markdown. For more details on how HTML is processed, refer to [How HTML is processed](https://developers.cloudflare.com/workers-ai/features/markdown-conversion/how-it-works/#html).

### Configure content selectors in the dashboard

1. Go to the [AI Search ↗](https://dash.cloudflare.com/?to=/:account/ai/ai-search) page in the Cloudflare dashboard.  
[ Go to **AI Search** ](https://dash.cloudflare.com/?to=/:account/ai/ai-search)
2. Select your AI Search instance, or select **Create** to create a new one with a **Website** data source.
3. Under the data source settings, locate the **Content selectors** section.
4. Select **Add selector**.
5. In the **Path** field, enter a glob pattern to match page URLs. For example, `**/blog/**`.
6. In the **Selector** field, enter a CSS selector to extract content from matching pages. For example, `article .post-body`.
7. To add more entries, select **Add selector** again. Entries are evaluated in order from top to bottom.

### Configure content selectors via the API

Content selectors are configured in the `source_params.web_crawler.parse_options.content_selector` field when creating or updating an AI Search instance. The field accepts an array of objects, each with a `path` and `selector` property.

Terminal window

```sh
curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-search/instances" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-ai-search",
    "source": "https://example.com",
    "type": "web-crawler",
    "source_params": {
      "web_crawler": {
        "parse_options": {
          "content_selector": [
            {
              "path": "**/blog/**",
              "selector": "article .post-body"
            },
            {
              "path": "**/docs/**",
              "selector": "main .content"
            }
          ]
        }
      }
    }
  }'
```

| Field    | Type   | Description                                                                                                                                                                                                                                                         |
| -------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| path     | string | Glob pattern to match against the full page URL. Uses the same glob syntax as [path filtering](https://developers.cloudflare.com/ai-search/configuration/indexing/path-filtering/) — \* matches within a segment, \*\* crosses directories. Maximum 200 characters. |
| selector | string | CSS selector to extract content from pages matching the path pattern. Supports standard CSS selectors including element, class, ID, and attribute selectors. Maximum 200 characters.                                                                                |

### Examples

#### Extract main content from blog pages

To index only the article body on blog pages and ignore navigation, sidebars, and footers:

| Path           | Selector           |
| -------------- | ------------------ |
| \*\*/blog/\*\* | article .post-body |

#### Target documentation content

To index the main content area of a documentation site:

| Path           | Selector      |
| -------------- | ------------- |
| \*\*/docs/\*\* | main .content |

#### Different selectors for different sections

You can define multiple entries to apply different selectors to different parts of your site. The first matching path wins, so place more specific patterns first:

| Path                    | Selector           |
| ----------------------- | ------------------ |
| \*\*/blog/releases/\*\* | .release-notes     |
| \*\*/blog/\*\*          | article .post-body |
| \*\*/docs/\*\*          | main .content      |

In this example, a page at `https://example.com/blog/releases/v2` matches the first pattern and uses the `.release-notes` selector. A page at `https://example.com/blog/my-post` skips the first pattern and matches the second.

Warning

If a CSS selector does not match any elements on a page, the resulting Markdown is empty and AI Search marks the item as errored. Verify that your selectors match the expected elements before applying them to a broad set of pages.

### Interaction with other features

* **Path filtering**: [Path filtering](https://developers.cloudflare.com/ai-search/configuration/indexing/path-filtering/) takes priority over content selectors. Pages excluded by path filters are never crawled, so content selectors do not apply to them.
* **Browser Run**: Content selectors apply to the HTML that AI Search receives. For sites that render content with JavaScript, turn on [Browser Run](#rendering-mode) so that selectors can target the fully rendered DOM.
* **Automatic re-indexing**: Updating content selectors triggers a new [sync job](https://developers.cloudflare.com/ai-search/configuration/indexing/) immediately, so changes are applied to all indexed pages.

### Limits

| Limit                            | Value          |
| -------------------------------- | -------------- |
| Maximum content selector entries | 10             |
| Maximum path pattern length      | 200 characters |
| Maximum selector length          | 200 characters |

## Best practices for robots.txt and sitemap

Configure your `robots.txt` and sitemap to help AI Search crawl your site efficiently.

### robots.txt

The AI Search crawler uses the user agent `Cloudflare-AI-Search`. Your `robots.txt` file should reference your sitemap and allow the crawler:

robots.txt

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

You can list multiple sitemaps or use a sitemap index file:

robots.txt

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
Sitemap: https://example.com/sitemap.xml.gz
```

To block all other crawlers and allow only AI Search:

robots.txt

```
User-agent: *
Disallow: /

User-agent: Cloudflare-AI-Search
Allow: /

Sitemap: https://example.com/sitemap.xml
```

### Sitemap

Structure your sitemap to give AI Search the information it needs to crawl efficiently:

sitemap.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2026-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Use these attributes to control crawling behavior:

| Attribute      | Purpose                       | Recommendation                                                                                        |
| -------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------- |
| `<loc>`        | URL of the page               | Required. Use full or partial URLs.                                                                   |
| `<lastmod>`    | Last modification date        | Include to enable change detection. AI Search re-crawls pages when this date changes.                 |
| `<changefreq>` | Expected change frequency     | Use when `<lastmod>` is not available. Values: `always`, `hourly`, `daily`, `weekly`, `monthly`, `yearly`, `never`. |
| `<priority>`   | Relative importance (0.0-1.0) | Set higher values for important pages. AI Search indexes pages in priority order.                     |

You can also use a sitemap index to bundle multiple domain-specific sitemaps:

sitemap-index.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2024-08-15T10:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-docs.xml</loc>
    <lastmod>2024-08-10T12:00:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```

When parsing a sitemap index, AI Search collects all child sitemaps and crawls them recursively, gathering every relevant URL listed across your sitemaps.
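That recursive collection can be sketched with Python's standard `xml.etree`; the `fetch` callable is a hypothetical stand-in for the crawler's HTTP fetch:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_urls(sitemap_xml, fetch):
    """Recursively collect page URLs from a sitemap or sitemap index.

    fetch: callable mapping a sitemap URL to its XML body.
    """
    root = ET.fromstring(sitemap_xml)
    if root.tag.endswith("sitemapindex"):
        urls = []
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(fetch(loc.text), fetch))
        return urls
    # A plain <urlset> lists page URLs directly.
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
```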

### Recommendations

* Include `<lastmod>` on all URLs to enable efficient change detection during syncs.
* Set `<priority>` to control indexing order. Pages with higher priority are indexed first.
* Use `<changefreq>` as a fallback when `<lastmod>` is not available.
* Use sitemap index files for large sites with multiple sitemaps.
* Compress large sitemaps using `.gz` format to reduce bandwidth.
* Keep sitemaps under 50MB and 50,000 URLs per file (standard sitemap limits).

## Allow the AI Search crawler through WAF

If you have Security rules configured to block bot activity, you can add a rule to allowlist the crawler bot.

1. In the Cloudflare dashboard, go to the **Security rules** page.  
[ Go to **Security rules** ](https://dash.cloudflare.com/?to=/:account/:zone/security/security-rules)
2. To create a new empty rule, select **Create rule** \> **Custom rules**.
3. Enter a descriptive name for the rule in **Rule name**, such as `Allow AI Search`.
4. Under **When incoming requests match**, use the **Field** drop-down list to choose _Bot Detection ID_. For **Operator**, select _equals_. For **Value**, enter `122933950`.
5. Under **Then take action**, in the **Choose action** dropdown, choose _Skip_.
6. Under **Place at**, select the order of the rule in the **Select order** dropdown to be _First_. Setting the order as _First_ allows this rule to be applied before subsequent rules.
7. To save and deploy your rule, select **Deploy**.

## Limits and pricing

The regular AI Search [limits](https://developers.cloudflare.com/ai-search/platform/limits-pricing/) apply when using the Website data source.

The crawler downloads and indexes pages only up to the maximum object limit for an AI Search instance, processing pages in the order it visits them until that limit is reached. In addition, any downloaded file that exceeds the file size limit will not be indexed.

For instances with [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/), Browser Run and storage are included. For older instances, [R2](https://developers.cloudflare.com/r2/pricing/), [Vectorize](https://developers.cloudflare.com/vectorize/platform/pricing/), and [Browser Run](https://developers.cloudflare.com/browser-run/pricing/) are billed separately.

