---
title: Website
description: Connect a domain you own as a data source so AI Search can crawl and index your website pages.
image: https://developers.cloudflare.com/dev-products-preview.png
---


# Website

You can connect a website you own as a data source for your AI Search instance. AI Search crawls and indexes the pages automatically.

You can only crawl domains that you have onboarded onto the same Cloudflare account. Refer to [Onboard a domain](https://developers.cloudflare.com/fundamentals/manage-domains/add-site/) for more information on adding a domain to your Cloudflare account.

Bot protection may block crawling

If you use Cloudflare products that control or restrict bot traffic such as [Bot Management](https://developers.cloudflare.com/bots/), [Web Application Firewall (WAF)](https://developers.cloudflare.com/waf/), or [Turnstile](https://developers.cloudflare.com/turnstile/), the same rules will apply to the AI Search crawler. Make sure to configure an exception or an allow-list for the AI Search crawler in your settings.

## Get started

You can connect a website when creating a new instance through the [dashboard](https://developers.cloudflare.com/ai-search/get-started/dashboard/), the [REST API](https://developers.cloudflare.com/ai-search/get-started/api/), or [Wrangler](https://developers.cloudflare.com/ai-search/get-started/wrangler/). Website is an optional data source that you can add alongside [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/).

## How website crawling works

When you connect a domain, the crawler looks for your website's sitemap to determine which pages to visit:

1. If you configure one or more custom sitemap URLs in the dashboard under **Parser options** \> **Specific sitemap**, AI Search crawls only those sitemap URLs.
2. Otherwise, the crawler checks `robots.txt` for listed sitemaps.
3. If no `robots.txt` is found, the crawler checks for a sitemap at `/sitemap.xml`.
4. If no sitemap is available, the domain cannot be crawled.
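As an illustration, the discovery chain above can be sketched in Python. This is a hypothetical sketch, not the crawler's actual implementation; the `fetch` helper stands in for an HTTP fetch that returns a page body or `None`:

```python
def discover_sitemaps(custom_sitemaps, fetch):
    """Return the list of sitemap URLs the crawler will use.

    custom_sitemaps: sitemap URLs configured under Parser options > Specific sitemap.
    fetch: callable returning a page body, or None if the URL does not exist.
    """
    # 1. Custom sitemap URLs take precedence over auto-discovery.
    if custom_sitemaps:
        return list(custom_sitemaps)

    # 2. Otherwise, use the sitemaps listed in robots.txt.
    robots = fetch("/robots.txt")
    if robots is not None:
        return [
            line.split(":", 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith("sitemap:")
        ]

    # 3. No robots.txt: fall back to the conventional sitemap location.
    if fetch("/sitemap.xml") is not None:
        return ["/sitemap.xml"]

    # 4. No sitemap at all: the domain cannot be crawled.
    return []
```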

### Indexing order

If your sitemaps include `<priority>` attributes, AI Search reads all sitemaps and indexes pages based on each page's priority value, regardless of which sitemap the page is in.

If no `<priority>` is specified, pages are indexed in the order the sitemaps appear, whether in your configured custom sitemap URLs or in `robots.txt`, read from top to bottom.

AI Search supports `.gz` compressed sitemaps. Both `robots.txt` and sitemaps can use partial URLs.
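As a rough sketch of that ordering rule (the entry shape and the `0.5` default for entries without a `<priority>` are illustrative assumptions, not documented behavior):

```python
def indexing_order(entries):
    """Order sitemap entries for indexing.

    entries: list of (url, priority) tuples, with priority None when
    the sitemap does not specify <priority>.
    """
    if any(p is not None for _, p in entries):
        # Sort by priority, highest first; assume the sitemaps.org
        # default of 0.5 for entries that omit <priority>.
        return sorted(
            entries,
            key=lambda e: e[1] if e[1] is not None else 0.5,
            reverse=True,
        )
    # No priorities anywhere: keep the order the sitemaps provided.
    return list(entries)
```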

### Sync and updates

During scheduled or manual [sync jobs](https://developers.cloudflare.com/ai-search/configuration/indexing/syncing/), the crawler checks the `<lastmod>` attribute in your sitemap for changes. If it has changed to a date after the last sync, the page is re-crawled, the updated version is stored, and the content is automatically reindexed so that your search results always reflect the latest content.

If the `<lastmod>` attribute is not defined, AI Search uses the `<changefreq>` attribute to determine how often to re-crawl the URL. If neither `<lastmod>` nor `<changefreq>` is defined, AI Search automatically crawls each link once a day.
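The re-crawl decision described above can be sketched as follows. This is illustrative only; in particular, the interval chosen for each `<changefreq>` keyword is an assumption:

```python
from datetime import datetime, timedelta

# Rough re-crawl intervals per <changefreq> keyword (assumed values).
CHANGEFREQ_INTERVALS = {
    "always": timedelta(0),
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}

def should_recrawl(lastmod, changefreq, last_sync, now):
    """Decide whether a URL needs re-crawling during a sync job."""
    # <lastmod> wins: re-crawl only if the page changed after the last sync.
    if lastmod is not None:
        return lastmod > last_sync
    # Fall back to <changefreq>; "never" means no re-crawl.
    if changefreq is not None:
        if changefreq == "never":
            return False
        return now - last_sync >= CHANGEFREQ_INTERVALS[changefreq]
    # Neither attribute defined: crawl once a day.
    return now - last_sync >= timedelta(days=1)
```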

## Storage

For instances with [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/), crawled pages are stored in managed storage automatically.

For older instances created before **April 16, 2026**, AI Search creates a dedicated R2 bucket in your account to store crawled pages. This bucket is automatically managed and is used only for content discovered by the crawler.

Note

For instances with a dedicated R2 bucket, do not modify the bucket directly as it may disrupt the indexing flow and cause content to not be updated properly.

## Configuration

### Path filtering

You can control which pages get indexed by defining include and exclude rules for URL paths. Use this to limit indexing to specific sections of your site or to exclude content you do not want searchable.

Note

Path filtering matches against the full URL, including the scheme, hostname, and subdomains. For example, a page at `https://www.example.com/blog/post` requires a pattern like `**/blog/**` to match. Using `/blog/**` alone will not match because it does not account for the hostname.

For example, to index only blog posts while excluding drafts:

* **Include:** `**/blog/**`
* **Exclude:** `**/blog/drafts/**`

Refer to [Path filtering](https://developers.cloudflare.com/ai-search/configuration/indexing/path-filtering/) for pattern syntax, filtering behavior, and more examples.
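To make the matching behavior concrete, here is a minimal sketch of include/exclude evaluation, assuming `*` matches within a path segment and `**` crosses segments; the real pattern semantics are defined on the Path filtering page:

```python
import re

def glob_to_regex(pattern):
    """Translate a path-filter glob to a regex:
    ** crosses path segments, * stays within one segment."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "$")

def is_indexed(url, include, exclude):
    """A URL is indexed if it matches an include pattern (or no include
    rules exist) and matches no exclude pattern. Patterns match the
    full URL, including scheme and hostname."""
    if include and not any(glob_to_regex(p).match(url) for p in include):
        return False
    return not any(glob_to_regex(p).match(url) for p in exclude)
```

With the blog example above, `https://www.example.com/blog/post` is indexed while `https://www.example.com/blog/drafts/x` is excluded.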

For supported file types and size limits, refer to [Data source](https://developers.cloudflare.com/ai-search/configuration/data-source/#supported-file-types).

### Parsing options

You can configure parsing options during onboarding or in your instance settings under **Parser options**.

#### Specific sitemap

By default, AI Search crawls all sitemaps listed in your `robots.txt` in the order they appear (top to bottom). If you do not want the crawler to index everything, or if your sitemap is hosted at a non-standard path, you can configure custom sitemap URLs in the dashboard under **Parser options** \> **Specific sitemap**.

When custom sitemap URLs are configured, AI Search uses those sitemap URLs instead of auto-discovering sitemaps from `robots.txt` or `/sitemap.xml`. You can add up to five sitemap URLs.

#### Rendering mode

You can choose how pages are parsed during crawling:

* **Static sites**: Downloads the raw HTML for each page.
* **Rendered sites**: Loads pages with a headless browser and downloads the fully rendered version, including dynamic JavaScript content. For instances with [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/), Browser Run is included. For older instances, [Browser Run](https://developers.cloudflare.com/browser-run/pricing/) limits and billing apply.

### Extra headers

If your website has pages that are behind authentication or visible only to logged-in users, you can configure custom HTTP headers to allow the AI Search crawler to access this protected content. You can add up to five custom HTTP headers to the requests AI Search sends when crawling your site.

#### Providing access to sites protected by Cloudflare Access

To allow AI Search to crawl a site protected by [Cloudflare Access](https://developers.cloudflare.com/cloudflare-one/access-controls/), you need to create service token credentials and configure them as custom headers.

Service tokens bypass user authentication, so ensure your Access policies are configured appropriately for the content you want to index. The service token will allow the AI Search crawler to access all content covered by the Service Auth policy.

1. In the [Cloudflare dashboard ↗](https://dash.cloudflare.com/), [create a service token](https://developers.cloudflare.com/cloudflare-one/access-controls/service-credentials/service-tokens/#create-a-service-token). Once the Client ID and Client Secret are generated, save them for the next steps. For example, they may look like:  
```  
CF-Access-Client-Id: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access  
CF-Access-Client-Secret: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  
```
2. [Create a policy](https://developers.cloudflare.com/cloudflare-one/access-controls/policies/policy-management/#create-a-policy) with the following configuration:  
   * Add an **Include** rule with **Selector** set to **Service token**.  
   * In **Value**, select the Service Token you created in step 1.
3. [Add your self-hosted application to Access](https://developers.cloudflare.com/cloudflare-one/access-controls/applications/http-apps/self-hosted-public-app/) with the following configuration:  
   * In Access policies, click **Select existing policies**.  
   * Select the policy that you have just created and select **Confirm**.
4. In the Cloudflare dashboard, go to the **AI Search** page.  
[ Go to **AI Search** ](https://dash.cloudflare.com/?to=/:account/ai/ai-search)
5. Select **Create**.
6. Select **Website** as your data source.
7. Under **Parse options**, locate **Extra headers** and add the following two headers using your saved credentials:  
   * Header 1:  
         * **Key**: `CF-Access-Client-Id`  
         * **Value**: `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.access`  
   * Header 2:  
         * **Key**: `CF-Access-Client-Secret`  
         * **Value**: `xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx`
8. Complete the AI Search setup process to create your search instance.

## Custom metadata

You can attach custom metadata to web pages using HTML `<meta>` tags. AI Search extracts metadata from the `<head>` section of each crawled page.

Before custom metadata can be extracted, you must [define a schema](https://developers.cloudflare.com/ai-search/configuration/indexing/metadata/#define-a-schema) in your AI Search configuration.

### Add metadata to web pages

Add `<meta>` tags using either the `name` or `property` attribute:

```html
<!DOCTYPE html>
<html>
  <head>
    <meta name="title" content="Getting Started Guide" />
    <meta name="description" content="Learn how to set up the application" />
    <meta property="og:title" content="Getting Started Guide" />
    <meta property="og:image" content="https://example.com/og-image.png" />
    <meta name="category" content="documentation" />
    <meta name="version" content="2.5" />
    <meta name="is_public" content="true" />
  </head>
  <body>
    <!-- Page content -->
  </body>
</html>
```

### Recognized fields

For the following fields, AI Search knows which meta tags to extract from. You must still define these in your schema to enable extraction.

| Field       | Source                                                            |
| ----------- | ----------------------------------------------------------------- |
| title       | `<meta name="title">` or `<meta property="og:title">`             |
| description | `<meta name="description">` or `<meta property="og:description">` |
| image       | `<meta property="og:image">`                                      |

When both a standard meta tag and an Open Graph tag are present, the standard meta tag takes precedence.
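A minimal sketch of this precedence using Python's standard `html.parser`; the extractor class is illustrative, not the crawler's actual parser:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect <meta> name/property -> content pairs from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            key = a.get("name") or a.get("property")
            if key is not None and "content" in a:
                # Matching is case-insensitive, so normalize keys.
                self.meta[key.lower()] = a["content"]

def extract_title(html):
    """Standard <meta name="title"> takes precedence over og:title."""
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta.get("title") or parser.meta.get("og:title")
```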

### How metadata extraction works

When the crawler fetches a page:

1. All `<meta>` tags with `name` or `property` attributes are parsed from the `<head>` section.
2. Tag names are matched against your schema (case-insensitive).
3. The `content` attribute value is cast to the configured data type.
4. Extracted metadata is stored alongside the cached HTML.
5. On subsequent processing, metadata flows into the vector index.

### Boolean value parsing

For `boolean` fields, the following values are accepted (case-insensitive):

| True values  | False values |
| ------------ | ------------ |
| true, 1, yes | false, 0, no |

Any other value is treated as invalid and the field is omitted.
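A sketch of that parsing rule (trimming surrounding whitespace is an assumption, not documented behavior):

```python
def parse_boolean(value):
    """Parse a boolean metadata value; return None for invalid input,
    in which case the field is omitted."""
    normalized = value.strip().lower()
    if normalized in ("true", "1", "yes"):
        return True
    if normalized in ("false", "0", "no"):
        return False
    return None  # invalid: omit the field
```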

## Content selectors

Content selectors let you control which parts of a crawled page are indexed. Each entry pairs a URL glob pattern with a CSS selector. When a page URL matches a glob pattern, only the elements matching the corresponding CSS selector — and their descendants — are extracted and converted to Markdown for indexing.

The list is ordered and the **first matching path wins**. If a page URL matches multiple glob patterns, only the selector from the first match is applied. Order your entries from most specific to least specific.
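The first-match rule can be sketched as follows. `fnmatchcase` is used here as a stand-in matcher; note its `*` also crosses `/`, which is a simplification of the documented glob semantics:

```python
from fnmatch import fnmatchcase

def select_content_selector(url, entries):
    """Return the CSS selector of the first entry whose glob pattern
    matches the URL, or None when nothing matches (in which case the
    default processing pipeline applies)."""
    for pattern, selector in entries:
        if fnmatchcase(url, pattern):
            return selector
    return None
```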

### Default behavior

Without content selectors, AI Search applies a default processing pipeline that removes elements such as `<header>`, `<footer>`, and `<head>` before converting the remaining content to Markdown. For more details on how HTML is processed, refer to [How HTML is processed](https://developers.cloudflare.com/workers-ai/features/markdown-conversion/how-it-works/#html).

### Configure content selectors in the dashboard

1. Go to the [AI Search ↗](https://dash.cloudflare.com/?to=/:account/ai/ai-search) page in the Cloudflare dashboard.  
[ Go to **AI Search** ](https://dash.cloudflare.com/?to=/:account/ai/ai-search)
2. Select your AI Search instance, or select **Create** to create a new one with a **Website** data source.
3. Under the data source settings, locate the **Content selectors** section.
4. Select **Add selector**.
5. In the **Path** field, enter a glob pattern to match page URLs. For example, `**/blog/**`.
6. In the **Selector** field, enter a CSS selector to extract content from matching pages. For example, `article .post-body`.
7. To add more entries, select **Add selector** again. Entries are evaluated in order from top to bottom.

### Configure content selectors via the API

Content selectors are configured in the `source_params.web_crawler.parse_options.content_selector` field when creating or updating an AI Search instance. The field accepts an array of objects, each with a `path` and `selector` property.

Terminal window

```sh
curl "https://api.cloudflare.com/client/v4/accounts/{account_id}/ai-search/instances" \
  -H "Authorization: Bearer {api_token}" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "my-ai-search",
    "source": "https://example.com",
    "type": "web-crawler",
    "source_params": {
      "web_crawler": {
        "parse_options": {
          "content_selector": [
            {
              "path": "**/blog/**",
              "selector": "article .post-body"
            },
            {
              "path": "**/docs/**",
              "selector": "main .content"
            }
          ]
        }
      }
    }
  }'
```

| Field    | Type   | Description                                                                                                                                                                                                                                                         |
| -------- | ------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| path     | string | Glob pattern to match against the full page URL. Uses the same glob syntax as [path filtering](https://developers.cloudflare.com/ai-search/configuration/indexing/path-filtering/) — \* matches within a segment, \*\* crosses directories. Maximum 200 characters. |
| selector | string | CSS selector to extract content from pages matching the path pattern. Supports standard CSS selectors including element, class, ID, and attribute selectors. Maximum 200 characters.                                                                                |

### Examples

#### Extract main content from blog pages

To index only the article body on blog pages and ignore navigation, sidebars, and footers:

| Path           | Selector           |
| -------------- | ------------------ |
| \*\*/blog/\*\* | article .post-body |

#### Target documentation content

To index the main content area of a documentation site:

| Path           | Selector      |
| -------------- | ------------- |
| \*\*/docs/\*\* | main .content |

#### Different selectors for different sections

You can define multiple entries to apply different selectors to different parts of your site. The first matching path wins, so place more specific patterns first:

| Path                    | Selector           |
| ----------------------- | ------------------ |
| \*\*/blog/releases/\*\* | .release-notes     |
| \*\*/blog/\*\*          | article .post-body |
| \*\*/docs/\*\*          | main .content      |

In this example, a page at `https://example.com/blog/releases/v2` matches the first pattern and uses the `.release-notes` selector. A page at `https://example.com/blog/my-post` skips the first pattern and matches the second.

Warning

If a CSS selector does not match any elements on a page, the resulting Markdown is empty and AI Search marks the item as errored. Verify that your selectors match the expected elements before applying them to a broad set of pages.

### Interaction with other features

* **Path filtering**: [Path filtering](https://developers.cloudflare.com/ai-search/configuration/indexing/path-filtering/) takes priority over content selectors. Pages excluded by path filters are never crawled, so content selectors do not apply to them.
* **Browser Run**: Content selectors apply to the HTML that AI Search receives. For sites that render content with JavaScript, turn on [Browser Run](#rendering-mode) so that selectors can target the fully rendered DOM.
* **Automatic re-indexing**: Updating content selectors triggers a new [sync job](https://developers.cloudflare.com/ai-search/configuration/indexing/) immediately, so changes are applied to all indexed pages.

### Limits

| Limit                            | Value          |
| -------------------------------- | -------------- |
| Maximum content selector entries | 10             |
| Maximum path pattern length      | 200 characters |
| Maximum selector length          | 200 characters |

## Best practices for robots.txt and sitemap

Configure your `robots.txt` and sitemap to help AI Search crawl your site efficiently.

### robots.txt

The AI Search crawler uses the user agent `Cloudflare-AI-Search`. Your `robots.txt` file should reference your sitemap and allow the crawler:

robots.txt

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```

You can list multiple sitemaps or use a sitemap index file:

robots.txt

```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/blog-sitemap.xml
Sitemap: https://example.com/sitemap.xml.gz
```

To block all other crawlers and allow only AI Search:

robots.txt

```
User-agent: *
Disallow: /

User-agent: Cloudflare-AI-Search
Allow: /

Sitemap: https://example.com/sitemap.xml
```

### Sitemap

Structure your sitemap to give AI Search the information it needs to crawl efficiently:

sitemap.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2026-01-10</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
```

Use these attributes to control crawling behavior:

| Attribute      | Purpose                       | Recommendation                                                                                        |
| -------------- | ----------------------------- | ----------------------------------------------------------------------------------------------------- |
| `<loc>`        | URL of the page               | Required. Use full or partial URLs.                                                                   |
| `<lastmod>`    | Last modification date        | Include to enable change detection. AI Search re-crawls pages when this date changes.                 |
| `<changefreq>` | Expected change frequency     | Use when `<lastmod>` is not available. Values: `always`, `hourly`, `daily`, `weekly`, `monthly`, `yearly`, `never`. |
| `<priority>`   | Relative importance (0.0-1.0) | Set higher values for important pages. AI Search indexes pages in priority order.                     |

You can also use a sitemap index to bundle multiple domain-specific sitemaps:

sitemap-index.xml

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap-blog.xml</loc>
    <lastmod>2024-08-15T10:00:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://www.example.com/sitemap-docs.xml</loc>
    <lastmod>2024-08-10T12:00:00+00:00</lastmod>
  </sitemap>
</sitemapindex>
```

When parsing a sitemap index, AI Search collects all child sitemaps and crawls them recursively, gathering every relevant URL listed across your sitemaps.
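That recursive collection can be sketched with Python's standard `xml.etree`; the `fetch` callable is a hypothetical stand-in for the crawler's HTTP fetch:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def collect_urls(sitemap_xml, fetch):
    """Recursively collect page URLs from a sitemap or sitemap index.

    fetch: callable mapping a sitemap URL to its XML body.
    """
    root = ET.fromstring(sitemap_xml)
    if root.tag.endswith("sitemapindex"):
        urls = []
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(fetch(loc.text), fetch))
        return urls
    # A plain <urlset> lists page URLs directly.
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
```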

### Recommendations

* Include `<lastmod>` on all URLs to enable efficient change detection during syncs.
* Set `<priority>` to control indexing order. Pages with higher priority are indexed first.
* Use `<changefreq>` as a fallback when `<lastmod>` is not available.
* Use sitemap index files for large sites with multiple sitemaps.
* Compress large sitemaps using `.gz` format to reduce bandwidth.
* Keep sitemaps under 50MB and 50,000 URLs per file (standard sitemap limits).

## Allow the AI Search crawler through WAF

If you have Security rules configured to block bot activity, you can add a rule to allowlist the crawler bot.

1. In the Cloudflare dashboard, go to the **Security rules** page.  
[ Go to **Security rules** ](https://dash.cloudflare.com/?to=/:account/:zone/security/security-rules)
2. To create a new empty rule, select **Create rule** \> **Custom rules**.
3. Enter a descriptive name for the rule in **Rule name**, such as `Allow AI Search`.
4. Under **When incoming requests match**, use the **Field** drop-down list to choose _Bot Detection ID_. For **Operator**, select _equals_. For **Value**, enter `122933950`.
5. Under **Then take action**, in the **Choose action** dropdown, choose _Skip_.
6. Under **Place at**, select the order of the rule in the **Select order** dropdown to be _First_. Setting the order as _First_ allows this rule to be applied before subsequent rules.
7. To save and deploy your rule, select **Deploy**.

## Limits and pricing

The regular AI Search [limits](https://developers.cloudflare.com/ai-search/platform/limits-pricing/) apply when using the Website data source.

The crawler downloads and indexes pages only up to the maximum object limit for an AI Search instance, processing pages in the order it visits them until that limit is reached. In addition, any downloaded file that exceeds the file size limit will not be indexed.

For instances with [built-in storage](https://developers.cloudflare.com/ai-search/configuration/data-source/built-in-storage/), Browser Run and storage are included. For older instances, [R2](https://developers.cloudflare.com/r2/pricing/), [Vectorize](https://developers.cloudflare.com/vectorize/platform/pricing/), and [Browser Run](https://developers.cloudflare.com/browser-run/pricing/) are billed separately.

