The 22 AI Crawlers Every Local Business Should Allow (and the Robots.txt to Copy)

May 18, 2026 · 9 min read · 2,080 words

Anthony (Tony) Velte

Founder & Principal · Author of 12+ books

The fastest way to become visible to AI search platforms is to make sure their crawlers can actually read your site. Many websites still carry a robots.txt rule that blocks GPTBot, and most do not mention the other 21 AI user-agents at all. Local businesses that want to be cited by ChatGPT, Claude, Perplexity, and Google AI Overviews need to explicitly allow 22 specific crawlers. This post lists every one, explains who runs it and what it does, and ends with a copy-paste robots.txt block you can drop onto your site today.

Why "Block GPTBot" Is the Wrong Default for Local Businesses

In late 2023 a wave of opinion pieces argued that website owners should block GPTBot to prevent OpenAI from "training on their content." A lot of small businesses copied that robots.txt rule without thinking through what it actually costs. The trade-off is straightforward: blocking GPTBot does not retroactively scrub your content from models that already exist, and it keeps your business out of the content that future ChatGPT models are built on. Blanket rules that also catch ChatGPT-User or OAI-SearchBot go further, because they stop ChatGPT from retrieving your site in real time when someone asks it a question. ChatGPT is one of the most widely used AI assistants, with a weekly user base measured in the hundreds of millions (Gartner estimates over 800 million weekly users across AI assistants).

For a content publisher selling syndicated articles, the calculus can go either way. For a local business whose goal is to be discovered, the calculus is one-sided. AI crawler access is one of the six dimensions of our SignalScore methodology specifically because shutting out crawlers is a self-inflicted wound. Local businesses competing for citation in AI answers should be doing the opposite of what the publishing-industry advice column told them to do — they should be explicitly welcoming every legitimate AI crawler by name.

AI Crawler Access carries a 10% weight in SignalScore. It is the cheapest dimension to fix — a single deploy of an updated robots.txt — and one of the highest-leverage, because no other GEO work matters if the crawler cannot reach the page.

The 22 Crawlers, Grouped by Operator

The list below mirrors the canonical allowlist that LocalStar Digital runs on its own site — you can fetch the live version any time at localstardigital.com/robots.txt. Each entry covers who operates the crawler, what it is fetching for, and why a local business should let it in.

OpenAI (3 crawlers)

1. GPTBot

GPTBot is OpenAI's primary web crawler. It harvests publicly available content for use in training and improving OpenAI's models, and OpenAI documents the user-agent string and IP ranges on its platform. The relevant question for a local business is not whether GPTBot indexes you for training — it is whether the corpus of content that future ChatGPT models are built on contains your business or not. Blocking GPTBot guarantees the answer is no. Allowing it gives you the chance to be part of the substrate of future AI responses, including local-business recommendation contexts.

2. ChatGPT-User

ChatGPT-User is the user-agent ChatGPT uses when it fetches a page in real time on behalf of someone in an active conversation — for example, when a user clicks a citation link, or when ChatGPT browses the web to answer a question. This is the live-retrieval surface, not the training surface, and blocking it is the most directly damaging block a local business can apply. If a customer asks ChatGPT "who is the best [service] in [city]" and ChatGPT tries to fetch your site to verify a claim before citing you, you want that request to succeed.

3. OAI-SearchBot

OAI-SearchBot is the crawler behind OpenAI's search product (SearchGPT and the search-grounded variants of ChatGPT). It indexes pages for inclusion in OpenAI's search index — the same kind of role Googlebot plays for Google search. Allowing it is how your business becomes eligible to surface in ChatGPT's search-grounded responses. Blocking it removes you from that index entirely.

Anthropic (4 crawlers)

4. ClaudeBot

ClaudeBot is Anthropic's general-purpose web crawler. Like GPTBot, it gathers publicly available content that may inform model training and improvement. Anthropic publishes guidance for site owners on its documentation and support sites, including the user-agent strings to recognize. Local businesses building for AI visibility should allow ClaudeBot for the same reason they allow GPTBot — Claude is a major AI assistant and being part of the corpus it reasons over is the floor of visibility.

5. Claude-Web

Claude-Web is a user-agent string that has been associated with Claude's live browsing capability, meaning fetches of a specific URL in real time during a conversation. Anthropic's current crawler documentation names Claude-User for user-initiated fetches, so treat Claude-Web as an older variant and check your logs for both strings. Either way, the role is analogous to ChatGPT-User: blocking it interrupts the live retrieval flow that produces the most current, best-cited answers.

6. anthropic-ai

anthropic-ai is a historical user-agent string associated with Anthropic web fetching. Some site operators have observed it intermittently alongside ClaudeBot and Claude-Web. Including it in your allowlist is a belt-and-suspenders move — if Anthropic rotates or expands its agent strings, you do not want a legitimate request silently denied because your robots.txt only listed one of the variants.

7. Claude-SearchBot

Claude-SearchBot is associated with Anthropic's search-grounded answer features. As Claude's search and tool-use capabilities expand, this agent's footprint grows. Allowing it ensures your business is reachable when Claude is constructing a search-grounded response that could cite you.

Google (2 crawlers)

8. Google-Extended

Google-Extended is the robots.txt token Google uses to control whether your content is used for Gemini and Vertex AI training and grounding. It is separate from the main Googlebot used for traditional search, meaning your decision about Google-Extended only affects AI products. Google's crawler overview notes that the token has no HTTP user-agent string of its own (the fetching is done by Google's existing crawlers, so it will not appear in your server logs) and states explicitly that disallowing Google-Extended does not affect Google Search ranking. For a local business, that means there is essentially no downside to allowing it and a real upside: visibility inside Gemini. Google AI Overviews are a Search feature and are governed by Googlebot access rather than by this token.

9. GoogleOther

GoogleOther is a generalized agent Google uses for one-off internal research, product development, and miscellaneous fetching outside the primary search index. It is documented in Google's crawler reference. Allowing it costs nothing and ensures you are not accidentally hiding from a Google product surface that has not been carved out into its own named agent yet.

Perplexity (2 crawlers)

10. PerplexityBot

PerplexityBot is Perplexity AI's search and indexing crawler. Perplexity is fundamentally a citation-driven AI search engine — its responses almost always link out to the underlying sources — which makes it one of the most consequential AI surfaces for local-business referral traffic. Blocking PerplexityBot is, for a local business, equivalent to blocking Googlebot in 2010. There is no upside.

11. Perplexity-User

Perplexity-User is the agent associated with live fetches Perplexity makes in response to a user query — the real-time retrieval layer that backstops citations. Allow it for the same reason you allow ChatGPT-User and Claude-Web: it is the request that happens at the moment a real customer is asking a real question, and you want that fetch to succeed.

Microsoft (2 crawlers, plus msnbot)

12. bingbot

bingbot is Microsoft's primary search crawler and the upstream feed for Bing search, Bing Chat, and Microsoft Copilot grounding. Bing's share of conventional search is modest, but its share of the AI-grounding substrate is meaningful — Copilot, ChatGPT's web tool in some configurations, and several other AI products historically lean on Bing's index. Allow bingbot. (Note: msnbot is the legacy Microsoft user-agent and is worth listing as an alias for older traffic — included in the allowlist but not counted in the 22 because it is functionally a duplicate of bingbot.)

13. CopilotBot

CopilotBot is the user-agent associated with Microsoft Copilot's fetching activity. As Copilot expands across Windows, Edge, Office, and standalone surfaces, this agent's reach grows. Allowing it ensures your business is reachable to a product Microsoft is pushing hard across its very large Windows and Office install base.

Mistral (1 crawler)

14. MistralAI-User

MistralAI-User is associated with Mistral's AI products fetching pages during query handling. Mistral's European market presence is significant and growing, and for local businesses with European customer reach, blocking costs far more than allowing. Allow it.

Other AI and Discovery Crawlers (8)

15. Amazonbot

Amazonbot is Amazon's web crawler. It powers Alexa's answer features and informs other Amazon AI surfaces. The Alexa install base alone makes it worth listing — your business being reachable when an Alexa user asks a local-services question is non-trivial visibility.

16. Applebot-Extended

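Applebot-Extended is the robots.txt token Apple provides for controlling AI use of your content, distinct from the main Applebot that powers Siri and Spotlight. Per Apple's documentation it does not crawl pages itself; it governs whether content already fetched by Applebot may be used for Apple's generative AI models. With Apple Intelligence rolling out across iOS, macOS, and iPadOS, this is the token that determines whether your business is part of Apple's on-device and cloud AI corpus. Allow it.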

17. Bytespider

Bytespider is ByteDance's crawler (the company behind TikTok and Doubao). It feeds ByteDance's AI products and content discovery. For local businesses with any social-media or international consumer surface, allowing Bytespider extends your reachable corpus into a large non-Western AI ecosystem.

18. CCBot

CCBot is the Common Crawl crawler. Common Crawl produces a large open web dataset that has been a documented training-data source for major language models — OpenAI's published GPT-3 paper, for example, lists Common Crawl as its single largest training source — and it is widely reused across open-source model projects. Because so many models draw on the same open corpus, allowing CCBot is one of the higher-leverage single decisions in this list: it is upstream of multiple AI ecosystems at once.

19. cohere-ai

cohere-ai is associated with Cohere's enterprise AI products fetching pages for retrieval and grounding. Cohere's consumer footprint is small, but its enterprise integrations — including AI features inside business tools your prospective customers may be using — make it worth allowing.

20. DuckAssistBot

DuckAssistBot is associated with DuckDuckGo's AI-assist features. DuckDuckGo's privacy positioning attracts a meaningful subset of users, and DuckAssist surfaces AI summaries grounded in indexed content. Allowing this agent extends your reach into that audience.

21. meta-externalagent

meta-externalagent is Meta's crawler for AI product use. Meta AI is integrated into WhatsApp, Instagram, Facebook, and the standalone Meta AI assistant — apps with a combined user base in the billions, one of the largest consumer-AI distribution footprints anywhere. Whatever you think of Meta strategically, the size of the surface argues for allowing the crawler.

22. YouBot

YouBot is You.com's crawler. You.com is a smaller AI search engine but a citation-friendly one, with growing usage among users specifically looking for AI search that links out aggressively. The marginal cost of allowing is zero; the marginal benefit is being eligible for that citation flow.

The Copy-Paste Robots.txt Allowlist

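Below is the canonical block, in standard robots.txt syntax. Drop it into your site's /robots.txt file. The leading wildcard group preserves your existing crawl policy for everything else; the explicit per-agent allows ensure that no AI crawler is denied even if a future general rule changes. One consequence to be aware of: a crawler that matches a named group follows only that group, so the Disallow lines under the wildcard do not apply to the named AI crawlers. If /api/ or /admin/ must stay off-limits to them too, repeat those Disallow lines inside each named group. Replace the sitemap URL with your own.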

Paste this into /robots.txt (one directive per line — your CMS or framework may render the formatting; the live machine-readable version we run on our own site is always at localstardigital.com/robots.txt):

User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /thank-you/

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: GoogleOther
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: bingbot
Allow: /

User-agent: msnbot
Allow: /

User-agent: CopilotBot
Allow: /

User-agent: MistralAI-User
Allow: /

User-agent: Amazonbot
Allow: /

User-agent: Applebot-Extended
Allow: /

User-agent: Bytespider
Allow: /

User-agent: CCBot
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: meta-externalagent
Allow: /

User-agent: YouBot
Allow: /

Sitemap: https://your-domain.com/sitemap.xml

Each directive sits on its own line, and a blank line separates one user-agent group from the next. In a Next.js app, the cleanest way to ship this is via the App Router's robots.ts metadata route (the same approach LocalStar uses), which generates the file at build time from a typed configuration.
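For sites that are not on Next.js, the same generate-from-configuration idea can be sketched in a few lines of Python. This is a minimal illustration, not LocalStar's implementation: the names CRAWLERS_BY_OPERATOR, LEGACY_ALIASES, and build_robots_txt are hypothetical, and the default disallowed paths are simply the ones shown in the block above.

```python
# Hypothetical sketch: render the allowlist from a small configuration.
# The operator grouping follows this article's list of 22 crawlers.
CRAWLERS_BY_OPERATOR = {
    "OpenAI": ["GPTBot", "ChatGPT-User", "OAI-SearchBot"],
    "Anthropic": ["ClaudeBot", "Claude-Web", "anthropic-ai", "Claude-SearchBot"],
    "Google": ["Google-Extended", "GoogleOther"],
    "Perplexity": ["PerplexityBot", "Perplexity-User"],
    "Microsoft": ["bingbot", "CopilotBot"],
    "Mistral": ["MistralAI-User"],
    "Other": ["Amazonbot", "Applebot-Extended", "Bytespider", "CCBot",
              "cohere-ai", "DuckAssistBot", "meta-externalagent", "YouBot"],
}

# msnbot is listed as a legacy alias of bingbot and is not counted in the 22.
LEGACY_ALIASES = ["msnbot"]


def build_robots_txt(sitemap_url, disallow_paths=("/api/", "/admin/", "/thank-you/")):
    """Return robots.txt text with one directive per line.

    Groups are separated by a blank line; aliases are emitted after the
    22 main crawlers, so the order differs slightly from the article's block.
    """
    wildcard = ["User-agent: *", "Allow: /"]
    wildcard += [f"Disallow: {path}" for path in disallow_paths]
    groups = [wildcard]

    agents = [name for names in CRAWLERS_BY_OPERATOR.values() for name in names]
    for agent in agents + LEGACY_ALIASES:
        groups.append([f"User-agent: {agent}", "Allow: /"])

    body = "\n\n".join("\n".join(group) for group in groups)
    return f"{body}\n\nSitemap: {sitemap_url}\n"
```

Calling build_robots_txt("https://your-domain.com/sitemap.xml") and writing the result to the web root's robots.txt gives the same content as the hand-written block. Keeping the list in one place makes it harder to add a crawler to the article's list and forget it in the file.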

How to Verify Your Robots.txt Is Working

Shipping the rule is half of the work. Verifying that it is live and being respected is the other half. Run these three checks after updating robots.txt, in order of priority:

1. Fetch your live robots.txt with curl. Run "curl -i -A GPTBot https://your-domain.com/robots.txt" and confirm the response is HTTP 200 and contains the full allowlist. The -i flag prints the status line and headers, which curl omits by default. If your CDN or framework is mangling the response, this is where you find out.
2. Check server logs for AI user-agents. Within 24-72 hours of deploy, you should see entries from GPTBot, ClaudeBot, PerplexityBot, and friends. No hits over a week suggests the file is not being served correctly or your site has no inbound discoverability path yet.
3. Run a real query in ChatGPT, Claude, and Perplexity. Ask each one a question your business should be the answer to. If you appear cited, the loop is closed. If not, the issue is upstream of robots.txt: citability, brand authority, or content structure, which belong to the other five SignalScore dimensions.
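The first check can also be scripted. The sketch below uses only Python's standard library: fetch_robots_txt sends the request with a chosen User-Agent header, and blocked_agents feeds the response to urllib.robotparser to see which named crawlers would be denied. Both function names are hypothetical. Note also that robotparser applies the first matching rule within a group, whereas the current robots.txt standard uses the longest match, so treat its verdict on anything beyond a plain "Allow: /" or "Disallow: /" as approximate.

```python
import urllib.request
import urllib.robotparser


def fetch_robots_txt(url, user_agent="GPTBot", timeout=10):
    """Fetch robots.txt while presenting the given User-Agent header.

    Returns (status_code, body_text). urlopen raises urllib.error.HTTPError
    for 4xx/5xx responses, so a normal return means the request succeeded.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status, body


def blocked_agents(robots_text, agents, path="/"):
    """Return the agents that this robots.txt text would deny for `path`."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


# Example usage (your-domain.com is the article's placeholder):
#   status, body = fetch_robots_txt("https://your-domain.com/robots.txt")
#   print(status, blocked_agents(body, ["GPTBot", "ClaudeBot", "PerplexityBot"]))
```

An empty list from blocked_agents means none of the agents you passed in is disallowed at the site root. It says nothing about whether a CDN or firewall blocks the crawler before robots.txt is ever consulted, which is why the log check still matters.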

For a complete picture of how AI Crawler Access fits with the other five dimensions of AI visibility, see how we score it on our SignalScore page. For the full audit-and-fix engagement that addresses all six dimensions in sequence, see our GEO services. And if you want the short version of why we built this methodology in the first place, read our about page.

AI Crawler Access is a 10% slice of SignalScore. Get it right and you have removed the cheapest possible blocker to citation. Get the other 90% right too — citability, content quality, schema, technical health, and brand authority — and you have built the substrate AI assistants need to confidently recommend you. Start with the SignalScore diagnostic on our homepage and we will tell you exactly where the leverage is.

Frequently Asked Questions

Allowing GPTBot does mean your publicly available content may be included in OpenAI's training data for future models. The honest framing for a local business is that this is a feature, not a bug: being part of the training corpus is what makes your business citable inside future ChatGPT responses. "Content theft" framing was developed for publishers selling syndicated articles, where the economics are reversed. For a local services business whose content exists to attract customers, allowing legitimate AI crawlers is consistent with the underlying business model — the content is already public, and visibility is the point.

Robots.txt is the opt-in mechanism — it is how the industry has chosen to express crawler preferences for thirty years, and the major AI operators (OpenAI, Anthropic, Google, Perplexity, Microsoft) document and honor it. If you want to opt out of a specific crawler, you disallow it by name. If you want to opt in, you allow it. There is no third state where you are simultaneously discoverable and excluded. For local businesses, the right answer is almost always explicit opt-in.

For almost every local-business website, no. AI crawlers crawl at modest rates compared to Googlebot, and most local-business sites are well within their hosting plan's headroom even with every AI crawler active. If your site is on a very small shared-hosting plan and you see specific crawlers hammering it (Bytespider has historically had a heavier footprint than others), you can rate-limit at the CDN or web-server layer without dropping the crawler entirely. Outright blocking is rarely the right answer.
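Before rate-limiting anything, it helps to measure which crawler is actually responsible for the load. The sketch below tallies access-log lines per crawler by looking for each token anywhere in the line. It is a rough illustration under stated assumptions: the token list is a subset of this article's crawlers, count_ai_hits is a hypothetical name, and matching on the whole line can over-count if a token happens to appear in a URL or referrer.

```python
from collections import Counter

# Subset of the crawler tokens named in this article; extend as needed.
AI_AGENT_TOKENS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "PerplexityBot", "Perplexity-User", "bingbot", "Amazonbot",
    "Bytespider", "CCBot", "meta-externalagent",
]


def count_ai_hits(log_lines, tokens=AI_AGENT_TOKENS):
    """Count log lines that mention each crawler token (case-insensitive).

    Each line is attributed to at most one token: the first one that matches.
    """
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break
    return hits


# Example usage against a log file (the path is an assumption):
#   with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as f:
#       print(count_ai_hits(f).most_common(5))
```

The same tally doubles as the server-log check described in the verification section: a crawler with zero hits a week after deploy is worth investigating, and one that dominates the counts is the candidate for a rate limit.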

New AI crawlers appear periodically as new AI products launch or existing ones split out specialized agents, and existing crawlers occasionally rename or version their user-agent strings. We re-verify the LocalStar canonical allowlist on a regular cadence against each operator's published documentation, and we keep the live version current at localstardigital.com/robots.txt. The simplest way to stay current is to re-fetch that file periodically and diff it against your own, or subscribe to the LocalStar newsletter for updates.
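The re-fetch-and-diff step can be reduced to comparing the set of user-agent names in two files. The following is a minimal sketch, assuming you have already downloaded both files as text; listed_agents and missing_agents are hypothetical names, and the comparison looks only at which agents are named, not at the rules inside each group.

```python
def listed_agents(robots_text):
    """Return the lowercased user-agent names declared in a robots.txt body."""
    agents = set()
    for raw_line in robots_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, separator, value = line.partition(":")
        if separator and field.strip().lower() == "user-agent" and value.strip():
            agents.add(value.strip().lower())
    return agents


def missing_agents(reference_text, own_text):
    """Agents named in the reference file but absent from your own, sorted."""
    return sorted(listed_agents(reference_text) - listed_agents(own_text))
```

Passing the reference file as the first argument and your own file as the second returns the crawlers you have not yet listed. An empty result means every agent named in the reference is at least mentioned in your file.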
