The old contract between the site and the bot has broken down

In the old web economy, allowing a bot onto a site was treated as an almost unconditional benefit. Search crawling led to indexing, indexing led to visibility, visibility led to traffic, and traffic led to advertising, subscriptions, or sales. It was a crude model, but it worked long enough to become almost a natural law of the internet. Answer systems disrupted precisely that law. Now the same text can participate in several chains at once: it can support a search answer, serve as training material for a model, be used to “ground” an answer at query time, or be retrieved through a direct user-initiated action. These chains look similar technically, but they differ economically, which means the question of access to content stops being binary. It no longer reduces to “do we let the bot in or not?” It breaks down into a harder question: “which bot, for what purpose, and on what terms are we prepared to admit it?”

To discuss this seriously, one has to distinguish at least four access modes. The first is crawling and indexing for search visibility. The second is the use of content to train future models. The third is the use of a search index or a web document to answer at the moment of the query — in other words, to ground the answer operationally. The fourth is user-initiated access to the site, when the system acts as an intermediary for the user’s request. If these modes are lumped together, the brand loses control and starts making decisions based either on vague fears or, conversely, on naive optimism.

Four access modes and their new separation

Google and OpenAI have already, in effect, formalized this distinction in their own rules. Google Search Central states directly that AI search features — AI Overviews and AI Mode — are governed by the same access rules as ordinary search: the key agent here remains Googlebot, while visibility restrictions in search AI features rely on familiar mechanisms such as nosnippet, data-nosnippet, max-snippet, or noindex [1]. At the same time, Google emphasizes that Google-Extended is a separate token through which a publisher can control the use of content for training future generations of Gemini and for grounding in Gemini Apps and certain cloud scenarios; Google-Extended does not affect inclusion in Google Search and is not a ranking signal [2]. A very important conclusion follows from this: at Google, search visibility and model training have already been institutionally separated. It is no longer intellectually honest to say simply “we allowed Google” or “we blocked Google” without specifying which process is actually meant.
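
Concretely, the separation Google describes can be expressed directly in robots.txt: the Googlebot rules govern search visibility, while the separate Google-Extended token governs training and Gemini grounding [1][2]. A minimal sketch (the blanket Disallow is illustrative; a real policy might restrict only specific paths):

```
# robots.txt — stay visible in Google Search and AI Overviews,
# but opt out of training future Gemini models
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /
```

Snippet-level controls live in the page itself rather than in robots.txt, for example a `<meta name="robots" content="max-snippet:160">` tag or a `data-nosnippet` attribute on sensitive fragments [1].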

OpenAI frames a similar distinction even more explicitly. Its documentation says that OAI-SearchBot is responsible for the appearance of sites in ChatGPT’s search functions, GPTBot is used for training foundation models, and ChatGPT-User handles actions initiated by the user [3]. Moreover, a webmaster can allow OAI-SearchBot so that the site can participate in search answers while blocking GPTBot so that the content is not used for training [3]. In essence, this creates a new publisher right: the right to distinguish between useful visibility and unwanted value extraction.
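
In robots.txt terms, the combination OpenAI describes looks like this (again, the blanket Disallow is only an illustration):

```
# robots.txt — participate in ChatGPT search answers,
# but keep content out of foundation-model training
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
```

ChatGPT-User, the agent for user-initiated requests, is typically left unblocked in this configuration, since blocking it cuts off the user rather than the model.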

This is exactly the basis on which the new access economics emerges. In 2025, Cloudflare put the problem in especially stark terms: old search crawlers and publishers were linked by a symbiotic exchange, whereas many new training bots consume content while returning almost no traffic [4]. According to Cloudflare, in June 2025 Google crawled sites roughly 14 times for every one referral, whereas OpenAI’s crawl-to-return ratio stood at 1,700:1 and Anthropic’s at 73,000:1 [4]. Even if one allows for the fact that some referrals from apps may not be captured in the Referer header, the asymmetry is too large to dismiss as statistical noise [4]. It means that the old informal contract — “you get content, we get audience” — no longer operates automatically in many AI scenarios.
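
The crawl-to-referral ratio behind these figures is straightforward to compute from one's own access logs. The sketch below uses simplified, invented log records; the function and field layout are illustrative, not a standard log format:

```python
# Hypothetical, simplified access-log records: (user_agent, referer)
LOG = [
    ("GPTBot/1.0", ""),
    ("GPTBot/1.0", ""),
    ("GPTBot/1.0", ""),
    ("Mozilla/5.0", "https://chatgpt.com/"),
    ("Googlebot/2.1", ""),
    ("Mozilla/5.0", "https://www.google.com/"),
]

def ratio(crawler_token: str, referrer_host: str) -> float:
    """Crawl requests per referral visit: a rough proxy for value asymmetry."""
    crawls = sum(1 for ua, _ in LOG if crawler_token.lower() in ua.lower())
    referrals = sum(1 for _, ref in LOG if referrer_host in ref)
    # No referrals at all means pure extraction: content consumed, no traffic back
    return crawls / referrals if referrals else float("inf")

print(ratio("GPTBot", "chatgpt.com"))     # 3.0 — three crawls per referral
print(ratio("Googlebot", "google.com"))   # 1.0
```

As the text notes, referrals from apps may not carry a Referer header at all, so a ratio computed this way is an upper bound on asymmetry, not an exact measurement [4].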

From total blocking to differentiated governance

But this is also where the brand risks falling into the opposite extreme: the temptation of total blocking. Such a decision may look morally clear, yet it is not always economically sound. If all forms of access are blocked, the result may be not only exclusion from training, but also the loss of some channels of visibility, research, and sales. There are already early empirical signals that blocking bots may be associated with lower traffic for major publishers relative to those that do not block access, although such results still require cautious interpretation [5]. The point is not that blocking is forbidden. The point is that blocking has ceased to be a neutral defensive gesture. It has become a strategic choice with multiple downstream consequences.

That is why a mature brand position has to be differentiated. If a company wants to be visible in ChatGPT Search but does not want its texts used to train future models, that is already technically possible through separate rules for OAI-SearchBot and GPTBot [3]. If a brand has no objection to participating in Google Search and AI Overviews but does not want content to be used for Gemini training, that can be expressed through a combination of allowing Googlebot and restricting Google-Extended [1][2]. In other words, the market is gradually moving toward a regime of fine-grained tuning of access rights rather than a crude “yes” or “no.”
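
A combined policy of this kind can be sanity-checked with Python's standard library before deployment. The sketch below verifies, for each documented agent token [1][2][3], what a given robots.txt actually permits; the test path is arbitrary:

```python
from urllib.robotparser import RobotFileParser

# A combined policy: visible in search, excluded from model training.
POLICY = """\
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

for agent in ("Googlebot", "Google-Extended", "OAI-SearchBot", "GPTBot"):
    verdict = "allowed" if rp.can_fetch(agent, "/research/report") else "blocked"
    print(f"{agent}: {verdict}")
```

This prints that Googlebot and OAI-SearchBot are allowed while Google-Extended and GPTBot are blocked, which is exactly the "visibility without training" configuration described above.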

Against that backdrop, attempts to turn access to content into a transactional object are of particular interest. In the summer of 2025, Cloudflare introduced a pay per crawl model in which a domain owner can choose one of three modes for a specific bot: allow access for free, charge for crawling, or block completely [6]. For now, this is more of an infrastructure experiment than a mass standard. But its significance is hard to overstate. For the first time, it makes visible the fact that crawling no longer has to remain a free gift. If an AI company extracts value from someone else’s content outside the logic of traffic return, then the question of the price of that access becomes entirely rational.
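
Mechanically, the three modes amount to a per-crawler decision at the edge. The sketch below is a hypothetical illustration of that decision logic, not Cloudflare's actual API; only the three modes themselves come from the announcement [6], and the policy table, function, and status-code mapping are invented for illustration:

```python
from enum import Enum

class CrawlMode(Enum):
    ALLOW = "allow"    # free access, as with classic search crawlers
    CHARGE = "charge"  # access granted only against payment
    BLOCK = "block"    # no access at all

# Hypothetical per-crawler configuration for one domain
CRAWLER_POLICY = {
    "googlebot": CrawlMode.ALLOW,
    "gptbot": CrawlMode.CHARGE,
    "ccbot": CrawlMode.BLOCK,
}

def respond(user_agent: str, has_paid: bool) -> int:
    """Return an HTTP status for a crawler request under the configured mode."""
    mode = CRAWLER_POLICY.get(user_agent.lower(), CrawlMode.ALLOW)
    if mode is CrawlMode.BLOCK:
        return 403  # Forbidden
    if mode is CrawlMode.CHARGE and not has_paid:
        return 402  # Payment Required
    return 200

print(respond("Googlebot", has_paid=False))  # 200
print(respond("GPTBot", has_paid=False))     # 402
print(respond("GPTBot", has_paid=True))      # 200
print(respond("CCBot", has_paid=True))       # 403
```

The interesting design point is the middle branch: HTTP 402 Payment Required, a status code that sat almost unused for decades, is a natural fit for the charge mode, turning a formerly free request into a priced transaction.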

There is another practical side to the problem that is rarely discussed in public. Many sites still do a poor job of formalizing their own rules of engagement with bots. Cloudflare notes that only about 37% of the largest domains have a robots.txt file at all, and among existing robots.txt files, restrictions on the key AI agents are surprisingly rare [4]. That means a significant share of the internet entered the new era without having articulated its own legal and technical position. Companies debate AI as a global cultural problem, yet at the infrastructure level they have not even stated their own “yes” or “no” in a format machines can read.
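
Auditing one's own stated position can start very simply: does robots.txt mention the key AI agents at all? The sketch below checks a robots.txt body against a short agent list; the list is illustrative, not exhaustive, and the sample policy is invented:

```python
AI_AGENTS = ("GPTBot", "OAI-SearchBot", "Google-Extended", "CCBot", "ClaudeBot")

def stated_positions(robots_txt: str) -> dict:
    """Which known AI agents does this robots.txt mention explicitly?"""
    lower = robots_txt.lower()
    return {agent: agent.lower() in lower for agent in AI_AGENTS}

sample = """\
User-agent: *
Disallow: /admin/

User-agent: GPTBot
Disallow: /
"""

positions = stated_positions(sample)
print(positions)
# Only GPTBot is addressed; toward every other agent the site has stated nothing.
```

An unmentioned agent is not the same as an allowed one, but it does mean the site has expressed no machine-readable position, which is precisely the gap the Cloudflare data describes [4].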

Content as an asset with access terms

For brands, this is not an abstract legal issue. It is a question of the cost and role of content. Some materials are created as marketing assets for maximum distribution. Others are research assets that required investment, so the brand may want to limit free extraction. Still others function as commercial catalogs, where up-to-date visibility matters most. And others serve as operational documentation that should be shown only in specific scenarios. A modern access strategy has to distinguish among at least these classes and assign them different participation modes in the answer environment.
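
As a thought experiment, those content classes can be written down as an explicit policy table. The class names, mode labels, and mapping below are illustrative assumptions, not a standard; the four modes follow the access types distinguished earlier in this article:

```python
# Hypothetical policy: content class -> permitted participation modes
CONTENT_POLICY = {
    "marketing": {"search", "training", "grounding", "user_action"},
    "research":  {"search", "grounding"},                  # no free training extraction
    "catalog":   {"search", "grounding", "user_action"},   # freshness matters most
    "docs":      {"user_action"},                          # shown only on direct request
}

def is_permitted(content_class: str, mode: str) -> bool:
    """Unknown classes default to nothing permitted (deny by default)."""
    return mode in CONTENT_POLICY.get(content_class, set())

print(is_permitted("research", "training"))   # False
print(is_permitted("catalog", "grounding"))   # True
```

The value of such a table is less in the code than in the discipline: it forces the brand to state, per content class, which of the four chains it is actually willing to feed.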

For ai100, the topic of access economics is especially rich from a research perspective. It allows one to build an observation base across several layers at once: which agents actually access the site, how robots.txt is configured, where access is allowed and where it is restricted, how that affects the brand’s visibility in answer systems, and how crawl volumes relate to actual return traffic or commercial interest. Over time, this material may become one of the most valuable assets in the entire base, because much of the market still discusses AI access in moral categories rather than in terms of a measurable architecture of value exchange.

The main conclusion here is fairly strict. In the new environment, content is no longer merely a message. It is an asset with several channels of value extraction. It can bring in a customer, shape a machine answer, train a future model, or become a good for which the publisher will sooner or later ask compensation. That is why the brand’s right to manage its presence is not the right to disappear. It is the right to choose the specific mode in which its knowledge will participate in the economics of AI. And in the coming years, the winners will not be those who are the loudest in outrage or enthusiasm, but those who build a calm, precise, and technically competent access policy for their own knowledge.

What seems well established

It is already well established that major platforms separate search crawling from training, and that a brand can configure access to those modes differently. The economic asymmetry between crawling and returned traffic has also been publicly documented.

What still remains uncertain

What is far less certain is what market mechanisms for charging for crawling will ultimately become and how quickly they will turn into a mass norm. Here the market is still in an experimental stage.

What this changes in practice

For a company, the practical implication is that access policy needs to become part of content strategy and engineering architecture, not a random set of lines in robots.txt.

Sources

[1] Google Search Central. AI Features and Your Website. 2025-2026
[2] Google for Developers. Google's common crawlers - Google-Extended. 2025-2026
[3] OpenAI Developers. Overview of OpenAI Crawlers. 2026
[4] Cloudflare Blog. Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content. 2025
[5] Zhao H., Berman R. The Impact of LLMs on Online News Consumption and Production. 2026
[6] Cloudflare Blog. Introducing pay per crawl: Enabling content owners to charge AI crawlers for access. 2025
