Here’s how you can block OpenAI’s ChatGPT crawler from accessing your website within five minutes. There are growing worries about how difficult it is to prevent one’s online content from being used to train large language models (LLMs) such as ChatGPT. While methods exist to restrict this usage, they aren’t particularly user-friendly or completely reliable.
OpenAI has released official robots.txt guidelines for restricting GPTBot access.
GPTBot serves as OpenAI’s web crawler. According to OpenAI, this bot may traverse the internet to enhance their systems’ capabilities.
OpenAI hasn’t explicitly stated whether GPTBot is involved in creating training datasets for ChatGPT. This ambiguity is worth noting if you’re considering blocking GPTBot to prevent your content from being included in OpenAI’s training data, as this may not achieve your intended goal.
It’s also worth mentioning that Common Crawl, a publicly available dataset built from comprehensive web crawls, already exists, which makes it somewhat redundant for OpenAI to duplicate that crawling effort.
Additional information about blocking Common Crawl is provided later in this article.
OpenAI documents the GPTBot user agent as follows:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
GPTBot can be restricted through robots.txt using these lines:
User-agent: GPTBot
Disallow: /
GPTBot also respects Allow and Disallow directives that control which sections of a website it is permitted or forbidden to crawl:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
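If you want to confirm that your rules behave as expected, the following is a minimal sketch using Python’s standard urllib.robotparser module; example.com and the test URLs are placeholders for your own domain and directories.

# Minimal sketch: check how a live robots.txt treats GPTBot.
# example.com and the test URLs below are placeholders.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# can_fetch() returns False for URLs the named user agent is disallowed from crawling.
print(parser.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))
print(parser.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))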
OpenAI provides specific IP ranges to authenticate the legitimate GPTBot (distinguishing it from impersonating crawlers).
While these IP ranges can be blocked via .htaccess, they’re subject to change, necessitating regular .htaccess file updates.
This bears repeating: IP ranges may change, so it’s crucial to stay informed about the current ranges.
Therefore, a more practical approach is to use the published ranges only to verify that a crawler identifying itself as GPTBot is genuine, and to rely on robots.txt for the actual blocking (a sketch of such a check follows the IP list below).
As of 08-09-2023, these are the current GPTBot IP ranges:
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
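If you log visitor IP addresses, a small script can check whether a request claiming to be GPTBot really comes from one of these ranges. Here is a rough sketch using Python’s standard ipaddress module; the ranges are hard-coded from the list above, so they will need refreshing whenever OpenAI updates its published list.

# Rough sketch: verify that an IP address falls inside the published GPTBot ranges.
# The ranges below are copied from the list above and may become outdated.
import ipaddress

GPTBOT_RANGES = [
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "20.15.242.192/28", "40.83.2.64/28",
]
NETWORKS = [ipaddress.ip_network(r) for r in GPTBOT_RANGES]

def is_gptbot_ip(remote_addr):
    # Returns True when the address belongs to one of the listed ranges.
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in NETWORKS)

print(is_gptbot_ip("20.15.240.70"))   # inside 20.15.240.64/28, prints True
print(is_gptbot_ip("203.0.113.10"))   # documentation address, prints False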
Large language models (LLMs) rely on diverse training datasets drawn from an extensive array of sources, many of which are publicly accessible and commonly used for AI training.
The primary sources typically include:
Wikipedia
Government court records
Books
Emails
Web-crawled content
Numerous platforms and online repositories exist that provide access to substantial collections of datasets.
One notable example is the Registry of Open Data on AWS, an Amazon-hosted portal that provides access to thousands of valuable datasets.
This Amazon repository, while significant, represents just one of many platforms offering extensive dataset collections.
According to Wikipedia, there are 28 portals dedicated to distributing datasets, including prominent platforms like Google Dataset Search and Hugging Face’s dataset repositories.
ChatGPT is built upon GPT-3.5, alternatively referred to as InstructGPT.
The training datasets for GPT-3.5 mirror those used in GPT-3’s development. The key distinction lies in GPT-3.5’s implementation of reinforcement learning from human feedback (RLHF).
The research paper “Language Models are Few-Shot Learners” (PDF) outlines five primary datasets used in training both GPT-3 and GPT-3.5, detailed on page 9.
These datasets comprise:
Common Crawl (filtered)
WebText2
Books1
Books2
Wikipedia
Among these five datasets, two are derived from Internet crawling:
Common Crawl
WebText2
WebText2 represents a proprietary OpenAI dataset, compiled by crawling Reddit links with at least three upvotes.
This methodology aims to ensure content quality and reliability through community validation.
WebText2 expands upon OpenAI’s original WebText dataset.
The initial WebText dataset contained approximately 15 billion tokens and was instrumental in GPT-2’s training.
WebText2, slightly larger at 19 billion tokens, served as training data for both GPT-3 and GPT-3.5.
While OpenAI’s WebText2 remains private, an open-source alternative called OpenWebText2 exists. This public dataset employs similar crawling patterns, likely yielding comparable URL collections to OpenAI’s WebText2.
This information may interest those curious about WebText2’s contents. OpenWebText2 provides insight into the types of URLs included.
Both a cleaned-up version of OpenWebText2 and the raw, unprocessed version are publicly available for download.
I haven’t been able to locate specific information regarding the user agent employed by either crawler, though it’s possible they simply identify as Python-based crawlers.
Therefore, to my knowledge, there isn’t a specific user agent to block, although I cannot state this with absolute certainty.
However, what we do know is that if your website receives a Reddit link with three or more upvotes, there’s a significant probability that your content has been incorporated into both OpenAI’s proprietary WebText2 dataset and its open-source counterpart, OpenWebText2.
For additional details about OpenWebText2, you can refer to this resource.
Among the most extensively utilized Internet content datasets is the Common Crawl collection, maintained by a non-profit organization bearing the same name.
Common Crawl’s data is accumulated through an automated bot that systematically traverses the Internet.
Organizations that want to use this data download it and then process it, removing low-quality and spam content.
The crawler responsible for gathering Common Crawl data is designated as CCBot.
CCBot adheres to the robots.txt protocol, enabling website owners to prevent Common Crawl from accessing their content and subsequently including it in future datasets.
However, if your website has previously been crawled, it’s probable that your content already exists within various datasets.
Nonetheless, implementing Common Crawl blocking measures can prevent your website’s content from being included in future datasets derived from newer Common Crawl collections.
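If you are curious whether your pages already appear in a Common Crawl collection, you can query the Common Crawl index server at index.commoncrawl.org on a per-crawl basis. The sketch below is illustrative only: CC-MAIN-2023-23 is just an example of the crawl naming scheme (check commoncrawl.org for current crawl identifiers), and example.com is a placeholder for your own domain.

# Illustrative sketch: query the Common Crawl CDX index for captures of a domain.
# CC-MAIN-2023-23 is an example crawl identifier and may not be the latest one.
import json
from urllib.request import urlopen
from urllib.parse import urlencode

INDEX_URL = "https://index.commoncrawl.org/CC-MAIN-2023-23-index"

def list_captures(domain, limit=5):
    # The index returns newline-delimited JSON records, one per captured URL.
    query = urlencode({"url": f"{domain}/*", "output": "json", "limit": limit})
    with urlopen(f"{INDEX_URL}?{query}") as response:
        lines = response.read().decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]

for record in list_captures("example.com"):
    print(record.get("url"), record.get("timestamp"))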
This is why I said at the beginning of this article that the process isn’t particularly user-friendly or guaranteed to work.
The CCBot User-Agent string is:
CCBot/2.0
To block the Common Crawl bot, add these lines to your robots.txt file:
User-agent: CCBot
Disallow: /
You can verify legitimate CCBot activity by confirming that crawling originates from Amazon AWS IP addresses.
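One rough way to do that programmatically is to check a visitor’s IP address against Amazon’s published ip-ranges.json file, which lists all AWS IP prefixes. The sketch below only confirms that an address is on AWS, not that the visitor actually is CCBot, so it should be combined with the CCBot/2.0 user-agent check above.

# Rough sketch: check whether an IP address belongs to Amazon AWS using
# Amazon's published ip-ranges.json. This only shows the address is on AWS,
# not that the visitor is actually CCBot.
import json
import ipaddress
from urllib.request import urlopen

AWS_RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def load_aws_networks():
    with urlopen(AWS_RANGES_URL) as response:
        data = json.load(response)
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

def is_aws_ip(remote_addr, networks):
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in net for net in networks)

networks = load_aws_networks()
print(is_aws_ip("203.0.113.10", networks))   # documentation address, prints False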
CCBot also respects nofollow directives in robots meta tags.
To use it, add a nofollow robots meta tag to your pages.
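The standard form of that tag, placed in a page’s <head> element, is:

<meta name="robots" content="nofollow">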
Various datasets, including Common Crawl, are potentially utilized by companies that analyze and categorize URLs for advertising purposes.
For instance, Alpha Quantum provides a URL dataset categorized according to the Interactive Advertising Bureau Taxonomy, valuable for AdTech marketing and contextual advertising. Being excluded from such databases might result in publishers missing potential advertising opportunities.
While search engines and Common Crawl provide opt-out mechanisms for crawling, there’s currently no method to remove website content from existing datasets.
Additionally, it appears that research scientists haven’t established protocols allowing website publishers to opt out of their crawling activities.
The article “Is ChatGPT Use Of Web Content Fair?” delves into the ethical implications of utilizing website data without explicit consent or providing an opt-out mechanism.
In the coming months, content publishers might benefit from having greater control over how their digital assets are utilized, particularly by artificial intelligence platforms such as ChatGPT.
The future trajectory of this development remains uncertain at present.