[chore] Update robots.txt

This updates robots.txt based on the list in the ai.robots.txt
repository. We can look at automating that at some point.

It's worth pointing out that some robots, namely the ones by Bytedance,
are known to ignore robots.txt entirely.
Daniele Sluijters 2024-04-21 14:24:48 +02:00
parent b7c629a18a
commit a4a178ec7a
3 changed files with 45 additions and 33 deletions

docs/admin/robots.md (new file, 13 lines added)

@@ -0,0 +1,13 @@
+# Robots.txt
+
+GoToSocial serves a `robots.txt` file on the host domain. This file contains rules that attempt to block known AI scrapers, as well as some other indexers. It also includes rules that keep things like API endpoints from being indexed by search engines, since there is no point in indexing them.
+
+## AI scrapers
+
+The AI scraper rules come from a [community-maintained repository][airobots]. The list is kept in sync manually for the time being. If you know of any missing robots, please send them a PR!
+
+A number of AI scrapers are known to ignore entries in `robots.txt` even when an entry explicitly matches their User-Agent. This means the `robots.txt` file is not a foolproof way of ensuring AI scrapers don't grab your content.
+
+If you want to block these scrapers fully, you'll need to block them based on the User-Agent header in a reverse proxy until GoToSocial can filter requests by User-Agent header.
+
+[airobots]: https://github.com/ai-robots-txt/ai.robots.txt/
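The new docs/admin/robots.md recommends blocking by User-Agent header in a reverse proxy. As a rough illustration of that idea, here is a minimal Go sketch of a handler wrapper that rejects matching requests. It is not part of GoToSocial or of this commit: `blockedAgents`, `isBlocked` and `blockAgents` are hypothetical names, the token list is only an illustrative subset of the robots.txt entries, and substring matching is an assumed policy.

```go
package main

import (
	"net/http"
	"strings"
)

// blockedAgents is an illustrative subset of the User-Agent tokens listed
// in robots.txt. It is an assumption for this sketch, not a complete list.
var blockedAgents = []string{"Bytespider", "GPTBot", "CCBot", "ClaudeBot"}

// isBlocked reports whether the User-Agent header contains any blocked
// token. The comparison is case-insensitive.
func isBlocked(userAgent string) bool {
	ua := strings.ToLower(userAgent)
	for _, token := range blockedAgents {
		if strings.Contains(ua, strings.ToLower(token)) {
			return true
		}
	}
	return false
}

// blockAgents wraps next and answers 403 Forbidden to blocked agents
// instead of forwarding the request.
func blockAgents(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if isBlocked(r.UserAgent()) {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

One way to use this would be to wrap a proxy handler, such as the one returned by `httputil.NewSingleHostReverseProxy`, with `blockAgents` before passing it to `http.ListenAndServe`. Most dedicated reverse proxies offer an equivalent header-matching rule in their own configuration.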


@@ -29,34 +29,37 @@ const (
 robotsTxt = `# GoToSocial robots.txt -- to edit, see internal/web/robots.go
 # More info @ https://developers.google.com/search/docs/crawling-indexing/robots/intro

-# Before we commence, a giant fuck you to ChatGPT in particular.
-# https://platform.openai.com/docs/gptbot
-User-agent: GPTBot
-Disallow: /
-
-# As of September 2023, GPTBot and ChatGPT-User are equivalent. But there's no telling
-# when OpenAI might decide to change that, so block this one too.
-User-agent: ChatGPT-User
-Disallow: /
-
-# And a giant fuck you to Google Bard and their other generative AI ventures too.
-# https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
-User-agent: Google-Extended
-Disallow: /
-
-# Block CommonCrawl. Used in training LLMs and specifically GPT-3.
-# https://commoncrawl.org/faq
+# AI scrapers and the like.
+# https://github.com/ai-robots-txt/ai.robots.txt/
+User-agent: AdsBot-Google
+User-agent: Amazonbot
+User-agent: anthropic-ai
+User-agent: Applebot
+User-agent: AwarioRssBot
+User-agent: AwarioSmartBot
+User-agent: Bytespider
 User-agent: CCBot
-Disallow: /
-
-# Block Omgilike/Webz.io, a "Big Web Data" engine.
-# https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
-User-agent: Omgilibot
-Disallow: /
-
-# Block Faceboobot, because Meta.
-# https://developers.facebook.com/docs/sharing/bot
+User-agent: ChatGPT-User
+User-agent: ClaudeBot
+User-agent: Claude-Web
+User-agent: cohere-ai
+User-agent: DataForSeoBot
 User-agent: FacebookBot
+User-agent: FriendlyCrawler
+User-agent: Google-Extended
+User-agent: GoogleOther
+User-agent: GPTBot
+User-agent: ImagesiftBot
+User-agent: magpie-crawler
+User-agent: Meltwater
+User-agent: omgili
+User-agent: omgilibot
+User-agent: peer39_crawler
+User-agent: peer39_crawler/1.0
+User-agent: PerplexityBot
+User-agent: PiplBot
+User-agent: Seekr
+User-agent: YouBot
 Disallow: /

 # Well-known.dev crawler. Indexes stuff under /.well-known.
@@ -64,11 +67,6 @@ Disallow: /
 User-agent: WellKnownBot
 Disallow: /

-# Block Amazonbot, because Amazon.
-# https://developer.amazon.com/amazonbot
-User-agent: Amazonbot
-Disallow: /
-
 # Rules for everything else.
 User-agent: *
 Crawl-delay: 500
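The commit message mentions possibly automating the sync with the ai.robots.txt list. As a hedged sketch of what the rendering half of that could look like, the Go function below turns a list of agent names into a single robots.txt group: consecutive `User-agent` lines sharing one `Disallow: /`, as in the hunk above. `renderBlockGroup` is a hypothetical name. The sketch makes no assumption about how the upstream list is fetched or stored, and the case-insensitive sort is a choice made for readability.

```go
package main

import (
	"sort"
	"strings"
)

// renderBlockGroup renders one robots.txt group that disallows everything
// for the given agents. It returns an empty string for an empty list,
// because a Disallow line without a User-agent line is not a valid group.
func renderBlockGroup(agents []string) string {
	if len(agents) == 0 {
		return ""
	}

	// Sort a copy case-insensitively so the caller's slice is untouched
	// and the output order is stable.
	sorted := append([]string(nil), agents...)
	sort.Slice(sorted, func(i, j int) bool {
		return strings.ToLower(sorted[i]) < strings.ToLower(sorted[j])
	})

	var b strings.Builder
	for _, agent := range sorted {
		b.WriteString("User-agent: " + agent + "\n")
	}
	b.WriteString("Disallow: /\n")
	return b.String()
}
```

The returned string could then be spliced into the `robotsTxt` constant by a generator, although the commit itself keeps the list hand-maintained.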


@@ -118,6 +118,7 @@ nav:
 - "admin/signups.md"
 - "admin/federation_modes.md"
 - "admin/domain_blocks.md"
+- "admin/robots.md"
 - "admin/cli.md"
 - "admin/backup_and_restore.md"
 - "admin/media_caching.md"