Bots are currently scraping the internet for LLM training data at unprecedented rates[1][2][3], driving up costs and destabilizing public-facing websites. I want to talk about how this has been particularly difficult for wikis, and has gotten much worse in the last few months.
iocaine doesn’t stop them, but it uses minimal resources and makes me feel better about serving pages to them.
It can stop them nowadays, by firewalling some of the crawlers off. The reason it doesn’t stop them by default is that it serves them poisoned URLs, which it can later identify if the crawlers come back riding a headless Chrome. But once they do that, and hit a poisoned URL, there’s little reason to let them wander any further through an endless maze: serve one request, and block the IP.
I’ve been running that on my own infra, and my daily number of requests went down from 50+ million to… 2 million.
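For anyone curious what "serve one request, and block the IP" could look like, here is a rough Python sketch of the idea. It is not iocaine's actual code: the `/maze/` prefix, the HMAC tagging scheme, the `Gate` class, and the in-memory blocklist are all assumptions made up for illustration. A real setup would more likely push the address into a firewall set than keep it in process memory.

```python
import hashlib
import hmac
import secrets

# Hypothetical values, not taken from iocaine.
SECRET = secrets.token_bytes(32)  # per-deployment key used to tag links
PREFIX = "/maze/"                 # path prefix for generated (poisoned) links


def _tag(seed: str) -> str:
    """Short keyed digest that marks a URL as one we generated ourselves."""
    return hmac.new(SECRET, seed.encode(), hashlib.sha256).hexdigest()[:16]


def poisoned_url(seed: str) -> str:
    """Build a link that only appears on pages served to suspected crawlers."""
    return f"{PREFIX}{seed}-{_tag(seed)}"


def is_poisoned(path: str) -> bool:
    """Recognise a poisoned link without storing every URL ever handed out."""
    if not path.startswith(PREFIX):
        return False
    seed, sep, tag = path[len(PREFIX):].rpartition("-")
    if not sep:
        return False
    return hmac.compare_digest(tag.encode(), _tag(seed).encode())


class Gate:
    """Decide what to do with a request, based on client IP and path."""

    def __init__(self) -> None:
        self.blocked: set[str] = set()

    def handle(self, ip: str, path: str) -> str:
        if ip in self.blocked:
            return "drop"           # already caught earlier
        if is_poisoned(path):
            self.blocked.add(ip)    # answer this one request, then block
            return "serve-garbage"
        return "serve-normal"
```

The point of tagging links with a keyed digest is that a hit on a poisoned URL can be recognised even if the client shows up later with a completely different user agent, since no legitimate visitor would ever have seen that link.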
Never heard of it, but I see Anubis pretty widely adopted, especially among open source projects.