TL;DR: the AI boom has driven up the number of bots crawling websites to harvest their content. This has impaired the performance of some of the sites we host for our customers. We’re doing our best to understand the problem and to block the bots wherever we can, including by using services like Cloudflare.
If you’d like to know more, read on. If you are a customer and would like to tell us specifically how you’d like us to handle AI bots harvesting your site, please get in touch.
Over the last several months the websites we host have been subjected to an ever-increasing load of traffic from web crawlers – the traditional, old ones that index your site in search engines like Google or Bing, but also others that use your site’s contents to feed large language models (LLMs) and other AI.
It is of course increasingly difficult to say what is “old” search and what is this supposedly new-fangled “artificial intelligence”. Any regular Google search returns results that are heavily informed by AI-based processes. And if you ask your Alexa device to find something out for you, that is really a search that uses AI at various levels to understand your voice, parse the words you say and use them as a cue to interpret all of the data the “Amazonbot” has gathered from the Web. Generative AI is another thing, but it can be powered by the same large language models.
But people often feel quite differently about the idea that their site will show up in a relevant search (whether in Google, or in a conversation with their Echo), versus the thought that the sum of the knowledge in their website will be used by an artificial agent to create something supposedly “new” (but probably uncredited and quite possibly inaccurate or harmful). I know I see them quite differently.
All of which is to say that it’s hard to know which of these crawlers are ones that we wish to invite in and which we should restrict or repel. Not only are the purposes of these bots multiplying in variety but their number and the load they impose are exploding.
So, from a practical perspective, but also in some vague hope of holding back the tide of copyright theft and moral confusion, we at The Museum Platform need to put a dampener on it.
The question of moral rights
Lots of heritage organisations are in a quandary regarding the use of the content for which they are responsible within the unfamiliar context of LLMs and, in particular, generative AI. Companies working in this area would love to use the cultural and other works of which our clients are guardians as the raw material for “novel” creations in some way, whether this is the production of “art” or of answers to users’ questions.
There are moral questions about whether and when this is appropriate, who should decide, who has a stake in the results (and rewards), and so on. And the legal side will inevitably lag decades behind the ethics, if as a society (or globe) we can ever agree on the ethics anyway.
All of which means it’s hard for a museum to be unreservedly positive about the prospect of all their intellectual assets being hoovered up, to be reused untraceably and unaccountably for any purpose whatsoever by companies who won’t share what they are doing with them – or who even claim they cannot know.
The question of cost and performance
Several of our sites have unfortunately experienced episodes of poor performance over recent months, for which we are truly sorry. Upon investigation, the cause has almost always been that the server hosting them also hosts at least one site that is being hit hard by bots gathering material for AI engines.
When a site is hit by 100 concurrent requests from a bot that is scraping all of its data to feed into the AI machine, we see things slowing down or even stopping altogether. We have seen single websites receive hundreds of gigabytes of traffic in a single month, consisting entirely of HTML – this equates to many millions of pages, and hundreds of times the traffic we’d otherwise expect. This has a financial cost for us, but the worst thing is that hammering the server so hard slows it to a crawl – memory fills up, then the processor maxes out, then everything grinds to a halt and/or falls over. We’re doing what we can to throttle this impact, and to absorb the costs rather than pass them on to our customers.
This affects every site on the server, not just the one that is receiving the unwanted attention. A few of our clients have a server to themselves (sometimes we run several sites for them) which means that they’re only affected if their own sites are “botted”, but that’s hardly consolation when it happens!
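To give a rough sense of scale for the traffic figures above, here is a back-of-envelope calculation in Python. The 200 GB monthly total and the 50 KB average page size are illustrative assumptions chosen for the example, not measurements from our servers.

```python
# Back-of-envelope estimate: how many HTML pages a volume of traffic represents.
# Both input figures below are illustrative assumptions, not measured values.

def estimate_pages(total_bytes: int, avg_page_bytes: int) -> int:
    """Return the approximate number of pages served for a given traffic volume."""
    if avg_page_bytes <= 0:
        raise ValueError("avg_page_bytes must be positive")
    return total_bytes // avg_page_bytes

GB = 10**9
KB = 10**3

monthly_traffic = 200 * GB   # assumed: "hundreds of gigabytes" in one month
average_page = 50 * KB       # assumed: size of one typical HTML page

pages = estimate_pages(monthly_traffic, average_page)
print(f"{pages:,} pages")  # 4,000,000 pages
```

Under these assumptions a single month of bot traffic corresponds to roughly four million page requests; a smaller average page size would push the count higher still.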
What we are doing
There are dozens of AI crawlers out there. Some are supposedly well behaved if you tell them that you don’t want them to take your content; others ignore that and do all they can to get around any barriers. There’s no way we can keep on top of it all but we’re finding tools to help improve the situation. Chief amongst these are:
- Firewall rules on the server. These block requests from known “bad” IP addresses. It’s nearly impossible to keep on top of these, but they can help on a case-by-case basis when we see a specific site or server struggling.
- robots.txt, a file that each website provides to tell bots what content they are welcome to use and what they should leave alone. We have a list of bot names that we’re adding to these files to tell them not to scrape your content. This relies on the bot being a “good citizen”, though – a badly behaved one may simply ignore our request.
- Cloudflare and Cloudways tools. We are now in a position where almost all our sites come through Cloudflare at some point. This incredible service has a wealth of defence mechanisms and is developing ways to block AI bots en masse.
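As a minimal sketch of the robots.txt approach described in the list above, the Python below builds a file that asks a set of named bots to stay away from the whole site, then uses the standard library’s robots.txt parser to check how a well-behaved crawler would interpret it. The bot names are illustrative assumptions for the example only, not the actual list we maintain, and `build_robots_txt` is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Illustrative list only: these user-agent tokens are assumptions for the
# example, not the actual list we maintain.
BLOCKED_AGENTS = ["GPTBot", "CCBot", "AhrefsBot"]

def build_robots_txt(blocked_agents):
    """Build robots.txt text asking each named bot to stay away from the whole site."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    return "\n\n".join(groups) + "\n"

robots_txt = build_robots_txt(BLOCKED_AGENTS)

# Check how a crawler that honours robots.txt would read the file.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "/collection/"))     # False
print(parser.can_fetch("Googlebot", "/collection/"))  # True
```

Bear in mind that this file only expresses a request. A crawler that ignores robots.txt is unaffected by it, which is why the firewall and Cloudflare measures are needed as well.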
Not all these approaches suit all sites, and our response depends on a lot of factors. The main takeaway, though, is that our approach is simple: to try to block all AI bots, for all websites, all of the time. We’re confident that for most of you this is exactly what you’ll want, but if you have any questions or doubts, do please ask. For us this is a question of performance (and cost) for everyone, even putting aside the vexed intellectual property questions.
NB: there is another web crawler out there that isn’t (we think) actually serving an AI engine but which is proving a right pain in the neck. It’s from a company called Ahrefs, which claims to crawl and index trillions (yes, trillions) of links in aid of its SEO work. That is, its job is to help people optimise how their sites appear in search engines. You would have to pay them to use their service, but they don’t pay you to drain your service. We are not going to help them: Ahrefs will be blocked (or disallowed) too.