Place alongside https://news.ycombinator.com/item?id=44962529 "Why are anime catgirls blocking my access to the Linux kernel?". This is why.
AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who is then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public. The public also gets to pay higher electricity bills, because keeping humans warm is not as profitable as a machine that directly converts electricity into stock price rises.
This isn't AI damaging anything. This is corporations damaging things. Same as it ever was. No need for sci-fi non-human persons when legal corporate persons exist. They latch on to whatever big new thing in tech comes along that people don't understand, brand themselves with it, and cause damage trying to make money, even if they mostly fail at it. And most actual humans only ever see or interact with the scammy corporate versions of $techthing, so they come to believe $techthing = corporate behavior.
I'm as far from being an AI enthusiast as anyone can be, but this issue has nothing to do with AI specifically. It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the established conventions (respecting robots.txt, using a proper UA string, rate limiting, whatever). This situation could have easily happened before the AI boom, for different reasons.
But it didn't, and it's happening now, because of AI.
My book discovery website shepherd.com is getting hammered every day by AI crawlers (and crashing often)... my security lists in CloudFlare are ridiculous and the bots are getting smarter.
I wish there were a better way to solve this.
If you're not updating the publicly accessible part of the database often, see if you can put a caching strategy in place and let Cloudflare take the hit.
Yep, all but one page type is heavily cached at multiple levels. We are working to get the rest and improve it further... just annoying as they don't even respect limits.
Put a honeypot link on your site that only robots will hit because it's hidden. Make sure it's not in robots.txt, or ban it in robots.txt if you can. Then set up a rule so that any IP that hits that link gets a one-day ban in your fail2ban or the like.
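A minimal sketch of that idea, assuming an nginx-style combined access log, a hidden path of /do-not-crawl/, and a fail2ban jail named "honeypot" (all of these names are placeholders, not anything from the thread):

```python
# honeypot_ban.py -- ban IPs that request a hidden honeypot URL.
# Assumptions: nginx "combined" access log format, a honeypot path linked
# only via an invisible anchor, and an existing fail2ban jail named "honeypot".
import re
import subprocess

ACCESS_LOG = "/var/log/nginx/access.log"
HONEYPOT_PATH = "/do-not-crawl/"

# First field of a combined-format log line is the client IP; the request
# path is the second token inside the quoted request line.
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def offending_ips(log_path: str) -> set[str]:
    ips = set()
    with open(log_path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = LINE_RE.match(line)
            if m and m.group(2).startswith(HONEYPOT_PATH):
                ips.add(m.group(1))
    return ips

if __name__ == "__main__":
    for ip in sorted(offending_ips(ACCESS_LOG)):
        # Hand the IP to fail2ban; the ban length (e.g. one day) is set in the jail.
        subprocess.run(["fail2ban-client", "set", "honeypot", "banip", ip], check=False)
```

You could run something like this from cron, or skip the script entirely and point a fail2ban filter's failregex at the honeypot path directly; either way, Disallow the path in robots.txt so well-behaved crawlers never trip it.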
This is a feature! If half the internet is nuked and the other half put up fences there is less readily available training data for competitors.
My worst offender for scraping one of my sites was Anthropic. I deployed an AI tar pit (https://news.ycombinator.com/item?id=42725147) to see what it would do with it, and Anthropic's crawler kept scraping it for weeks. Going through the logs, I calculated that I wasted nearly a year of their time in total, because they were crawling in parallel. Other scrapers weren't so persistent.
That's hilarious. I need to set up one of these myself
OpenAI straight up DoSed a site I manage for my in-laws a few months ago.
What is it about? I'm curious what kinds of things people ask that floods sites.
The site is about a particular type of pipeline cleaning (think water/oil pipelines). I am certain that nobody was asking about this particular site, or even the industry it's in, 15,000 times a minute, 24 hours a day.
It's much more likely that their crawler is just garbage and got stuck in some kind of loop requesting my domain.
I suppose users just keep referring to the website in their chats, probably with the search function selected, so before every reply the crawler hits the website.
Don't the companies in the headlines pay big bucks for people working on "AI"?
Maybe they are paying big bucks for people who are actually very bad at their jobs?
Why would the CEOs tolerate that? Do they think it's a profitable/strategic thing to get away with, rather than a sign of incompetence?
When subtrees of the org chart don't care that they are very bad at their jobs, harmed parties might have to sue to get the company to stop.
They mention anubis, cloudflare, robots.txt – does anyone have experiences with how much any of them help?
CDNs like Cloudflare are the best. Anubis is a rate limiter for small websites where you can't or won't use CDNs like Cloudflare. I have used Cloudflare on several medium-sized websites and it works really well.
Anubis's creator says the same thing:
> In most cases, you should not need this and can probably get by using Cloudflare to protect a given origin. However, for circumstances where you can't or won't use Cloudflare, Anubis is there for you.
Source: https://github.com/TecharoHQ/anubis
CloudFlare's Super Bot Fight Mode completely killed the surge in bot traffic for my large forum.
And added captchas to every user with an adblock or sensible privacy settings.
How would you suggest that such users prove they're not a crawler?
Why would they have to?
What's wrong with crawlers? That's how google finds you, and people find you on google.
Just put some sensible request limits per hour per IP, and be done.
Or use CDN caching. That's one of the things they're there for. (A rough sketch of per-IP limiting follows below.)
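For illustration only, here is what a crude per-IP hourly limit might look like as WSGI middleware; the 600-requests-per-hour figure and the in-memory bookkeeping are assumptions for the sketch, not anyone's production setup (in practice nginx's limit_req or a CDN rule does this more cheaply):

```python
# Tiny per-IP sliding-window rate limit as WSGI middleware (illustrative sketch).
import time
from collections import defaultdict, deque

LIMIT = 600      # requests allowed per IP...
WINDOW = 3600    # ...per hour

class RateLimit:
    def __init__(self, app):
        self.app = app
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def __call__(self, environ, start_response):
        ip = environ.get("REMOTE_ADDR", "?")
        now = time.time()
        q = self.hits[ip]
        while q and now - q[0] > WINDOW:   # drop hits that fell out of the window
            q.popleft()
        if len(q) >= LIMIT:
            start_response("429 Too Many Requests", [("Retry-After", str(WINDOW))])
            return [b"rate limit exceeded\n"]
        q.append(now)
        return self.app(environ, start_response)
```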
robots.txt is obviously only effective against well-behaved bots. OpenAI etc are usually well behaved, but there's at least one large network of rogue scraping bots that ignores robots.txt, fakes the user-agent (usually to some old Chrome version) and cycles through millions of different residential proxy IPs. On my own sites, this network is by far the worst offender and the "well-behaved" bots like OpenAI are barely noticeable.
To stop malicious bots like this, Cloudflare is a great solution if you don't mind using it (you can enable a basic browser check for all users and all pages, or write custom rules to only serve a check to certain users or on certain pages). If you're not a fan of Cloudflare, Anubis works well enough for now if you don't mind the branding.
Here's the cloudflare rule I currently use (vast majority of bot traffic originates from these countries):
ip.src.continent in {"AF" "SA"} or
ip.src.country in {"CN" "HK" "SG"} or
ip.src.country in {"AE" "AO" "AR" "AZ" "BD" "BR" "CL" "CO" "DZ" "EC" "EG" "ET" "ID" "IL" "IN" "IQ" "JM" "JO" "KE" "KZ" "LB" "MA" "MX" "NP" "OM" "PE" "PK" "PS" "PY" "SA" "TN" "TR" "TT" "UA" "UY" "UZ" "VE" "VN" "ZA"} or
ip.src.asnum in {28573 45899 55836}
A bit off-topic but wtf is this preview image of a spider in the eye? It’s even worse than the clickbait title of this post. I think this should be considered bad practice.
Isn't there a class action lawsuit coming from all this? I see a bunch of people here indicating these scrapers are costing real money to people who host even small niche sites.
Is the reason these large companies don't care because they are large enough to hide behind a bunch of lawyers?
Under what law? It's interesting because these are sites that host content for the purpose of providing it to anonymous network users. ebay won a case against a scraper back in 2000 by claiming that the server load was harming them, but that reasoning was later overturned because it's difficult to say that server load is actual harm. ebay was in the same condition before and after a scrape.
Maybe some civil lawsuit about terms of service? You'd have to prove that the scraper agreed to the terms of service. Perhaps in the future all CAPTCHAs come with a TOS click-through agreement? Or perhaps every free site will have a login wall?
If you put measures in place to prevent someone from accessing a computer, and they circumvent those measures, is that not a criminal offense in some jurisdictions?
Yes. There is one set of rules for us and another set of rules for anything with more than a billion dollars.
I wonder how much of the rapid expansion of datacenters is from trying to support bot traffic.
In terms of CapEx, not much. The GPUs are much more expensive. Physical footprint? I don't know.
At the same time, it's so practical to ask a question and have it open 25 pages to search and summarize the answer. That's more or less what I used to do by hand. Maybe not 25 websites, since thanks to crap SEO the top 10 results are full of BS content, so I curated the list, but the idea is the same, no?
My personal experience is that OpenAI's crawler was hitting a very, very low-traffic website I manage tens of thousands of times a minute, non-stop. I had to block it via Cloudflare.
Where is caching breaking so badly that this is happening? Are OpenAI failing to use etags or honour cache validity?
Their crawler is vibe-coded.
Same here.
I run a very small browser game (~120 weekly users currently), and until I put its Wiki (utterly uninteresting to anyone who doesn't already play the game) behind a login-wall, the bots were causing massive amounts of spurious traffic. Due to some of the Wiki's data coming live from the game through external data feeds, the deluge of bots actually managed to crash the game several times, necessitating a restart of the MariaDB process.
Wikis seem to attract AI bots like crazy, especially the bad kind that will attempt any type of cache invalidation available to them.
Sure, but if the fetcher is generating "39,000 requests per minute" then surely something has gone wrong somewhere?
Even if it is generating 39k req/minute, I would expect most of the pages to already be cached locally by Meta, or served statically by their respective hosts. We have been working hard on caching websites, and it has been a solved problem for the last decade or so.
Could be serving no-cache headers? Seems like yet another problem stemming from every website being designed as if it were some dynamic application when nearly all of them are static documents. nginx doing 39k req/min to cacheable pages on an n100 is what you might call "98% idle", not "unsustainable load on web servers".
The data transfer, on the other hand, could be substantial and costly. Is it known whether these crawlers respect caching at all? Do they send If-Modified-Since/If-None-Match or anything like that?
Many AI crawlers seem to go to great lengths to avoid caches, not sure why.
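For reference, honouring those validators costs a crawler almost nothing. A hedged sketch of what a cache-respecting fetch looks like (the cache dict and its keys are made up for illustration):

```python
# Conditional re-fetch using ETag / Last-Modified validators: an unchanged
# page comes back as a 304 with no body, so the origin barely does any work.
import urllib.error
import urllib.request

def conditional_get(url: str, cache: dict) -> bytes | None:
    req = urllib.request.Request(url)
    if cache.get("etag"):
        req.add_header("If-None-Match", cache["etag"])
    if cache.get("last_modified"):
        req.add_header("If-Modified-Since", cache["last_modified"])
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            # Remember the validators for next time.
            cache["etag"] = resp.headers.get("ETag", "")
            cache["last_modified"] = resp.headers.get("Last-Modified", "")
            cache["body"] = resp.read()
            return cache["body"]
    except urllib.error.HTTPError as e:
        if e.code == 304:            # not modified: reuse the cached copy
            return cache.get("body")
        raise
```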
They're not very good at web queries; if you expand the thinking box to see what they're searching for, like half of it is nonsense.
e.g. they'll take an entire sentence the user said and put it in quotes for no reason.
Thankfully search engines started ignoring quotes years ago, so it balances out...
I recently, for pretty much the first time ever in 30 years of running websites, had to blanket ban crawlers. I now whitelist a few, but the rest (and all other non-UK visitors) have to pass a Cloudflare challenge [1].
AI crawlers were downloading whole pages and executing all the javascript tens of millions of times a day - hurting performance, filling logs, skewing analytics and costing too much money in Google Maps loads.
Really disappointing.
[1] https://developers.cloudflare.com/cloudflare-challenges/
About 18 months ago, our non-Google/Bing bot traffic went from single-digit percentages to over 99.9% of all traffic. We tried some home-spun solutions at first, but eventually threw in the towel and put Cloudflare in front of all our publicly accessible pages. On a long-term basis this was probably the right move for us, but we felt forced into it. And the Cloudflare Managed Ruleset definitely blocks some legit traffic, such that it requires a fair amount of manual tuning.
Xe Iaso is my spirit animal.
> "I don't know what this actually gives people, but our industry takes great pride in doing this"
> "unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance that can produce output that superficially resembles the output of human employees"
> "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming."
<3 <3
I wonder if we're doing the wrong thing blocking them with invasive tools like cloudflare?
If all you're concerned about is server load, wouldn't it be better to just offer a tar file containing all of your pages that they can download instead? The models are months out of date, so a monthly dump would surely satisfy them. There could even be some coordination for this (rough sketch below).
They're going to crawl anyway. We can either cooperate or turn it into some weird dark market with bad externalities like drugs.
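A monthly dump of a mostly-static site is trivial to produce. A rough sketch, assuming the rendered pages live under ./public (the paths and filename are arbitrary):

```python
# Build a dated tarball of the rendered site so crawlers can take one file
# instead of re-walking every page.
import tarfile
import time
from pathlib import Path

SITE_ROOT = Path("public")                           # rendered static pages
out = Path(f"site-dump-{time.strftime('%Y-%m')}.tar.gz")

with tarfile.open(out, "w:gz") as tar:
    for path in sorted(SITE_ROOT.rglob("*")):
        if path.is_file():
            tar.add(path, arcname=str(path.relative_to(SITE_ROOT)))

print(f"wrote {out} ({out.stat().st_size} bytes)")
```

The missing piece is a convention for advertising where such a dump lives, which is exactly the coordination part.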
There's so much bullshit on the internet how do they make sure they're not training on nonsense?
By paying a pretty penny for non-bullshit data (Scale AI). That and Nvidia are the shovels in this gold rush.
That's making a lot of assumptions about the quality of Scale AI.
Much of it is not training. The LLMs fetch webpages for answering current questions, summarize or translate a page at the user's request etc.
Any bot that answers daily political questions like Grok has many web accesses per prompt.
While it’s true that chatbots fetch information from websites in response to requests, the load from those requests is tiny compared to the volume of requests indexing content to build training corpuses.
The reason is that user requests are similar to other web traffic because they reflect user interest. So those requests will mostly hit content that is already popular, and therefore well-cached.
Corpus-building crawlers do not reflect current user interest and try to hit every URL available. As a result these hit URLs that are mostly uncached. That is a much heavier load.
But surely there aren't thousands of new corpuses built every minute.
Why would the Register point out Meta and OpenAI as the worst offenders? I'm sure they do not continuously build new corpuses every day. It is probably the search function, as mentioned in the top comments.
Is an AI chatbot fetching a web page to answer a prompt a 'web scraping bot'? If there is a user actively prompting the LLM, isn't it more of a user agent? My mental model, even before LLMs, was that a human being present changes a bot into a user agent. I'm curious if others agree.
The Register calls them "fetchers". They still reproduce the content of the original website without the website gaining anything but additional high load.
I'm not sure how many websites are searched and discarded per query. Since it's the remote, proprietary LLM that initiates the search I would hesitate to call them agents. Maybe "fetcher" is the best term.
But they're (generally speaking) not being asked for the contents of one specific webpage, fetching that, and summarizing it for the user.
They're going out and scraping everything, so that when they're asked a question, they can pull a plausible answer from their dataset and summarize the page they found it on.
Even the ones that actively go out and search/scrape in response to queries aren't just scraping a single site. At best, they're scraping some subset of the entire internet that they have tagged as being somehow related to the query. So even if what they present to the user is a summary of a single webpage, that is rarely going to be the product of a single request to that single webpage. That request is going to be just one of many, most of which are entirely fruitless for that specific query: purely extra load for their servers, with no gain whatsoever.
I mean...they don't. That's part of the problem with "AI answers" and such.
This article and the "report" look like a submarine ad for Fastly services. At no point does it mention the human/bot/AI bot ratio, making it useless for any real insights.
I run a symbol server, as in, PDB debug symbol server. Amazon's crawler and a few others love requesting the ever loving shit out of it for no obvious reason. Especially since the files are binaries.
I just set a rate-limit in cloudflare because no legitimate symbol server user will ever be excessive.
I have a simple website consisting solely of static webpages pointing to a bunch of .zip binaries. Nothing dynamic, all highly cacheable. The bots are re-downloading the binaries over and over. I can see Bingbot downloading a .zip file in the logs, and then an hour later another Bingbot instance from a different IP in the same IP range downloading the same .zip file in full. These are files that were uploaded years ago and have never retroactively changed, and don't contain crawlable contents within them (executable code).
Web crawlers have been around for years, but many of the current ones are more indiscriminate and less well behaved.
I'm absolutely pro AI-crawlers. The internet is so polluted with garbage, compliments of marketing. My AI agent should find and give me concise and precise answers.
They just don't need to hammer sites into the ground to do it. This wouldn't be an issue if the AI companies were a bit more respectful of their data sources, but they are not; they don't care.
All this attempting to block AI scrapers would not be an issue if they respected rate limits, knew how to back off when a server starts responding too slowly, or cached frequently visited sites (a minimal sketch of what that backoff might look like follows below). Instead, some of these companies will do everything, including using residential ISPs, to ensure that they can just piledrive the website of some poor dude who's just really into lawnmowers, or the git repo of some open-source developer who just wants to share their work.
Very few people would actually be against AI crawlers if they showed just the tiniest amount of respect, but they don't. I think Drew DeVault said it best: "Please stop externalizing your costs directly into my face"
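To make "knowing how to back off" concrete, here is roughly what it looks like from the crawler side; a hedged sketch with the retry counts and delays picked arbitrarily:

```python
# Fetch with exponential backoff: slow down on 429/5xx and honour Retry-After
# instead of hammering an origin that is clearly struggling.
import time
import urllib.error
import urllib.request

def polite_fetch(url: str, max_tries: int = 5) -> bytes:
    delay = 1.0
    for _ in range(max_tries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 429 or 500 <= e.code < 600:
                retry_after = e.headers.get("Retry-After")
                wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
            else:
                raise                      # other 4xx: give up, don't retry
        except urllib.error.URLError:
            wait = delay                   # network hiccup: retry with backoff
        time.sleep(wait)
        delay = min(delay * 2, 60)         # exponential backoff, capped at a minute
    raise RuntimeError(f"giving up on {url} after {max_tries} attempts")
```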
The second I get hit with bot traffic that makes my server heat up, I just slam some aggressive anti-bot stuff in front of it. Then you, my friend, are getting nothing with your fancy AI agent.
I've never run any public-facing servers, so maybe I'm missing the experience behind your frustration. But my perspective, as a "consumer", is that I want clean answers, like what you'd expect when asking your own employee for information.
so the fancy AI agent will have to get really fancy and mimic human traffic and all is good until the server heats up from all those separate human trafficionados - then what?