Why are anime catgirls blocking my access to the Linux kernel?

(lock.cmpxchg8b.com)

284 points | by taviso 12 hours ago

295 comments

johnklos 3 hours ago
This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
Sure, the people who make the AI scraper bots are going to figure out how to actually do the work. The point is that they hadn't, and this worked for quite a while.
As the botmakers circumvent, new methods of proof-of-notbot will be made available.
It's really as simple as that. If a new method comes out and your site is safe for a month or two, great! That's better than dealing with fifty requests a second, wondering if you can block whole netblocks, and if so, which.
This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
- agwa an hour ago
  It sounds like you're saying that it's not the proof-of-work that's stopping AI scrapers, but the fact that Anubis imposes an unusual flow to load the site.
  If that's true Anubis should just remove the proof-of-work part, so legitimate human visitors don't have to stare at a loading screen for several seconds while their device wastes electricity.
  - amarant 30 minutes ago
    I feel like the future will have this, plus ads displayed while the work is done, so websites can profit while they profit.
  - kaszanka 36 minutes ago
    This is basically what most of the challenge types in go-away (https://git.gammaspectra.live/git/go-away/wiki/Challenges) do.
- psionides an hour ago
  The problem is that 7 + 2 on a submission form only affects people who want to submit something, whereas Anubis affects every user who wants to read something on your site.
- interstice 2 hours ago
  The cost benefit calculus for workarounds changes based on popularity. Your custom lock might be easy to break by a professional, but the handful of people who might ever care to pick it are unlikely to be trying that hard. A lock which lets you into 5% of houses however might be worth learning to break.
- cakealert 2 hours ago
  This arms race will have a terminus. The bots will eventually be indistinguishable from humans. Some already are.
  - overfeed 2 hours ago
    > The bots will eventually be indistinguishable from humans
    Not until they get issued government IDs they won't!
    Extrapolating from current trends, some form of online ID attestation (likely based on government-issued ID[1]) will become normal in the next decade, and naturally, this will be included in the anti-bot arsenal. It will be up to the site operator to trust identities signed by the Russian government.
    1. Despite what Sam Altman's eyeball company will try to sell you, government registers will always be the anchor of trust for proof-of-identity, they've been doing it for centuries and have become good at it and have earned the goodwill.
    - marcus_holmes an hour ago
      How does this work, though?
      We can't just have "send me a picture of your ID" because that is pointlessly easy to spoof - just copy someone else's ID.
      So there must be some verification that you, the person at the keyboard, are the same person that the ID identifies. The UK is rapidly finding out that that is extremely difficult to do reliably. Video doesn't really work reliably in all cases, and still images are too easily spoofed. It's not really surprising, though, because identifying humans reliably is hard even for humans.
      We could do it at the network level, for example by assigning a government-issued network connection to a specific individual, so the system knows that any traffic from a given IP address belongs to that individual. There are obvious problems with this model, not least that IP addresses were never designed for this, and spoofing an IP becomes identity theft.
      We also do need bot access for things, so there must be some method of granting access to bots.
      I think that to make this work, we'd need to re-architect the internet from the ground up. To get there, I don't think we can start from here.
      - IncRnd 3 minutes ago
        That is already solved by governments and businesses. If you have recently attempted to log into a US government website, you were probably told that you need Login.gov or ID.me. ID.me verifies identity via driver’s license, passport, Social Security number—and often requires users to take a video selfie, matched against uploaded ID images. If automated checks fail, a “Trusted Referee” video call is offered.
        If you think this sounds suspiciously close to what businesses do with KYC, Know Your Customer, you're correct!
      - tern 43 minutes ago
        If you're really curious about this, there's a place where people discuss these problems annually: https://internetidentityworkshop.com/
        Various things you're not thinking of:
        - "The person at the keyboard, is the same person as that ID identifies" is a high expectation, and can probably be avoided—you just need verifiable credentials and you gotta trust they're not spoofed
        - Many official government IDs are digital now
        - Most architectures for solving this problem involve bundling multiple identity "attestations," so proof of personhood would ultimately be a gradient. (This does, admittedly, seem complicated though ... but World is already doing it, and there are many examples of services where providing additional information confers additional trust. Blue checkmarks to name the most obvious one.)
        As for what it might look like to start from the ground up and solve this problem, https://urbit.org/, for all its flaws, is the only serious attempt I know of and proves it's possible in principle, though perhaps not in practice
      - xlbuttplug2 41 minutes ago
        IDs would have to be reissued with a public/private key model you can use to sign your requests.
        > the person at the keyboard, is the same person as that ID identifies
        This won't be possible to verify - you could lend your ID out to bots but that would come at the risk of being detected and blanket banned from the internet.
    - xlbuttplug2 27 minutes ago
      The internet would come to a grinding halt as everyone would suddenly become mindful of their browsing. It's not hard to imagine a situation where, say, pornhub sells its access data and the next day you get sacked at your teaching job.
      - chmod775 8 minutes ago
        It doesn't need to. Thanks to asymmetric cryptography governments can in theory provide you with a way to prove you are a human (or of a certain age) without:
        1. the government knowing who you are authenticating yourself to
        2. or the recipient learning anything but the fact that you are a human
        The EU is trying to build such a scheme for online age verification.
        cakealert a minute ago
        Such schemes have the fatal flaw that they can be trivially abused. All you need are a couple of stolen/sold identities and bots start proving their humanness and adultness to everyone.
        ummonk 2 minutes ago
        How would it prevent you from renting your identity out to a bot farm?
      - glandium 11 minutes ago
        > sells its access data
        or has it leaked somehow.
    - tern an hour ago
      Eyeball company play is to be a general identity provider, which is an obvious move for anyone who tries to fill this gap. You can already connect your passport in the World app.
      https://world.org/blog/announcements/new-world-id-passport-c...
    - bhawks an hour ago
      Can't wait to sign into my web browser with my driver's license.
      - overfeed an hour ago
        In all likelihood, most people will do so via the Apple Wallet (or the equivalent on their non-Apple devices). It's going to be painful to use Open source OSes for a while, thanks to CloudFlare and Anubis. This is not the future I want, but we can't have nice things.
    - nikau an hour ago
      Can't wait to start my stolen id as a service for the botnets
    - xenotux an hour ago
      Eh? With the "anonymous" models that we're pushing for right now, nothing stops you from handing over your verification token (or the control of your browser) to a robot for a fee. The token issued by the verifier just says "yep, that's an adult human", not "this is John Doe, living at 123 Main St, Somewhere, USA". If it's burned, you can get a new one.
      If we move to a model where the token is permanently tied to your identity, there might be an incentive for you not to risk your token being added to a blocklist. But there's no shortage of people who need a bit of extra cash and for whom it's not a bad trade. So there will be a nearly-endless supply of "burner" tokens for use by trolls, scammers, evil crawlers, etc.
      - kelvinjps10 18 minutes ago
        If it's illegal that person could face legal consequences
  - neumann 2 hours ago
    It will be hard to tune them to be just the right level of ignorant and slow as us though!
    - cwmoore an hour ago
      Soon enough there will be competing Unicode characters that can remove exclamation points.
- Aurornis 2 hours ago
  > This is a usually technical crowd, so I can't help but wonder if many people genuinely don't get it, or if they are just feigning a lack of understanding to be dismissive of Anubis.
  This is a confusing comment because it appears you don’t understand the well-written critique in the linked blog post.
  > This is like those simple things on submission forms that ask you what 7 + 2 is. Of course everyone knows that a crawler can calculate that! But it takes a human some time and work to tell the crawler HOW.
  The key point in the blog post is that it’s the inverse of a CAPTCHA: The proof of work requirement is solved by the computer automatically.
  You don’t have to teach a computer how to solve this proof of work because it’s designed for the computer to solve the proof of work.
  It makes the crawling process more expensive because it has to actually run scripts on the page (or hardcode a workaround for specific versions) but from a computational perspective that’s actually easier and far more deterministic than trying to have AI solve visual CAPTCHA challenges.
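For readers who have not seen one, a hashcash-style proof of work of the kind discussed here can be sketched in a few lines. This is a minimal illustration, not Anubis's actual code: the function names, the challenge string format, and the choice of counting leading hex zeros are assumptions made for the example.

```python
import hashlib
from itertools import count


def solve(challenge: str, difficulty: int) -> int:
    """Return the first nonce whose SHA-256 hex digest starts with `difficulty` zeros."""
    prefix = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(prefix):
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Check a claimed solution with a single hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

Solving takes on the order of 16^difficulty hash attempts while verifying takes one, and neither side needs a human in the loop, which is the "inverse of a CAPTCHA" point made above.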
- odo1242 3 hours ago
  Also, it forces the crawler to gain code execution capabilities, which for many companies will just make them give up and scrape someone else.
- wat10000 2 hours ago
  Technical people are prone to black-and-white thinking, which makes it hard to understand that making something more difficult will cause people to do it less even though it’s still possible.
  - mattnewton 17 minutes ago
    I think the argument on offer is more, this juice isn't worth the squeeze. Each user is being slowed down and annoyed for something that bots will trivially bypass if they become aware of it.
    - wat10000 14 minutes ago
      If they become aware of it and actually think it’s worthwhile. Malicious bots work by scaling, and implementing special cases for every random web site doesn’t scale. And it’s likely they never even notice.
- tptacek 2 hours ago
  Respectfully, I think it's you missing the point here. None of this is to say you shouldn't use Anubis, but Tavis Ormandy is offering a computer science critique of how it purports to function. You don't have to care about computer science in this instance! But you can't dismiss it because it's computer science.
  Consider:
  An adaptive password hash like bcrypt or Argon2 uses a work function to apply asymmetric costs to adversaries (attackers who don't know the real password). Both users and attackers have to apply the work function, but the user gets ~constant value for it (they know the password, so to a first approx. they only have to call it once). Attackers have to iterate the function, potentially indefinitely, in the limit obtaining 0 reward for infinite cost.
  A blockchain cryptocurrency uses a work function principally as a synchronization mechanism. The work function itself doesn't have a meaningfully separate adversary. Everyone obtains the same value (the expected value of attempting to solve the next round of the block commitment puzzle) for each application of the work function. And note in this scenario most of the value returned from the work function goes to a small, centralized group of highly-capitalized specialists.
  A proof-of-work-based antiabuse system wants to function the way a password hash functions. You want to define an adversary and then find a way to incur asymmetric costs on them, so that the adversary gets minimal value compared to legitimate users.
  And this is in fact how proof-of-work-based antispam systems function: the value of sending a single spam message is so low that the EV of applying the work function is negative.
  But here we're talking about a system where legitimate users (human browsers) and scrapers get the same value for every application of the work function. The cost:value ratio is unchanged; it's just that everything is more expensive for everybody. You're getting the worst of both worlds: user-visible costs and a system that favors large centralized well-capitalized clients.
  There are antiabuse systems that do incur asymmetric costs on automated users. Youtube had (has?) one. Rather than simply attaching a constant extra cost for every request, it instead delivered a VM (through JS) to browsers, and programs for that VM. The VM and its programs were deliberately hard to reverse, and changed regularly. Part of their purpose was to verify, through a bunch of fussy side channels, that they were actually running on real browsers. Every time Youtube changed the VM, the bots had to do large amounts of new reversing work to keep up, but normal users didn't.
  This is also how the Blu-Ray BD+ system worked.
  The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
  The problem with "this is good because none of the scrapers even bother to do this POW yet" is that you don't need an annoying POW to get that value! You could just write a mildly complicated Javascript function, or do an automated captcha.
  - xena 2 hours ago
    For what it's worth, kernel.org seems to be running an old version of Anubis that predates the current challenge generation method. Previously it took information about the user request, hashed it, and then relied on that being idempotent to avoid having to store state. This didn't scale and was prone to issues like in the OP.
    The modern version of Anubis as of PR https://github.com/TecharoHQ/anubis/pull/749 uses a different flow. Minting a challenge generates state including 64 bytes of random data. This random data is sent to the client and used on the server side in order to validate challenge solutions.
    The core problem here is that kernel.org isn't upgrading their version of Anubis as it's released. I suspect this means they're also vulnerable to GHSA-jhjj-2g64-px7c.
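The stateful flow described above (mint a challenge, keep 64 bytes of random data on the server, validate the solution against it) might look roughly like the following. This is a sketch under assumptions, not the code from the linked PR: the in-memory dict, the function names, the single-use behaviour, and the SHA-256 leading-zeros check are all hypothetical.

```python
import hashlib
import secrets

# Hypothetical in-memory store; a real deployment would need expiry and shared storage.
_challenges: dict[str, bytes] = {}


def mint_challenge() -> tuple[str, str]:
    """Create server-side state and return (challenge id, random data as hex) for the client."""
    challenge_id = secrets.token_hex(16)
    random_data = secrets.token_bytes(64)
    _challenges[challenge_id] = random_data
    return challenge_id, random_data.hex()


def validate(challenge_id: str, nonce: int, difficulty: int) -> bool:
    """Validate a solution against the stored random data; each challenge is usable once."""
    random_data = _challenges.pop(challenge_id, None)
    if random_data is None:
        return False  # unknown or already consumed challenge
    payload = random_data.hex().encode() + str(nonce).encode()
    return hashlib.sha256(payload).hexdigest().startswith("0" * difficulty)
```

Because the random data never derives from the request itself, a solution for one minted challenge says nothing about any other, at the price of the server having to remember what it issued.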
    - tptacek 2 hours ago
      Right, I get that. I'm just saying that over the long term, you're going to have to find asymmetric costs to apply to scrapers, or it's not going to work. I'm not criticizing any specific implementation detail of your current system. It's good to have a place to take it!
      I think that's the valuable observation in this post. Tavis can tell me I'm wrong. :)
  - sugarpimpdorsey 2 hours ago
    A lot of these passive types of anti-abuse systems rely on the rather bold assumption that making a bot perform a computation is expensive, but isn't for me as an ordinary user.
    According to whom or what data exactly?
    AI operators are clearly well-funded operations and the amount of electricity and CPU power is negligible. Software like Anubis and nearly all its identical predecessors grant you access after a single "proof". So you then have free rein to scrape the whole site.
    The best physical analogy is those shopping cart things where you have to insert a quarter to unlock the cart, and you presumably get it back when you return the cart.
    The group of people this doesn't affect are the well-funded, a quarter is a small price to pay for leaving your cart in the middle of the parking lot.
    Those that suffer the most are the ones that can't find a quarter in the cupholder so you're stuck filling your arms with groceries.
    Would you be richer if they didn't charge you a quarter? (For these anti-bot tools you're paying the electric company, not the site owner.) Maybe. But if you're Scrooge McDuck who is counting?
    - tptacek an hour ago
      Right, that's the point of the article. If you can tune asymmetric costs on bots/scrapers, it doesn't matter: you can drive bot costs to infinity without doing so for users. But if everyone's on a level playing field, POW is problematic.
  - seba_dos1 an hour ago
    > The term of art for these systems is "content protection", which is what I think Anubis actually wants to be, but really isn't (yet?).
    No, that's missing the point. Anubis is effectively a DDoS protection system; all the talk about AI bots comes from the fact that the latest wave of DDoS attacks was initiated by AI scrapers, whether intentionally or not.
    If these bots would clone git repos instead of unleashing the hordes of dumbest bots on Earth pretending to be thousands and thousands of users browsing through git blame web UI, there would be no need for Anubis.
    - tptacek 43 minutes ago
      I'm not moralizing, I'm talking about whether it can work. If it's your site, you don't need to justify putting anything in front of it.
      - seba_dos1 37 minutes ago
        Did you accidentally reply to the wrong comment? (not trying to be snarky, just confused)
        The only "justification" there would be is that it keeps the server online that struggled under load before deploying it. That's the whole reason why major FLOSS projects and code forges have deployed Anubis. Nobody cares about bots downloading FLOSS code or kernel mailing lists archives; they care about keeping their infrastructure running and whether it's being DDoSed or not.
        tptacek 12 minutes ago
        I just said you didn't have to justify it. I don't care why you run it. Run whatever you want. The point of the post is that regardless of your reasons for running it, it's unlikely to work in the long run.
        seba_dos1 9 minutes ago
        And what I said is that all these most visible deployments of Anubis did not deploy it to be a content protection system of any kind, so it doesn't have to work this way at all for them. As long as the server doesn't struggle with load anymore after deploying Anubis, it's a win - and it works so far.
        (and frankly, it likely will only need to work until the bubble bursts, making "the long run" irrelevant)
  - akoboldfrying an hour ago
    The (almost only?) distinguishing factor between genuine users and bots is the total volume of requests, but this can still be used for asymmetric costs. If botPain > botPainThreshold and humanPain < humanPainThreshold then Anubis is working as intended. A key point is that those inequalities look different at the next level of detail. A very rough model might be:
    botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond
    humanPain = c_1 * max(elapsedTimePerRequest) + c_2 * avg(elapsedTimePerRequest)
    The article points out that the botPain Anubis currently generates is unfortunately much too low to hit any realistic threshold. But if the cost model I've suggested above is in any way realistic, then useful improvements would include:
    1. More frequent but less taxing computation demands (this assumes c_1 >> c_2)
    2. Parallel computation (this improves the human experience with no effect for bots)
    ETA: Concretely, regarding (1), I would tolerate 500ms lag on every page load (meaning forget about the 7-day cookie), and wouldn't notice 250ms.
    - tptacek 41 minutes ago
      That's exactly what I'm saying isn't happening: the user pays some cost C per article, and the bot pays exactly the same cost C. Both obtain the same reward. That's not how Hashcash works.
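The rough botPain / humanPain model from akoboldfrying's comment can be written out directly. This is only a sketch of that comment's formulas: the weights c_1 and c_2, the 20-page session, and the price per CPU-second are made-up numbers chosen for illustration.

```python
def bot_pain(n_bot_requests: int, cpu_seconds_per_request: float,
             dollars_per_cpu_second: float) -> float:
    """botPain = nBotRequests * cpuWorkPerRequest * dollarsPerCpuSecond."""
    return n_bot_requests * cpu_seconds_per_request * dollars_per_cpu_second


def human_pain(elapsed_seconds: list[float], c_1: float, c_2: float) -> float:
    """humanPain = c_1 * max(elapsedTimePerRequest) + c_2 * avg(elapsedTimePerRequest)."""
    worst = max(elapsed_seconds)
    mean = sum(elapsed_seconds) / len(elapsed_seconds)
    return c_1 * worst + c_2 * mean


# Hypothetical session of 20 page loads carrying the same 5 s of total work.
one_big = [5.0] + [0.0] * 19   # one long challenge, then a cookie
many_small = [0.25] * 20       # a short challenge on every load
C_1, C_2 = 10.0, 1.0           # assumed weights with c_1 >> c_2

big_pain = human_pain(one_big, C_1, C_2)
small_pain = human_pain(many_small, C_1, C_2)
```

With these assumed weights the human cost drops when the same total work is spread across requests, which is improvement (1) in the list above. The sketch does not answer the objection in the reply: per request, the bot and the human still pay the same cost.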
Arnavion 7 hours ago
>This dance to get access is just a minor annoyance for me, but I question how it proves I’m not a bot. These steps can be trivially and cheaply automated.
>I think the end result is just an internet resource I need is a little harder to access, and we have to waste a small amount of energy.
No need to mimic the actual challenge process. Just change your user agent to not have "Mozilla" in it; Anubis only serves you the challenge if it has that. For myself I just made a sideloaded browser extension to override the UA header for the handful of websites I visit that use Anubis, including those two kernel.org domains.
(Why do I do it? For most of them I don't enable JS or cookies, so the challenge wouldn't pass anyway. For the ones that I do enable JS or cookies for, various self-hosted GitLab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
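The behaviour described above, where only clients whose User-Agent contains "Mozilla" are challenged, amounts to a check like the one below. This is a sketch of the rule as the comment describes it, not Anubis's real policy engine; the function name and the sample strings are assumptions for illustration.

```python
def needs_challenge(user_agent: str) -> bool:
    """Challenge only clients that present a browser-like User-Agent."""
    return "Mozilla" in user_agent


# Sample User-Agent strings (illustrative values).
FIREFOX_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
CURL_UA = "curl/8.5.0"
OVERRIDE_UA = "anubis is crap"

challenged = [ua for ua in (FIREFOX_UA, CURL_UA, OVERRIDE_UA) if needs_challenge(ua)]
```

Under this rule only the browser-like string is challenged, so an overridden header or a plain command-line client passes straight through.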
- johnecheck 5 hours ago
  Sadly, touching the user-agent header more or less instantly makes you uniquely identifiable.
  Browser fingerprinting works best against people with unique headers. There's probably millions of people using an untouched Safari on iPhone. Once you touch your user-agent header, you're likely the only person in the world with that fingerprint.
  - sillywabbit 3 hours ago
    If someone's out to uniquely identify your activity on the internet, your User-Agent string is going to be the least of your problems.
    - _def an hour ago
      Not sure what you mean, as exactly this is happening currently on 99% of the web. Brought to you by: ads
      - amusingimpala75 an hour ago
        I think what they meant is: there’s already so many other ways to fingerprint (say, canvas) that a common user agent doesn’t significantly help you
  - Arnavion 5 hours ago
    UA fingerprinting isn't a problem for me. As I said I only modify the UA for the handful of sites that use Anubis that I visit. I trust those sites enough that them fingerprinting me is unlikely, and won't be a problem even if they did.
  - codedokode 4 hours ago
    If your headers are new every time then it is very difficult to figure out who is who.
    - spoaceman7777 4 hours ago
      yes, but it puts you in the incredibly small bucket of "users that have weird headers that don't mesh well", and makes using the rest of the (many) other fingerprinting techniques all the more accurate.
    - kelseydh 3 hours ago
      It is very easy unless the IP address is also switching up.
  - NoMoreNicksLeft 5 hours ago
    I'll set mine to "null" if the rest of you will set yours...
    - gabeio 2 hours ago
      The string “null” or actually null? I have recently seen a huge amount of bot traffic which has actually no UA and just outright block it. It’s almost entirely (Microsoft cloud) Azure script attacks.
  - andrewmcwatters 4 hours ago
    Yes, but you can take the bet, and win more often than not, that your adversary is most likely not tracking visitor probabilities if you can detect that they aren't using a major fingerprinting provider.
  - jagged-chisel 5 hours ago
    I wouldn’t think the intention is to s/Mozilla// but to select another well-known UA string.
    - Arnavion 5 hours ago
      The string I use in my extension is "anubis is crap". I took it from a different FF extension that had been posted in a /g/ thread about Anubis, which is where I got the idea from in the first place. I don't use other people's extensions if I can help it (because of the obvious risk), but I figured I'd use the same string in my own extension so as to be combined with users of that extension for the sake of user-agent statistics.
      - CursedSilicon 5 hours ago
        It's a bit telling that you "don't use extensions if you can help it" but trust advice from a 4chan board
        Arnavion 5 hours ago
        It's also a bit telling that you read the phrase "I took it from a different FF extension that had been posted" and interpreted it as taking advice instead of reading source code.
        username135 3 hours ago
        4chan, the world's greatest hacker
    - soulofmischief 5 hours ago
      The UA will be compared to other data points such as screen resolution, fonts, plugins, etc. which means that you are definitely more identifiable if you change just the UA vs changing your entire browser or operating system.
    - throwawayffffas 4 hours ago
      I don't think there are any.
      Because servers would serve different content based on user agent, virtually all browsers' user-agent strings start with Mozilla/5.0...
      - extraduder_ire 3 hours ago
        curl, wget, lynx, and elinks all don't by default (I checked). Mainstream web browsers likely all do, and will forever.
- Animats 6 hours ago
  > (Why do I do it? For most of them I don't enable JS so the challenge wouldn't pass anyway. For the ones that I do enable JS for, various self-hosted gitlab instances, I don't consent to my electricity being used for this any more than if it was mining Monero or something.)
  Hm. If your site is "sticky", can it mine Monero or something in the background?
  We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
  - mikestew 6 hours ago
    > We need a browser warning: "This site is using your computer heavily in a background task. Do you want to stop that?"
    Doesn't Safari sort of already do that? "This tab is using significant power", or summat? I know I've seen that message, I just don't have a good repro.
    - qualeed 6 hours ago
      Edge does, as well. It drops a warning in the middle of the screen, displays the resource-hogging tab, and asks whether you want to force-close the tab or wait.
- zahlman 6 hours ago
  > Just change your user agent to not have "Mozilla" in it. Anubis only serves you the challenge if you have that.
  Won't that break many other things? My understanding was that basically everyone's user-agent string nowadays is packed with a full suite of standard lies.
  - Arnavion 6 hours ago
    It doesn't break the two kernel.org domains that the article is about, nor any of the others I use. At least not in a way that I noticed.
  - throwawayffffas 4 hours ago
    In 2025 I think most of the web has moved on from checking user-agent strings. Your bank might still do it but they won't be running Anubis.
- danieltanfh95 an hour ago
  wtf? how is this then better than a captcha or something similar?!
userbinator an hour ago
As I've been saying for a while now - if you want to filter for only humans, ask questions only a human can easily answer; counting the number of letters in a word seems to be a good way to filter out LLMs, for example. Yes, that can be relatively easily gotten around, just like Anubis, but with the benefit that it doesn't filter out humans and has absolutely minimal system requirements (a browser that can submit HTML forms), possibly even less than the site itself.
There are forums which ask domain-specific questions as a CAPTCHA upon attempting to register an account, and as someone who has employed such a method, it is very effective. (Example: what nominal diameter is the intake valve stem on a 1954 Buick Nailhead?)
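A letter-counting question of the kind suggested above needs nothing more than an HTML form on the client and a few lines on the server. The sketch below is one hypothetical way to generate and check such a question; the word list and function names are invented for the example, and nothing here is taken from an existing forum package.

```python
import random

# Hypothetical word list; a real site would use something larger or domain-specific.
WORDS = ["kernel", "scraper", "jackal", "browser"]


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of `letter` in `word`."""
    return word.lower().count(letter.lower())


def make_question(rng: random.Random) -> tuple[str, int]:
    """Pick a word and one of its letters; return the question text and the expected answer."""
    word = rng.choice(WORDS)
    letter = rng.choice(word)
    question = f"How many times does the letter '{letter}' appear in '{word}'?"
    return question, count_letter(word, letter)


def check_answer(expected: int, submitted: str) -> bool:
    """Accept the form field only if it parses as the expected integer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        return False
```

As the comment concedes, a scraper author who notices the form can script around it easily, so this only filters clients nobody has bothered to adapt.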
- cm2012 14 minutes ago
  There is a decent segment of the population that will have a hard time with that.
rootsudo 7 hours ago
As soon as I read it, I knew it was Anubis. I hope the anime catgirls never disappear from that project :)
- bawolff 3 hours ago
  It's nice to see there is still some whimsy on the internet.
  Everything got so corporate and sterile.
- ghssds 3 hours ago
  As Anubis the Egyptian god is represented as a dog-headed human, I thought the drawing was of a dog-girl.
  - nemomarx 3 hours ago
    Perhaps a jackal girl? I guess "cat girl" gets used very broadly to mean kemomimi (pardon the spelling) though
    - m4rtink 3 hours ago
      kemono == animal
      mimi == ears
- Der_Einzige an hour ago
  It's not the only project with an anime girl as its mascot.
  ComfyUI has what I think is a foxgirl as its official mascot, and that's the de-facto primary UI for generating Stable Diffusion or related content.
- NelsonMinar 3 hours ago
  ¡Nyah!
- bakugo 6 hours ago
  It's more likely that the project itself will disappear into irrelevance as soon as AI scrapers bother implementing the PoW (which is trivial for them, as the post explains) or figure out that they can simply remove "Mozilla" from their user-agent to bypass it entirely.
  - debugnik 6 hours ago
    > as AI scrapers bother implementing the PoW
    That's what it's for, isn't it? Make crawling slower and more expensive. Shitty crawlers not being able to run the PoW efficiently or at all is just a plus. Although:
    > which is trivial for them, as the post explains
    Sadly the site's being hugged to death right now so I can't really tell if I'm missing part of your argument here.
    > figure out that they can simply remove "Mozilla" from their user-agent
    And flag themselves in the logs to get separately blocked or rate limited. Servers win if malicious bots identify themselves again, and forcing them to change the user agent does that.
    - throwawayffffas 4 hours ago
      > That's what it's for, isn't it? Make crawling slower and more expensive.
      The default settings produce a computational cost of milliseconds for a week of access. For this to be relevant it would have to be significantly more expensive to the point it would interfere with human access.
      - seba_dos1 2 hours ago
        ...unless you're sus, then the difficulty increases. And if you unleash a single scraping bot, you're not a problem anyway. It's for botnets of thousands, mimicking browsers on residential connections to make them hard to filter out or rate limit, effectively DDoSing the server.
        Perhaps you just don't realize how much the scraping load has increased in the last 2 years or so. If your server can stay up after deploying Anubis, you've already won.
    - dcminter 5 hours ago
      > Sadly the site's being hugged to death right now
      Luckily someone had already captured an archive snapshot: https://archive.ph/BSh1l
    - shkkmo 5 hours ago
      The explanation of how the estimate is made is more detailed, but here is the referenced conclusion:
      >> So (11508 websites * 2^16 sha256 operations) / 2^21, that’s about 6 minutes to mine enough tokens for every single Anubis deployment in the world. That means the cost of unrestricted crawler access to the internet for a week is approximately $0.
      >> In fact, I don’t think we reach a single cent per month in compute costs until several million sites have deployed Anubis.
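The quoted estimate is easy to re-run. The sketch below reads the 2^21 divisor as a hash rate in SHA-256 operations per second, which is an assumption, though it is the reading that reproduces the "about 6 minutes" figure; the variable names are invented for the example.

```python
SITES = 11508                    # Anubis deployments counted in the quoted post
HASHES_PER_CHALLENGE = 2 ** 16   # SHA-256 operations per challenge, per the quote
HASHES_PER_SECOND = 2 ** 21      # assumed to be the hash rate behind the divisor

seconds = SITES * HASHES_PER_CHALLENGE / HASHES_PER_SECOND
minutes = seconds / 60

print(f"{seconds:.1f} s, roughly {minutes:.1f} min to solve one challenge per site")
```

Since 2^16 / 2^21 is 1/32, the whole calculation is 11508 / 32 seconds, so each deployment adds about a thirty-second of a second of compute at that hash rate.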
      - hiccuphippo 5 hours ago
        Wasn't sha256 designed to be very fast to generate? They should be using bcrypt or something similar.
        throwawayffffas 4 hours ago
        Unless they require a new token for each new request or every x minutes or something it won't matter.
        And as the poster mentioned if you are running an AI model you probably have GPUs to spare. Unlike the dev working from a 5 year old Thinkpad or their phone.
      - debugnik 5 hours ago ago
        That's a matter of increasing the difficulty isn't it? And if the added cost is really negligible, we can just switch to a "refresh" challenge for the same added latency and without burning energy for no reason.
        [-]
        Retr0id 5 hours ago ago
        If you increase the difficulty much beyond what it currently is, legitimate users end up having to wait for ages.
        therein 5 hours ago ago
        I am guessing you don't realize that that means people using not the latest generation phones will suffer.
  - skydhash 6 hours ago ago
    It's more about the (intentional?) DDoS from AI scrapers than about preventing them from accessing the content. Bandwidth is not cheap.
  - unclad5968 5 hours ago ago
    I'm not on Firefox or any Firefox derivative and I still get anime cat girls making sure I'm not a bot.
    [-]
    - nemomarx 5 hours ago ago
      Mozilla is used in the user agent string of all major browsers for historical reasons, but not necessarily headless ones or so on.
      [-]
      - unclad5968 4 hours ago ago
        Oh that's interesting, I had no idea.
        [-]
        seabrookmx 2 hours ago ago
        There are some sites[1] that can print your user agent for you. Try it in a few different browsers and you will be surprised. They're honestly unhinged. I have no idea why we still use this header in 2025!
        [1]: https://dnschecker.org/user-agent-info.php
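        To illustrate the point about "Mozilla" in user agent strings, here is a small sketch. The strings follow the formats these clients are known to send, but the version numbers are illustrative, and `looks_like_browser` is a hypothetical filter that assumes, as the thread suggests, that a challenge is only shown when the User-Agent contains "Mozilla".

```python
# Example User-Agent strings; exact version numbers are illustrative.
USER_AGENTS = {
    "Chrome": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Firefox": "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
    "Safari": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
              "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "curl": "curl/8.4.0",
    "python-requests": "python-requests/2.31.0",
}


def looks_like_browser(user_agent: str) -> bool:
    """Hypothetical filter: treat any UA containing 'Mozilla' as browser-like."""
    return "Mozilla" in user_agent


for name, ua in USER_AGENTS.items():
    print(f"{name:16} challenged={looks_like_browser(ua)}")
```

        All three browsers match even though only one of them is made by Mozilla, while the plain HTTP clients do not.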
  - dingnuts 6 hours ago ago
    PoW increases the cost for the bots which is great. Trivial to implement, sure, but that added cost will add up quickly.
    Anyway, then we'll move on to tarpits using traditional methods to cheaply generate real enough looking content that the data becomes worthless.
    Fuck AI scrapers, and fuck all this copyright infringement at scale. If it was illegal for Aaron Swartz it's definitely illegal for Sam Altman.
    Frankly, most of these scrapers are in violation of the CFAA as well, a federal crime.
    [-]
    - verteu 5 hours ago ago
      > PoW increases the cost for the bots which is great. Trivial to implement, sure, but that added cost will add up quickly.
      No, the article estimates it would cost less than a single penny to scrape all pages of 1,000,000 distinct Anubis-guarded websites for an entire month.
      [-]
      - thunderfork 5 hours ago ago
        Once you've built the system that lets you do that, maybe. You still have to do that, though, so it's still raising the cost floor.
        [-]
        vmttmv 3 hours ago ago
        but... how? when the author ran the numbers, the rough estimate is solving the challenges at a rate of 10000/5 min, on a single instance of the free tier of google compute. that is an insignificant load at an even more insignificant cost.
    - altairprime 5 hours ago ago
      Don’t forget signed attestations from “user probably has skin in the game” cloud providers like iCloud (already live in Safari and accepted by Cloudflare, iirc?) — not because they identify you but because abusive behavior will trigger attestation provider rate limiting and termination of services (which, in Apple’s case, includes potentially a console kill for the associated hardware). It’s not very popular to discuss at HN but I bet Anubis could add support for it regardless :)
      https://datatracker.ietf.org/wg/privacypass/about/
      https://www.w3.org/TR/vc-overview/
    - shkkmo 5 hours ago ago
      > PoW increases the cost for the bots which is great.
      But not by any meaningful amount as explained in the article. All it actually does is rely on its obscurity while interfering with legitimate use.
    - nialv7 5 hours ago ago
      > Fuck AI scrapers, and fuck all this copyright infringement at scale.
      Yes, fuck them. Problem is Anubis here is not doing the job. As the article already explains, currently Anubis is not adding a single cent to the AI scrapers' costs. For Anubis to become effective against scrapers, it will necessarily have to become quite annoying for legitimate users.
      [-]
      - Gibbon1 5 hours ago ago
        Best response to AI scrapers is to poison their models.
        [-]
        nemomarx 5 hours ago ago
        how well is modern poisoning holding up?
        [-]
        CursedSilicon 5 hours ago ago
        I'll tell you in a second. First I wanna try adding gasoline to my spaghetti as suggested by Google's search
        [-]
        snerbles 3 hours ago ago
        A balanced diet of hydrocarbons in your carbohydrates!
        codedokode 4 hours ago ago
        What about appealing to ethics, i.e. posting messages about how a poor catgirl ended up on the street because AI took her job? To make AI refuse to reply due to ethical concerns?
bawolff 3 hours ago ago
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
Counterpoint - it seems to work. People use Anubis because it's the best of bad options.
If theory and reality disagree, it means either you are missing something or your theory is wrong.
ksymph 11 hours ago ago
This is neither here nor there but the character isn't a cat. It's in the name, Anubis, who is an Egyptian deity typically depicted as a jackal or generic canine, and the gatekeeper of the afterlife who weighs the souls of the dead (hence the tagline). So more of a dog-girl, or jackal-girl if you want to be technical.
[-]
- pak9rabid 5 hours ago ago
  Well, thank you for that. That's a great weight off me mind.
- JdeBP 5 hours ago ago
  ... but entirely lacking the primary visual feature that Anubis had.
sidewndr46 3 hours ago ago
> The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans
I'm unsure if this is deadpan humor or if the author has never tried to solve a CAPTCHA that is something like "select the squares with an orthodox rabbi present"
[-]
- Lammy 2 hours ago ago
  I enjoyed the furor around the 2008 RapidShare captcha lol
  - https://www.htmlcenter.com/blog/now-thats-an-annoying-captch...
  - https://depressedprogrammer.wordpress.com/2008/04/20/worst-c...
  - https://medium.com/xato-security/a-captcha-nightmare-f6176fa...
- classichasclass 3 hours ago ago
  The problem with that CAPTCHA is you're not allowed to solve it on Saturdays.
- wingworks 3 hours ago ago
  There are also services out that will solve any CAPTCHA for you at a very small cost to you. And an AI company will get steep discounts with the volumes of traffic they do.
  There are some browser extensions for it too, like NopeCHA, it works 99% of the time and saves me the hassle of doing them.
  Any site using CAPTCHAs today is really only hurting their real customers and low hanging fruit.
  Of course this assumes they can't solve the captcha themselves, with AI, which often they can.
  [-]
  - petesergeant an hour ago ago
    Yes, but not at a rate that enables them to be a risk to your hosting bill. My understanding is that the goal here isn't to prevent crawlers, it's to prevent overly aggressive ones.
- bawolff 3 hours ago ago
  Well the problem is that computers got good at basically everything.
  Early 2000s captchas really were like that.
  [-]
  - ok123456 2 hours ago ago
    The original reCAPTCHA was doing distributed book OCR. It was sold as an altruistic project to help transcribe old books.
ok123456 2 hours ago ago
Why is kernel.org doing this for essentially static content? Cache control headers and ETAGS should solve this. Also, the Linux kernel has solved the C10K problem.
[-]
- mixologic 2 hours ago ago
  Because it's static content that is almost never cached because it's infrequently accessed. Thus, almost every hit goes to the origin.
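  For reference, the ETag mechanism mentioned above can be sketched roughly as follows. This is a minimal illustration, not kernel.org's setup: `make_etag` and `respond` are hypothetical helpers, and hashing the body is just one common way to derive a validator.

```python
import hashlib
from typing import Optional


def make_etag(body: bytes) -> str:
    """Derive a validator from the response body (one common choice, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes, if_none_match: Optional[str]):
    """Return (status, headers, payload) for a GET, honoring If-None-Match."""
    etag = make_etag(body)
    headers = {"ETag": etag, "Cache-Control": "public, max-age=3600"}
    if if_none_match == etag:
        # The client already holds this exact version: send no body.
        return 304, headers, b""
    return 200, headers, body


page = b"static content"
status, headers, _ = respond(page, None)                     # first visit
status_again, _, payload = respond(page, headers["ETag"])    # revalidation
print(status, status_again, len(payload))
```

  The saving only appears on the second request, and only if the client stores the validator and sends it back. A crawler that keeps no state, or a page nobody has fetched recently, still costs a full response from the origin.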
bogwog 7 hours ago ago
I wonder if the best solution is still just to create link mazes with garbage text like this: https://blog.cloudflare.com/ai-labyrinth/
It won't stop the crawlers immediately, but it might lead to an overhyped and underwhelming LLM release from a big name company, and force them to reassess their crawling strategy going forward?
leumon 6 hours ago ago
Seems like ai bots are indeed bypassing the challenge by computing it: https://social.anoxinon.de/@Codeberg/115033790447125787
[-]
- debugnik 5 hours ago ago
  That's not bypassing it, that's them finally engaging with the PoW challenge as intended, making crawling slower and more expensive, instead of failing to crawl at all, which is more of a plus.
  This however forces servers to increase the challenge difficulty, which increases the waiting time for the first-time access.
  [-]
  - NoGravitas 3 hours ago ago
    The point is that it will always be cheaper for bot farms to pass the challenge than for regular users.
    [-]
    - bawolff 3 hours ago ago
      It might be a lot closer if they were using argon2 instead of sha. Sha is a kind of bad choice for this sort of thing.
  - nialv7 5 hours ago ago
    Obviously the developer of Anubis thinks it is bypassing: https://github.com/TecharoHQ/anubis/issues/978
    [-]
    - debugnik 5 hours ago ago
      Fair, then I obviously think Xe may have a kinda misguided understanding of their own product. I still stand by the concept I stated above.
      [-]
      - rhaps0dy 3 hours ago ago
        latest update from Xe:
        > After further investigation and communication. This is not a bug. The threat actor group in question installed headless chrome and simply computed the proof of work. I'm just going to submit a default rule that blocks huawei.
  - hiccuphippo 4 hours ago ago
    Too bad the challenge's result is only a waste of electricity. Maybe they should do like some of those alt-coins and search for prime numbers or something similar instead.
    [-]
    - kevincox 4 hours ago ago
      Of course that doesn't directly help the site operator. Maybe it could actually do a bit of bitcoin mining for the site owner. Then that could pay for the cost of accessing the site.
    - bawolff 3 hours ago ago
      Most of those alt-coins are kind of fake/scams. Its really hard to make it work with actually useful problems.
  - danieltanfh95 an hour ago ago
    this only holds true if the data to be accessed is less valuable than the computational cost. in this case, that is false and spending a few dollars to scrape data is more than worth it.
    reducing the problem to a cost issue is bound to be short sighted.
sugarpimpdorsey 6 hours ago ago
Every time I see one of these I think it's a malicious redirect to some pervert-dwelling imageboard.
On that note, is kernel.org really using this for free and not the paid version without the anime? Linux Foundation really that desperate for cash after they gas up all the BMW's?
[-]
- qualeed 6 hours ago ago
  It's crazy (especially considering anime is more popular now than ever; netflix alone is making billions a year on anime) that people see a completely innocent little anime picture and immediately think "pervert-dwelling imageboard".
  [-]
  - Seattle3503 6 hours ago ago
    To be fair, that's the sort of place where I spend most of my free time.
  - gruez 6 hours ago ago
    "Anime pfp" stereotype is alive and well.
  - ants_everywhere 5 hours ago ago
    they've seized the moment to move the anime cat girls off the Arch Linux desktop wallpapers and onto lore.kernel.org.
  - turtletontine 5 hours ago ago
    Even if the images aren’t the kind of sexualized (or downright pornographic) content this implies… having cutesy anime girls pop up when a user loads your site is, at best, wildly unprofessional. (Dare I say “cringe”?) For something as serious and legit as kernel.org to have this, I do think it’s frankly shocking and unacceptable.
    [-]
    - Modified3019 5 hours ago ago
      https://storage.courtlistener.com/recap/gov.uscourts.miwd.11...
      https://storage.courtlistener.com/recap/gov.uscourts.miwd.11...
      “The future is now, old man”
      [-]
      - delecti 3 hours ago ago
        Assuming your quote isn't a joke, I think those links prove the opposite.
        Not only is it unprofessional, courts have found it impermissible.
      - staringback 4 hours ago ago
        This is the most hilarious thing I have ever read from HN, thank you.
    - ge96 5 hours ago ago
      never forget the Ponies CV of an ML guy https://www.huffingtonpost.co.uk/2013/09/03/my-little-pony-r...
    - antiloper 5 hours ago ago
      If anime girls prevent LLM scraper sympathizers from interacting with the kernel, that's a good thing and should be encouraged more!
    - Hamuko 5 hours ago ago
      Isn't the mascot/logo for the Linux kernel a cartoon penguin?
      [-]
      - qualeed 5 hours ago ago
        Right, but, that's different. Penguins are serious and professional.
        [-]
        xsmasher 4 hours ago ago
        I mean, he's wearing a tuxedo!
      - consp 5 hours ago ago
        I have a plushy tux at home (about 30cm high). So now I'm in the same league as the people with anime pillows?
        [-]
        nemomarx 5 hours ago ago
        Well, the people with anime plushies would be a better comparison. There's plenty more of those than pillows.
        xeonmc 5 hours ago ago
        It depends. What do you do with the plushy?
        [-]
        f1refly 5 hours ago ago
        I bet he's keeping it on some shelf because he thinks it's cute like only a true sicko would do
        pests 4 hours ago ago
        What’s the difference?
    - aseipp 4 hours ago ago
      You'll live.
- Lammy 5 hours ago ago
  > Every time I see one of these I think it's a malicious redirect to some pervert-dwelling imageboard.
  Anubis is a clone of Kiwiflare, not an original work, so you're actually sort of half-right: https://kiwifarms.st/threads/kiwiflare.147312/ (2022)
  (Standard disclaimer that sharing this link is not endorsement of this website and its other contents)
  [-]
  - sugarpimpdorsey 4 hours ago ago
    > Anubis is a clone of Kiwiflare, not an original work, so you're actually sort of half-right:
    Interesting. That itself appears to be a clone of haproxy-protection. I know there has also been an nginx module that does the same for some time. Either way, proof-of-work is by this point not novel.
    Everyone seems to have overlooked the more substantive point of my comment which is that it appears kernel.org cheaped out and is using the free version of Anubis, instead of paying up to support the developer for his work. You know they have the money to do it.
    In 2024 the Linux Foundation reported $299.7M in expenses, with $22.7M of that going toward project infrastructure and $15.2M on "event services" (I guess making sure the cotton candy machines and sno-cone makers were working at conferences).
    My point is, cough up a few bucks for a license you chiselers.
    [-]
    - murderfs 2 hours ago ago
      > My point is, cough up a few bucks for a license you chiselers.
      You mean this one? https://github.com/TecharoHQ/anubis/blob/main/LICENSE
      [-]
      - sugarpimpdorsey an hour ago ago
        No I mean this one:
        https://anubis.techaro.lol/docs/admin/botstopper
  - fortran77 15 minutes ago ago
    I saw the description and thought "Wow! That works just like the DDOS retarding" of KiwiFlare. I didn't know it was a proper fork of it.
  - efilife 4 hours ago ago
    Can somebody please explain why was this comment flagged to death? I seem to be missing something
    [-]
    - ufo 3 hours ago ago
      Possibly because it links to kiwifarms (nasty website to say the least)
hansjorg 6 hours ago ago
If you want a tip my friend, just block all of Huawei Cloud by ASN.
[-]
- wging 2 hours ago ago
  ... looks like they did: https://github.com/TecharoHQ/anubis/pull/1004, timestamped a few hours after your comment.
xphos 4 hours ago ago
Yeah the PoW is minor for botters but annoying for people. I think the only positive is if enough people see anime girls on their screens there might actually be political pressure to make laws against rampant bot crawling
[-]
- Havoc 3 hours ago ago
  > PoW is minor for botters
  But still enough to prevent a billion request DDoS
  These sites have been search engine scraped forever. It’s not about blocking bots entirely just about this new wave of fuck you I don’t care if your host goes down quasi malicious scrapers
  [-]
  - elcritch 23 minutes ago ago
    Reading TFA, those billions requests would cost web crawlers what about $100 in compute?
  - st3fan an hour ago ago
    "But still enough to prevent a billion request DDoS" - don't you just do the PoW once to get a cookie and then you can browse freely?
    [-]
    - seba_dos1 an hour ago ago
      Yes, but a single bot is not a concern. It's the first "D" in DDoS that makes it hard to handle
      (and these bots tend to be very, very dumb - which often happens to make them more effective at DDoSing the server, as they're taking the worst and the most expensive ways to scrape content that's openly available more efficiently elsewhere)
galaxyLogic 26 minutes ago ago
I think the solution to captcha-rot is micro-payments. It does consume resources to serve a web-page so who's gonna pay for that?
If you want to do advertisement then don't require a payment, and be happy that crawlers will spread your ad to the users of AI-bots.
If you are a non-profit-site then it's great to get a micro-payment to help you maintain and run the site.
extraduder_ire 2 hours ago ago
With the asymmetry of doing the PoW in javascript versus compiled c code, I wonder if this type of rate limiting is ever going to be directly implemented into regular web browsers. (I assume there's already plugins for curl/wget)
Other than Safari, mainstream browsers seem to have given up on considering browsing without javascript enabled a valid usecase. So it would purely be a performance improvement thing.
alt187 an hour ago ago
This is a usually technical crowd, so I can't really blame people for "not getting it", because Anubis has nothing to do with tech.
The raison d'être of Anubis is pure virtue signalling, in the most absolute sense.
Incidentally, it has nothing to do with rate limiting, or blocking AI scrapers. Since Anubis can trivially be bypassed by simply removing "Mozilla" from the User-Agent.
The only reason people have been installing Anubis _en masse_ is because Xe Iaso (the author of Anubis) is a poster child for one of the prevalent, newest ideological currents in the FOSS sphere, and this is an easy and (relatively) unobtrusive way to broadcast your support to this set of beliefs and attitudes.
Ironically, Anubis has been developed using ChatGPT, which really ties together the whole theory. Blood money and all that.
[-]
- fortran77 19 minutes ago ago
  Exactly right. Few here get it because everyone here climbs over each other to see who can virtue signal the most.
listic 6 hours ago ago
So... Is Anubis actually blocking bots because they didn't bother to circumvent it?
[-]
- loloquwowndueo 3 hours ago ago
  Basically. Anubis is meant to block mindless, careless, rude bots with seemingly no technically proficient human behind the process; these bots tend to be very aggressive and make tons of requests bringing sites down.
  The assumption is that if you’re the operator of these bots and care enough to implement the proof of work challenge for Anubis you could also realize your bot is dumb and make it more polite and considerate.
  Of course nothing precludes someone implementing the proof of work on the bot but otherwise leaving it the same (rude and abusive). In this case Anubis still works as a somewhat fancy rate limiter which is still good.
  [-]
  - elcritch 16 minutes ago ago
    Essentially the Pow aspect is pointless then? They could require almost any arbitrary thing.
    [-]
    - loloquwowndueo 12 minutes ago ago
      What else do you envision being used instead of proof of work?
jimmaswell 11 hours ago ago
What exactly is so bad about AI crawlers compared to Google or Bing? Is there more volume or is it just "I don't like AI"?
[-]
- themafia 4 hours ago ago
  If you want my help training up your billion dollar model then you should pay me. My content is for humans. If you're not a human you are an unwelcome burden.
  Search engines, at least, are designed to index the content, for the purpose of helping humans find it.
  Language models are designed to filch content out of my website so it can reproduce it later without telling the humans where it came from or linking them to my site to find the source.
  This is exactly the reason that "I just don't like 'AI'." You should ask the bot owners why they "just don't like appropriate copyright attribution."
  [-]
  - jimmaswell 9 minutes ago ago
    > copyright attribution
    You can't copyright an idea, only a specific expression of an idea. An LLM works at the level of "ideas" (in essence - for example if you subtract the vector for "woman" from "man" and add the difference to "king" you get a point very close to "queen") and reproduces them in new contexts and makes its own connections to other ideas. It would be absurd for you to demand attribution and payment every time someone who read your Python blog said "Python is dynamically type-checked and garbage-collected". Thankfully that's not how the law works. Abusive traffic is a problem, but the world is a better place if humans can learn from these ideas with the help of ChatGPT et al. and to say they shouldn't be allowed to just because your ego demands credit for every idea someone learns from you is purely selfish.
- marvinborner 3 hours ago ago
  As a reference on the volume aspect: I have a tiny server where I host some of my git repos. After the fans of my server spun increasingly faster/louder every week, I decided to log the requests [1]. In a single week, ClaudeBot made 2.25M (!) requests (7.55GiB), whereas GoogleBot made only 24 requests (8.37MiB). After installing Anubis the traffic went down to before the AI hype started.
  [1] https://types.pl/@marvin/114394404090478296
- dilDDoS 7 hours ago ago
  As others have said, it's definitely volume, but also the lack of respecting robots.txt. Most AI crawlers that I've seen bombarding our sites just relentlessly scrape anything and everything, without even checking to see if anything has changed since the last time they crawled the site.
  [-]
  - benou 6 hours ago ago
    Yep, AI scrapers have been breaking our open-source project gerrit instance hosted at Linux Network Foundation.
    Why this is the case while web crawlers have been scraping the web for the last 30 years is a mystery to me. This should be a solved problem. But it looks like this field is full of badly behaving companies with complete disregard for the common good.
    [-]
    - johnnyanmac 5 hours ago ago
      >Why this is the case while web-crawlers have been scrapping the web for the last 30 years is a mystery to me.
      a mix of ignorance, greed, and a bit of the tragedy of the commons. If you don't respect anyone around you, you're not going to care about any rules or etiquette that don't directly punish you. Society has definitely broken down over the decades.
- ezrast 5 hours ago ago
  High volume and inorganic traffic patterns. Wikimedia wrote about it here: https://diff.wikimedia.org/2025/04/01/how-crawlers-impact-th...
- Philpax 11 hours ago ago
  Volume, primarily - the scrapers are running full-tilt, which many dynamic websites aren't designed to handle: https://pod.geraspora.de/posts/17342163
  [-]
  - zahlman 6 hours ago ago
    Why not just actually rate-limit everyone, instead of slowing them down with proof-of-work?
    [-]
    - NobodyNada 6 hours ago ago
      My understanding is that AI scrapers rotate IPs to bypass rate-limiting. Anubis requires clients to solve a proof-of-work challenge upon their first visit to the site to obtain a token that is tied to their IP and is valid for some number of requests -- thus forcing impolite scrapers to solve a new PoW challenge each time they rotate IPs, while being unobtrusive for regular users and scrapers that don't try to bypass rate limits.
      It's like a secondary rate-limit on the ability of scrapers to rotate IPs, thus allowing your primary IP-based rate-limiting to remain effective.
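      A rough sketch of what "a token tied to the IP" could look like. This is not Anubis's actual implementation: the secret, the one-week lifetime, and the function names are all assumptions made for illustration, and the proof-of-work check that would precede `issue_token` is not shown.

```python
import hashlib
import hmac
import time

SECRET = b"server-side secret"   # hypothetical key, never sent to clients


def issue_token(client_ip: str, ttl_seconds: int = 7 * 24 * 3600, now=None) -> str:
    """Issue a token after the client has passed the challenge (check not shown)."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    payload = f"{client_ip}|{expires}"
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"


def token_valid(token: str, client_ip: str, now=None) -> bool:
    """Accept the token only from the IP it was issued to, and only before expiry."""
    try:
        expires_str, sig = token.split(".", 1)
        expires = int(expires_str)
    except ValueError:
        return False
    payload = f"{client_ip}|{expires}"
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    return hmac.compare_digest(sig, expected) and current < expires
```

      Because the signature covers the IP, a scraper that moves to a new address presents a token that no longer verifies and has to solve a fresh challenge, which is the "secondary rate limit" described above.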
      [-]
      - Symbiote 3 hours ago ago
        Earlier today I found we'd served over a million requests to over 500,000 different IPs.
        All had the same user agent (current Safari), they seem to be from hacked computers as the ISPs are all over the world.
        The structure of the requests almost certainly means we've been specifically targeted.
        But it's also a valid query, reasonable for normal users to make.
        From this article, it looks like Proof of Work isn't going to be the solution I'd hoped it would be.
        [-]
        NobodyNada 2 hours ago ago
        The math in the article assumes scrapers only need one Anubis token per site, whereas a scraper using 500,000 IPs would require 500,000 tokens.
        Scaling up the math in the article, which states it would take 6 CPU-minutes to generate enough tokens to scrape 11,508 Anubis-using websites, we're now looking at 4.3 CPU-hours to obtain enough tokens to scrape your website (and 50,000 CPU-hours to scrape the Internet). This still isn't all that much -- looking at cloud VM prices, that's around 10c to crawl your website and $1000 to crawl the Internet, which doesn't seem like a lot but it's much better than "too low to even measure".
        However, the article observes Anubis's default difficulty can be solved in 30ms on a single-core server CPU. That seems unreasonably low to me; I would expect something like a second to be a more appropriate difficulty. Perhaps the server is benefiting from hardware accelerated sha256, whereas Anubis has to be fast enough on clients without it? If it's possible to bring the JavaScript PoW implementation closer to parity with a server CPU (maybe using a hash function designed to be expensive and hard to accelerate, rather than one designed to be cheap and easy to accelerate), that would bring the cost of obtaining 500k tokens up to 138 CPU-hours -- about $2-3 to crawl one site, or around $30,000 to crawl all Anubis deployments.
        I'm somewhat skeptical of the idea of Anubis -- that cost still might be way too low, especially given the billions of VC dollars thrown at any company with "AI" in their sales pitch -- but I think the article is overly pessimistic. If your goal is not to stop scrapers, but rather to incentivize scrapers to be respectful by making it cheaper to abide by rate limits than it is to circumvent them, maybe Anubis (or something like it) really is enough.
        (Although if it's true that AI companies really are using botnets of hacked computers, then Anubis is totally useless against bots smart enough to solve the challenges since the bots aren't paying for the CPU time.)
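        The scaling in this comment can be reproduced with a few lines. This is a sketch that reuses the article's figures as quoted in the thread (2^16 hashes per token, and 2^21 read as hashes per second); the dollar amounts depend on cloud pricing and are left out.

```python
SECONDS_PER_TOKEN = 2 ** 16 / 2 ** 21   # about 31 ms per token, from the article's figures
TOKENS = 500_000                        # one token per IP, per the scenario above
SITES = 11_508                          # Anubis deployments, per the article

hours_one_site = TOKENS * SECONDS_PER_TOKEN / 3600
hours_all_sites = hours_one_site * SITES
hours_one_site_at_1s = TOKENS * 1.0 / 3600   # if each challenge took a full second

print(f"one site: {hours_one_site:.1f} CPU-hours")
print(f"all sites: {hours_all_sites:,.0f} CPU-hours")
print(f"one site at 1 s/token: {hours_one_site_at_1s:.0f} CPU-hours")
```

        The cost grows linearly in both the number of IPs and the per-challenge time, so raising the difficulty by about 32x (from roughly 31 ms to 1 s) raises the scraper's bill by the same factor.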
  - immibis 9 hours ago ago
    Why haven't they been sued and jailed for DDoS, which is a felony?
    [-]
    - ranger_danger 8 hours ago ago
      Criminal convictions in the US require a standard of proof that is "beyond a reasonable doubt" and I suspect cases like this would not pass the required mens rea test, as, in their minds at least (and probably a judge's), there was no ill intent to cause a denial of service... and trying to argue otherwise based on any technical reasoning (e.g. "most servers cannot handle this load and they somehow knew it") is IMO unlikely to sway the court... especially considering web scraping has already been ruled legal, and that a ToS clause against that cannot be legally enforced.
      [-]
      - s1mplicissimus 5 hours ago ago
        coming from a different legal system so please forgive my ignorance: Is it necessary in the US to prove ill intent in order to sue for repairs? Just wondering, because when I accidentally punch someones tooth out, I would assume they certainly are entitled to the dentist bill.
        [-]
        johnnyanmac 5 hours ago ago
        >Is it necessary in the US to prove ill intent in order to sue for repairs?
        As a general rule of thumb: you can sue anyone for anything in the US. There are even a few cases where someone tried to sue God: https://en.wikipedia.org/wiki/Lawsuits_against_supernatural_...
        When we say "do we need" or "can we do" we're talking about the idea of how plausible it is to win case. A lawyer won't take a case with bad odds of winning, even if you want to pay extra because a part of their reputation lies on taking battles they feel they can win.
        >because when I accidentally punch someones tooth out, I would assume they certainly are entitled to the dentist bill.
        IANAL, so the boring answer is "it depends". reparations aren't guaranteed, but there's 50 different state laws to consider, on top of federal law.
        Generally, they are not entitled to pay for damages themselves, but they may possibly be charged with battery. Intent will be a strong factor in winning the case.
      - slowmovintarget 7 hours ago ago
        I thought only capital crimes (murder, for example) held the standard of beyond a reasonable doubt. Lesser crimes require the standard of either a "Preponderance of Evidence" or "Clear and Convincing Evidence" as burden of proof.
        Still, even by those lesser standards, it's hard to build a case.
        [-]
        Majromax 7 hours ago ago
        It's civil cases that have the lower standard of proof. Civil cases arise when one party sues another, typically seeking money, and they are claims in equity, where the defendant is alleged to have harmed the plaintiff in some way.
        Criminal cases require proof beyond a reasonable doubt. Most things that can result in jail time are criminal cases. Criminal cases are almost always brought by the government, and criminal acts are considered harm to society rather than to (strictly) an individual. In the US, criminal cases are classified as "misdemeanors" or "felonies," but that language is not universal in other jurisdictions.
        [-]
        slowmovintarget 3 hours ago ago
        Thank you.
        eurleif 7 hours ago ago
        No, all criminal convictions require proof beyond a reasonable doubt: https://constitution.congress.gov/browse/essay/amdt14-S1-5-5...
        >Absent a guilty plea, the Due Process Clause requires proof beyond a reasonable doubt before a person may be convicted of a crime.
        [-]
        slowmovintarget 3 hours ago ago
        Thank you.
- blibble 6 hours ago ago
  they seem to be written by either idiots and/or people that don't give a shit about being good internet citizens
  either way the result is the same: they induce massive load
  well written crawlers will:
```
  - not hit a specific ip/host more frequently than say 1 req/5s
  - put newly discovered URLs at the end of a distributed queue (NOT do DFS per domain)
  - limit crawling depth based on crawled page quality and/or response time
  - respect robots.txt
  - make it easy to block them
```
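  Two of the points above (per-host rate limiting and respecting robots.txt) can be sketched with the standard library. `PoliteScheduler`, `ExampleBot`, and the 5-second delay are illustrative assumptions; fetching, queueing, and depth limits are left out.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

MIN_DELAY = 5.0            # seconds between requests to one host (the "1 req/5s" above)
USER_AGENT = "ExampleBot"  # hypothetical crawler name


class PoliteScheduler:
    def __init__(self):
        self.robots = {}        # host -> RobotFileParser
        self.next_allowed = {}  # host -> earliest time of the next request

    def load_robots(self, host: str, robots_txt: str) -> None:
        """Parse an already-downloaded robots.txt for this host."""
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def may_fetch(self, url: str, now=None) -> bool:
        """True if robots.txt allows the URL and the host's delay has elapsed."""
        now = time.monotonic() if now is None else now
        host = urlparse(url).netloc
        parser = self.robots.get(host)
        if parser is not None and not parser.can_fetch(USER_AGENT, url):
            return False
        if now < self.next_allowed.get(host, 0.0):
            return False
        self.next_allowed[host] = now + MIN_DELAY
        return True
```

  A crawler would call `may_fetch` before each request and push the URL back to the end of its queue when the answer is `False`, rather than retrying immediately.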
herf 2 hours ago ago
We deployed hashcash for a while back in 2004 to implement Picasa's email relay - at the time it was a pretty good solution because all our clients were kind of similar in capability. Now I think the fastest/slowest device is a broader range (just like Tavis says), so it is harder to tune the difficulty for that.
spiritplumber 37 minutes ago ago
For the same reason why cats sit on your keyboard. Because they can
johnisgood 5 hours ago ago
I like hashcash.
https://github.com/factor/factor/blob/master/extra/hashcash/...
https://bitcoinwiki.org/wiki/hashcash
[-]
- loloquwowndueo 3 hours ago ago
  Anubis is based on hashcash concepts - just adapted to a web request flow. Basically the same thing - moderately expensive for the sender/requester to compute, insanely cheap for the server/recipient to verify.
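  A rough sketch of that asymmetry, assuming a SHA-256 challenge with a leading-zero-bits target. The function names and the `challenge:nonce` string format are made up for illustration and are not Anubis's actual wire format.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force a nonce (about 2**difficulty_bits hashes on average)."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash checks the client's work."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty_bits
```

  The requester loops over thousands of hashes to find a nonce; the recipient spends exactly one hash in `verify`.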
iefbr14 11 hours ago ago
I wouldn't be surprised if just delaying the server response by some 3 seconds will have the same effect on those scrapers as Anubis claims.
[-]
- kingstnap 7 hours ago ago
  There is literally no point wasting 3 seconds of a computer's time and it's expensive wasting 3 seconds of a person's time.
  That is literally an anti-human filter.
  [-]
  - Imustaskforhelp 6 hours ago ago
    From tjhorner on this same thread
    "Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project."
    So it's meant to block low-effort crawlers, which can still cause damage if you don't deal with them. A 3-second deterrent seems good in that regard. Maybe the 3-second deterrent could take the form of rate limiting an IP? But they might use swaths of IPs :/
    [-]
    - OkayPhysicist 4 hours ago ago
      Anubis exists specifically to handle the problem of bots dodging IP rate limiting. The challenge is tied to your IP, so if you're cycling IPs with every request, you pay dramatically more PoW than someone using a single IP. It's intended to be used in depth with IP rate limiting.
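      One way such an IP binding could work, as a sketch only: sign the client IP into the pass token with an HMAC, so a token copied to another IP fails verification and that IP has to solve its own challenge. `SECRET`, `issue_token`, and `token_valid` are hypothetical names, and this is not claimed to be Anubis's actual scheme.

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # hypothetical; a real deployment would load this securely


def issue_token(client_ip: str, expires_at: int) -> str:
    """Issued after a solved challenge; the signature covers the IP and expiry."""
    msg = f"{client_ip}|{expires_at}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{expires_at}.{sig}"


def token_valid(token: str, client_ip: str, now: int) -> bool:
    """Recompute the signature for the requesting IP and compare in constant time."""
    try:
        expires_str, sig = token.split(".", 1)
        expires_at = int(expires_str)
    except ValueError:
        return False
    msg = f"{client_ip}|{expires_at}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return now < expires_at and hmac.compare_digest(sig.encode(), expected.encode())
```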
  - loeg 2 hours ago ago
    Anubis easily wastes 3 seconds of a human's time already.
  - psionides an hour ago ago
    You've just described Anubis, yeah
- ranger_danger 8 hours ago ago
  Yea I'm not convinced unless somehow the vast majority of scrapers aren't already using headless browsers (which I assume they are). I feel like all this does is warm the planet.
Borg3 5 hours ago ago
Oh, it's time to bring the Internet back to humans. Maybe it's time to treat the first layer of the Internet just as transport. Then, layer large VPN networks on top and put services there. People will just VPN to a vISP to reach content. Different networks, different interests :) But this time don't fuck up abuse handling. Someone is doing something fishy? Depeer him from the network (or his uncooperative upstream!).
heap_perms 3 hours ago ago
> I host this blog on a single core 128MB VPS
No wonder the site is being hugged to death. 128MB is not a lot. Maybe it's worth upgrading if you post to Hacker News. Just a thought.
[-]
- bawolff 3 hours ago ago
  It doesn't take much to host a static website. It's all the dynamic stuff/frameworks/db/etc. that bogs everything down.
qwertytyyuu 2 hours ago ago
Isn’t Anubis a dog? So it should be anime dog/wolf girl rather than cat girl?
[-]
- Twisol 2 hours ago ago
  Yes, Anubis is a dog-headed or jackal-headed god. I actually can't find anywhere on the Anubis website where they talk about their mascot; they just refer to her neutrally as the "default branding".
  Since dog girls and cat girls in anime can look rather similar (both being mostly human + ears/tail), and the project doesn't address the point outright, we can probably forgive Tavis for assuming catgirl.
0003 2 hours ago ago
Soon any attempt to actually do it would indicate you're a bot.
andromaton 4 hours ago ago
Hug of death https://archive.ph/BSh1l
Philpax 11 hours ago ago
The argument isn't that it's difficult for them to circumvent - it's not - but that it adds enough friction to force them to rethink how they're scraping at scale and/or self-throttle.
I personally don't care about the act of scraping itself, but the volume of scraping traffic has forced administrators' hands here. I suspect we'd be seeing far fewer deployments if the scrapers behaved themselves to begin with.
[-]
- davidclark 11 hours ago ago
  The OP author shows that the cost to scrape an Anubis site is essentially zero since it is a fairly simple PoW algorithm that the scraper can easily solve. It adds basically no compute time or cost for a crawler run out of a data center. How does that force rethinking?
  [-]
  - Philpax 11 hours ago ago
    The cookie will be invalidated if shared between IPs, and it's my understanding that most Anubis deployments are paired with per-IP rate limits, which should reduce the amount of overall volume by limiting how many independent requests can be made at any given time.
    That being said, I agree with you that there are ways around this for a dedicated adversary, and that it's unlikely to be a long-term solution as-is. My hope is that the act of having to circumvent Anubis at scale will prompt some introspection (do you really need to be rescraping every website constantly?), but that's hopeful thinking.
    [-]
    - yborg 7 hours ago ago
      >do you really need to be rescraping every website constantly
      Yes, because if you believe you out-resource your competition, by doing this you deny them training material.
  - hooverd 11 hours ago ago
    The problem with crawlers is that they're functionally indistinguishable from your average malware botnet in behavior. If you saw a bunch of traffic from residential IPs using the same token, that's a big tell.
jchw 4 hours ago ago
> This… makes no sense to me. Almost by definition, an AI vendor will have a datacenter full of compute capacity. It feels like this solution has the problem backwards, effectively only limiting access to those without resources or trying to conserve them.
A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
Secondly, Anubis specifically targets bots that try to blend in with human traffic. Bots that don't try to blend in with humans are basically ignored and out-of-scope. Most malicious bots don't want to be targeted, so they want to blend in... so they kind of have to deal with this. If they want to avoid the Anubis challenge, they have to essentially identify themselves. If not, they have to solve it.
Finally... If bots really want to durably be able to pass Anubis challenges, they pretty much have no choice but to run the arbitrary code. Anything else would be a pretty straight-forward cat and mouse game. And, that means that being able to accelerate the challenge response is a non-starter: if they really want to pass it, and not appear like a bot, the path of least resistance is to simply run a browser. That's a big hurdle and definitely does increase the complexity of scraping the Internet. It increases more the more sites that use this sort of challenge system. While the scrapers have more resources, tools like Anubis scale the resources required a lot more for scraping operations than it does a specific random visitor.
To me, the most important point is that it only fights bot traffic that intentionally tries to blend in. That's why it's OK that the proof-of-work challenge is relatively weak: the point is that it's non-trivial and can't be ignored, not that it's particularly expensive to compute.
If bots want to avoid the challenge, they can always identify themselves. Of course, then they can also readily be blocked, which is exactly what they want to avoid.
In the long term, I think the success of this class of tools will stem from two things:
1. Anti-botting improvements, particularly in the ability to punish badly behaved bots, and possibly share reputation information across sites.
2. Diversity of implementations. More implementations of this concept will make it harder for bots to just hardcode fastpath challenge response implementations and force them to actually run the code in order to pass the challenge.
I haven't kept up with the developments too closely, but as silly as it seems I really do think this is a good idea. Whether it holds up as the metagame evolves is anyone's guess, but there's actually a lot of directions it could be taken to make it more effective without ruining it for everyone.
[-]
- o11c 3 hours ago ago
  > A lot of these bots consume a shit load of resources specifically because they don't handle cookies, which causes some software (in my experience, notably phpBB) to consume a lot of resources. (Why phpBB here? Because it always creates a new session when you visit with no cookies. And sessions have to be stored in the database. Surprise!) Forcing the bots to store cookies to be able to reasonably access a service actually fixes this problem altogether.
  ... has phpbb not heard of the old "only create the session on the second visit, if the cookie was successfully created" trick?
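  For anyone unfamiliar with the trick, a sketch of the idea with an in-memory dict standing in for the database. `handle_request`, the `probe` cookie, and the `sid` cookie are hypothetical names, not phpBB's.

```python
import secrets

sessions = {}  # stand-in for the database-backed session table


def handle_request(cookies: dict) -> dict:
    """Return the cookies to set on the response (hypothetical handler)."""
    if "sid" in cookies and cookies["sid"] in sessions:
        return {}                     # existing session, nothing to do
    if cookies.get("probe") == "1":
        sid = secrets.token_hex(16)   # client echoed the probe: cookies work
        sessions[sid] = {}
        return {"sid": sid}
    return {"probe": "1"}             # first visit: cheap cookie, no DB write
```

  A client that never sends cookies back only ever gets the probe, so it never causes a session row to be written. The trade-off is that such a client also never gets a session.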
  [-]
  - jchw 3 hours ago ago
    phpBB supports browsers that don't support or accept cookies: if you don't have a cookie, the URL for all links and forms will have the session ID in it. Which would be great, but it seems like these bots are not picking those up either for whatever reason.
fluoridation 11 hours ago ago
Hmm... What if instead of using plain SHA-256 it was a dynamically tweaked hash function that forced the client to run it in JS?
[-]
- jsnell 9 hours ago ago
  No, the economics will never work out for a Proof of Work-based counter-abuse challenge. CPU is just too cheap in comparison to the cost of human latency. An hour of a server CPU costs $0.01. How much is an hour of your time worth?
  That's all the asymmetry you need to make it unviable. Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users. So there's no point in theorizing about an attacker solving the challenges cheaper than a real user's computer, and thus no point in trying to design a different proof of work that's more resistant to whatever trick the attackers are using to solve it for cheap. Because there's no trick.
  [-]
  - pavon 6 hours ago ago
    But for a scraper to be effective it has to load orders of magnitude more pages than a human browses, so a fixed delay causes a human to take 1.1x as long, but it will slow down a scraper by 100x. Requiring 100x more hardware to do the same job is absolutely a significant economic impediment.
    [-]
    - jsnell 3 hours ago ago
      The entire problem is that proof of work does not increase the cost of scraping by 100x. It does not even increase it by 100%. If you run the numbers, a reasonable estimate is that it increases the cost by maybe 0.1%. It is pure snakeoil.
  - fluoridation 8 hours ago ago
    >An hour of a server CPU costs $0.01. How much is an hour of your time worth?
    That's irrelevant. A human is not going to be solving the challenge by hand, nor is the computer of a legitimate user going to be solving the challenge continuously for one hour. The real question is, does the challenge slow down clients enough that the server does not expend outsized resources serving requests of only a few users?
    >Even if the attacker is no better at solving the challenge than your browser is, there's no way to tune the monetary cost to be even in the ballpark to the cost imposed to the legitimate users.
    No, I disagree. If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
    [-]
    - michaelt 6 hours ago ago
      The problem with proof-of-work is many legitimate users are on battery-powered, 5-year-old smartphones. While the scraping servers are huge, 96-core, quadruple-power-supply beasts.
    - jsnell 8 hours ago ago
      The human needs to wait for their computer to solve the challenge.
      You are trading something dirt-cheap (CPU time) for something incredibly expensive (human latency).
      Case in point:
      > If the challenge takes, say, 250 ms on the absolute best hardware, and serving a request takes 25 ms, a normal user won't even see a difference, while a scraper will see a tenfold slowdown while scraping that website.
      No. A human sees a 10x slowdown. A human on a low end phone sees a 50x slowdown.
      And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)
      That is not an effective deterrent. And there is no difficulty factor for the challenge that will work. Either you are adding too much latency to real users, or passing the challenge is too cheap to deter scrapers.
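      The arithmetic behind that figure, using only the two numbers quoted in this subthread ($0.01 per server CPU-hour and a 250 ms challenge); both are the thread's ballpark figures, not measurements.

```python
CPU_HOUR_USD = 0.01        # the per-CPU-hour price quoted in this subthread
CHALLENGE_SECONDS = 0.25   # the hypothetical 250 ms challenge

# Cost of one solved challenge = hourly price * fraction of an hour spent.
cost_per_challenge = CPU_HOUR_USD * CHALLENGE_SECONDS / 3600
challenges_per_dollar = 1 / cost_per_challenge

print(f"{cost_per_challenge:.2e} USD per challenge")
print(f"{challenges_per_dollar:,.0f} challenges per dollar")
```

      That comes to roughly 7e-7 dollars per challenge, or about 1.4 million solved challenges per dollar.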
      [-]
      - fluoridation 7 hours ago ago
        >No. A human sees a 10x slowdown.
        For the actual request, yes. For the complete experience of using the website not so much, since a human will take at least several seconds to process the information returned.
        >And the scraper paid one 1/1000000th of a dollar. (The scraper does not care about latency.)
        The point need not be to punish the client, but to throttle it. The scraper may not care about taking longer, but the website's operator may very well care about not being hammered by requests.
        [-]
        avhon1 6 hours ago ago
        But now I have to wait several seconds before I can even start to process the webpage! It's like the internet suddenly became slow again overnight.
        [-]
        fluoridation 5 hours ago ago
        Yeah, well, bad actors harm everyone. Such is the nature of things.
        jsnell 7 hours ago ago
        A proof of work challenge does not throttle the scrapers at steady state. All it does is add latency and cost to the first request.
        [-]
        fluoridation 6 hours ago ago
        Hypothetically, the cookie could be used to track the client and increase the difficulty if its usage becomes abusive.
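        A sketch of what that could look like, assuming the server counts requests per cookie within some window. All names and thresholds here are invented for illustration; nothing in the thread says Anubis does this.

```python
import math
from collections import defaultdict

BASE_BITS = 12        # hypothetical baseline difficulty
MAX_BITS = 24         # cap so the work never grows without bound
FREE_REQUESTS = 100   # requests per window before difficulty rises

requests_seen = defaultdict(int)  # cookie id -> requests in the current window


def record_request(cookie_id: str) -> None:
    requests_seen[cookie_id] += 1


def difficulty_for(cookie_id: str) -> int:
    """Roughly one extra bit (double the expected work) per doubling of usage past the quota."""
    used = requests_seen.get(cookie_id, 0)
    if used <= FREE_REQUESTS:
        return BASE_BITS
    extra = math.ceil(math.log2(used / FREE_REQUESTS))
    return min(BASE_BITS + extra, MAX_BITS)
```

        A normal reader stays at the baseline, while a client hammering the site with one cookie sees its next challenge get exponentially more expensive.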
        [-]
        soulofmischief 5 hours ago ago
        Yes, and then we can avoid the entire issue. It's patronizing for people to assume users wouldn't notice a 10x or 50x slowdown. You can tell those who think that way are not web developers, as we know that every millisecond has a real, nonlinear fiscal cost.
        Of course, then the issue becomes "what is the latency and cost incurred by a scraper to maintain and load balance across a large list of IPs". If it turns out that this is easily addressed by scrapers then we need another solution. Perhaps, the user's browser computes tokens in the background and then serves them to sites alongside a certificate or hash (to prevent people from just buying and selling these tokens).
        We solve the latency issue by moving it off-line, and just accept the tradeoff that a user is going to have to spend compute periodically in order to identify themselves in an increasingly automated world.
- VMG 11 hours ago ago
  crawlers can run JS, and also invest into running the Proof-Of-JS better than you can
  [-]
  - tjhorner 10 hours ago ago
    Anubis doesn't target crawlers which run JS (or those which use a headless browser, etc.) It's meant to block the low-effort crawlers that tend to make up large swaths of spam traffic. One can argue about the efficacy of this approach, but those higher-effort crawlers are out of scope for the project.
    [-]
    - Imustaskforhelp 6 hours ago ago
      Reminds me of how Wikipedia literally makes all its data available in a nice format just for scrapers (I think), and even THEN some scrapers still scraped Wikipedia and cost it so much money that I'm pretty sure some official statement had to be made, or they disclosed it without an official statement.
      Even then, man, I feel like so many resources could be saved (both the scrapers' and Wikipedia's) if scrapers had the sense to not scrape Wikipedia and instead follow Wikipedia's rules.
  - fluoridation 11 hours ago ago
    If we're presupposing an adversary with infinite money then there's no solution. One may as well just take the site offline. The point is to spend effort in such a way that the adversary has to spend much more effort, hopefully so much it's impractical.
ksymph 11 hours ago ago
Reading the original release post for Anubis [0], it seems like it operates mainly on the assumption that AI scrapers have limited support for JS, particularly modern features. At its core it's security through obscurity; I suspect that as usage of Anubis grows, more scrapers will deliberately implement the features needed to bypass it.
That doesn't necessarily mean it's useless, but it also isn't really meant to block scrapers in the way TFA expects it to.
[0] https://xeiaso.net/blog/2025/anubis/
[-]
- jhanschoo 10 hours ago ago
  Your link explicitly says:
  > It's a reverse proxy that requires browsers and bots to solve a proof-of-work challenge before they can access your site, just like Hashcash.
  It's meant to rate-limit access by requiring client-side compute that is light enough for legitimate human users and responsible crawlers, but taxing enough to impose a real cost on indiscriminate crawlers that request host resources excessively.
  It indeed mentions that lighter crawlers do not implement the right functionality in order to execute the JS, but that's not the main reason why it is thought to be sensible. It's a challenge saying that you need to want the content bad enough to spend the amount of compute an individual typically has on hand in order to get me to do the work to serve you.
  [-]
  - ksymph 10 hours ago ago
    Here's a more relevant quote from the link:
    > Anubis is a man-in-the-middle HTTP proxy that requires clients to either solve or have solved a proof-of-work challenge before they can access the site. This is a very simple way to block the most common AI scrapers because they are not able to execute JavaScript to solve the challenge. The scrapers that can execute JavaScript usually don't support the modern JavaScript features that Anubis requires. In case a scraper is dedicated enough to solve the challenge, Anubis lets them through because at that point they are functionally a browser.
    As the article notes, the work required is negligible, and as the linked post notes, that's by design. Wasting scraper compute is part of the picture to be sure, but not really its primary utility.
    [-]
    - kevincox 4 hours ago ago
      Why require proof of work with difficulty at all then? Just have no UI other than (javascript) required and run a trivial computation in WASM as a way of testing for modern browser features. That way users don't complain that it is taking 30s on their low-end phone and it doesn't make it any easier for scrapers to scrape (because the PoW was trivial anyways).
  - ranger_danger 8 hours ago ago
    The compute also only seems to happen once, not for every page load, so I'm not sure how this is a huge barrier.
    [-]
    - untilted 5 hours ago ago
      Once per ip. Presumably there's ip-based rate limiting implemented on top of this, so it's a barrier for scrapers that aggressively rotate ip's to circumvent rate limits.
    - debugnik 5 hours ago ago
      It happens once if the user agent keeps a cookie that can be used for rate limiting. If a crawler hits the limit they need to either wait or throw the cookie away and solve another challenge.
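      A sketch of that kind of per-cookie budget as a token bucket. The capacity and refill rate are arbitrary, and `TokenBucket` and `check` are hypothetical names rather than anything taken from Anubis.

```python
class TokenBucket:
    """Per-cookie request budget that refills over time (hypothetical parameters)."""

    def __init__(self, capacity=10, refill_per_s=1.0, now=0.0):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        """Spend one token if available; otherwise the client must wait or re-solve."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


buckets = {}  # cookie value -> TokenBucket


def check(cookie, now):
    """Look up (or create) the bucket for this cookie and try to spend a token."""
    bucket = buckets.setdefault(cookie, TokenBucket(now=now))
    return bucket.allow(now)
```

      A fresh cookie gets a fresh bucket, which is exactly why obtaining one has to cost a solved challenge.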
johnea 7 hours ago ago
My biggest bitch is that it requires JS and cookies...
Although the long term problem is the business model of servers paying for all network bandwidth.
Actual human users have consumed a minority of total net bandwidth for decades:
https://www.atom.com/blog/internet-statistics/
Part 4 shows bots out using humans in 1996 8-/
What are "bots"? This needs to include goggleadservices, PIA sharing for profit, real-time ad auctions, and other "non-user" traffic.
The difference between that and the LLM training data scraping, is that the previous non-human traffic was assumed, by site servers, to increase their human traffic, through search engine ranking, and thus their revenue. However the current training data scraping is likely to have the opposite effect: capturing traffic with LLM summaries, instead of redirecting it to original source sites.
This is the first major disruption to the internet's model of finance since ad revenue took over after the dot bomb.
So far, it's in the same category as the environmental disaster in progress, ownership is refusing to acknowledge the problem, and insisting on business as usual.
Rational predictions are that it's not going to end well...
[-]
- jerf 6 hours ago ago
  "Although the long term problem is the business model of servers paying for all network bandwidth."
  Servers do not "pay for all the network bandwidth" as if they are somehow being targeted for fees and carrying water for the clients that are somehow getting it for "free". Everyone pays for the bandwidth they use, clients, servers, and all the networks in between, one way or another. Nobody out there gets free bandwidth at scale. The AI scrapers are paying lots of money to scrape the internet at the scales they do.
  [-]
  - Imustaskforhelp 6 hours ago ago
    The AI scrapers are most likely VC-funded, and all they care about is getting as much data as possible without worrying about the costs.
    They are renting machines at scale too, so bandwidth etc. is definitely cheaper for them. Maybe use a provider that doesn't have too many bandwidth issues (Hetzner?)
    But still, the point is that you might be hosting a website on your small server, and that scraper with its beastly machines can come and effectively DDoS your server looking for data to scrape. Deterring them is what matters, so that the economic scales finally tip back in our favour.
- Hizonner 5 hours ago ago
  > The difference between that and the LLM training data scraping
  Is the traffic that people are complaining about really training traffic?
  My SWAG would be that there are maybe on the order of dozens of foundation models trained in a year. If you assume the training runs are maximally inefficient, cache nothing, and crawl every Web site 10 times for each model trained, then that means maybe a couple of hundred full-content downloads for each site in a year. But really they probably do cache, and they probably try to avoid downloading assets they don't actually want to put into the training hopper, and I'm not sure how many times they feed any given page through in a single training run.
  That doesn't seem like enough traffic to be a really big problem.
  On the other hand, if I ask ChatGPT Deep Research to give me a report on something, it runs around the Internet like a ferret on meth and maybe visits a couple of hundred sites (but only a few pages on each site). It'll do that a whole lot faster than I'd do it manually, it's probably less selective about what it visits than I would be... and I'm likely to ask for a lot more such research from it than I'd be willing to do manually. And the next time a user asks for a report, it'll do it again, often on the same sites, maybe with caching and maybe not.
  That's not training; the results won't be used to update any neural network weights, and won't really affect anything at all beyond the context of a single session. It's "inference scraping" if you will. It's even "user traffic" in some sense, although not in the sense that there's much chance the user is going to see a site's advertising. It's conceivable the bot might check the advertising for useful information, but of course the problem there is that it's probably learned that's a waste of time.
  Not having given it much thought, I'm not sure how that distinction affects the economics of the whole thing, but I suspect it does.
  So what's really going on here? Anybody actually know?
  [-]
  - zerocrates 4 hours ago ago
    The traffic I've seen is the big AI players just voraciously scraping for ~everything. What they do with it, if anything, who knows.
    There's some user-directed traffic, but it's a small fraction, in my experience.
  - ncruces 2 hours ago ago
    It's not random internet people saying it's training. It's Cloudflare, among others.
    Search for “A graph of daily requests over time, comparing different categories of AI Crawlers” on this blog: https://blog.cloudflare.com/ai-labyrinth/
  - Dylan16807 5 hours ago ago
    The traffic I'm seeing on a wiki I host looks like plain old scraping. When it hits it's a steady load of lots of traffic going all over, from lots of IPs. And they really like diffs between old page revisions for some reason.
    [-]
    - Hizonner 4 hours ago ago
      That sounds like a really dumb scraper indeed. I don't think you'd want to feed very many diffs into a training run or most inference runs.
      But if there's a (discoverable) page comparing every revision of a page to every other revision, and a page has N revisions, there are going to be (N^2-N)/2 delta pages, so could it just be the majority of the distinct pages your Wiki has are deltas?
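      To make the (N^2-N)/2 growth concrete, a small sketch that counts unordered revision pairs. The revision names and the 500-revision figure are made up for illustration.

```python
from itertools import combinations
from math import comb


def diff_page_count(n_revisions: int) -> int:
    """Number of unordered revision pairs: (N^2 - N) / 2."""
    return (n_revisions * n_revisions - n_revisions) // 2


# Enumerate the pairs explicitly for a tiny page history.
revisions = ["r1", "r2", "r3", "r4"]
pairs = list(combinations(revisions, 2))

print(len(pairs), diff_page_count(4))
print(diff_page_count(500), comb(500, 2))
```

      Four revisions give 6 possible diffs; a page with 500 revisions gives 124,750, so diff URLs can easily outnumber every other kind of page.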
      I would think that by now the "AI companies" would have something smarter steering their scrapers. Like, I dunno, some kind of AI. But maybe they don't for some reason? Or maybe the big ones do, but smaller "hungrier" ones, with less staff but still probably with a lot of cash, are willing to burn bandwidth so they don't have to implement that?
      The questions just multiply.
      [-]
      - Dylan16807 4 hours ago ago
        It's near-stock mediawiki, so it has a ton of old versions and diffs off the history tab but I'd expect a crawler to be able to handle it.
ge96 5 hours ago ago
Oh I saw this recently on ffmpeg's site, pretty fun
serf 5 hours ago ago
I don't care that they use anime catgirls.
What I do care about is being met with something cutesy in the face of a technical failure anywhere on the net.
I hate Amazon's failure pets, I hate Google's failure mini-games -- it strikes me as an organizational effort to get really good at failing rather than spending that same effort to avoid failures altogether.
It's like everyone collectively thought the standard old Apache 404 not found page was too feature-rich and that customers couldn't handle a 3 digit error, so instead we now get a "Whoops! There appears to be an error! :) :eggplant: :heart: :heart: <pet image.png>" and no one knows what the hell is going on even though the user just misplaced a number in the URL.
[-]
- xandrius 4 hours ago ago
  The original versions were a way to make even a boring event such as a 404 fun. If the page stops conveying the type of error to the user then it's just bad UX, but vomiting all the internal jargon at a non-tech user is also bad UX.
  So I don't see an error code + something fun as being that bad.
  People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today, so I don't see how having fun error pages is such an issue.
  [-]
  - Hizonner 4 hours ago ago
    This assumes it's fun.
    Usually when I hit an error page, and especially if I hit repeated errors, I'm not in the mood for fun, and I'm definitely not in the mood for "fun" provided by the people who probably screwed up to begin with. It comes off as "oops, we can't do anything useful, but maybe if we try to act cute you'll forget that".
    Also, it was more fun the first time or two. There's not a lot of original fun on the error pages you get nowadays.
    > People love dreaming of the 90s wild web and hate the clean cut soulless corp web of today
    It's been a while, but I don't remember much gratuitous cutesiness on the 90s Web. Not unless you were actively looking for it.
    [-]
    - doublerabbit 4 hours ago ago
      > This assumes it's fun.
      Not to those who don't exist in such cultures. It's creepy, childish, strange to them. It's not something they see in everyday life, nor would I really want to. There is a reason why cartoons are aimed for younger audiences.
      Besides if your webserver is throwing errors, you've configured it incorrectly. Those pages should be branded as the site design with a neat and polite description to what the error is.
- JdeBP 5 hours ago ago
  Guru Meditations and Sad Macs are not your thing?
  [-]
  - Hizonner 2 hours ago ago
    That also got old when you got it again and again while you were trying to actually do something. But there wasn't the space to fit quite as much twee on the screen...
- pak9rabid 5 hours ago ago
  I hear this
jonathanyc 2 hours ago ago
> The idea of “weighing souls” reminded me of another anti-spam solution from the 90s… believe it or not, there was once a company that used poetry to block spam!
> Habeas would license short haikus to companies to embed in email headers. They would then aggressively sue anyone who reproduced their poetry without a license. The idea was you can safely deliver any email with their header, because it was too legally risky to use it in spam.
Kind of a tangent but learning about this was so fun. I guess it's ultimately a hack for there not being another legally enforceable way to punish people for claiming "this email is not spam"?
IANAL so what I'm saying is almost certainly nonsense. But it seems weird that the MIT license has to explicitly say that the licensed software comes with no warranty that it works, but that emails don't have to come with a warranty that they are not spam! Maybe it's hard to define what makes an email spam, but surely it is also hard to define what it means for software to work. Although I suppose spam never e.g. breaks your centrifuge.
zb3 6 hours ago ago
Anubis doesn't use enough resources to deter AI bots. If you really want to go this way, use React, preferably with more than one UI framework.
yuumei 11 hours ago ago
> The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans.
> Anubis – confusingly – inverts this idea.
Not really, AI easily automates traditional captchas now. At least this one does not need extensions to bypass.
anotherhue 11 hours ago ago
Surely the difficulty factor scales with the system load?
jmclnx 7 hours ago ago
>The CAPTCHA forces vistors to solve a problem designed to be very difficult for computers but trivial for humans
Not for me. I have nothing but a hard time solving CAPTCHAs; about 50% of the time I give up after 2 tries.
[-]
- serf 7 hours ago ago
  it's still certainly trivial for you compared to mentally computing a SHA256 op.
raffraffraff 5 hours ago ago
HN hug of death
[-]
- mr_toad 4 hours ago ago
  I’m getting a black page. Not sure if it’s an ironic meta commentary, or just my ad blocker.
tonymet 4 hours ago ago
So it's a paywall with -- good intentions -- and even more accessibility concerns. Thus accelerating enshittification.
Who's managing the network effects? How do site owners control false positives? Do they have support teams granting access? How do we know this is doing any good?
It's convoluted security theater mucking up an already bloated, flimsy and sluggish internet. It's frustrating enough to guess schoolbuses every time I want to get work done; now I have to see pornified kitty waifus.
(openwrt is another community plagued with this crap)
immibis 9 hours ago ago
The actual answer to how this blocks AI crawlers is that they just don't bother to solve the challenge. Once they do bother solving the challenge, the challenge will presumably be changed to a different one.
senectus1 2 hours ago ago
the action is great, anubis is a very clever idea i love it.
I'm not a huge fan of the anime thing, but i can live with it.
WesolyKubeczek 11 hours ago ago
I disagree with the post author in their premise that things like Anubis are easy to bypass if you craft your bot well enough and throw the compute at it.
Thing is, the actual lived experience of webmasters shows that the bots that scrape the internets for LLMs are nothing like crafted software. They are more like your neighborhood shit-for-brain meth junkies competing with one another over who commits more robberies in a day, no matter the profit.
Those bots are extremely stupid. They are worse than script kiddies’ exploit searching software. They keep banging the pages without regard to how often, if ever, they change. If they were 1/10th like many scraping companies’ software, they wouldn’t be a problem in the first place.
Since these bots are so dumb, anything that is going to slow them down or stop them in their tracks is a good thing. Short of drone strikes on data centers or accidents involving owners of those companies that provide networks of botware and residential proxies for LLM companies, it seems fairly effective, doesn’t it?
[-]
- int_19h 5 hours ago ago
  It is the way it is because there are easy pickings to be made even with this low effort, but the more sites adopt such measures, the less stupid your average bot will be.
- busterarm 6 hours ago ago
  Those are just the ones that you've managed to ID as bots.
  Ask me how I know.
superkuh 7 hours ago ago
Kernel.org* just has to actually configure Anubis rather than deploying the default broken config. Enable the meta-refresh proof of work rather than relying on the JavaScript proof of work that only runs in bleeding-edge corporate browsers.
* or whatever site the author is talking about; his site is currently inaccessible due to the number of people trying to load it.
lousken 11 hours ago ago
aren't you happy? at least you see catgirl
efilife 5 hours ago ago
This cartoon mascot has absolutely nothing to do with anime
If you disagree, please say why
lxgr 11 hours ago ago
> This isn’t perfect of course, we can debate the accessibility tradeoffs and weaknesses, but conceptually the idea makes some sense.
It was arguably never a great idea to begin with, and stopped making sense entirely with the advent of generative AI.
rnhmjoj 11 hours ago ago
I don't understand: why do people resort to this tool instead of simply blocking by UA string or IP address? Are there so many people running these AI crawlers?
I blackholed some IP blocks of OpenAI, Mistral and another handful of companies and 100% of this crap traffic to my webserver disappeared.
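A minimal sketch of the UA-string and IP-block filtering described above, assuming it is done in application code rather than at the firewall. The CIDR ranges are reserved documentation networks used as placeholders, not real crawler ranges, and the user-agent tokens and the function name `is_blocked` are illustrative assumptions.

```python
import ipaddress

# Placeholder ranges (RFC 5737 documentation networks), NOT real crawler blocks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr) for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

# Illustrative substrings to look for in the User-Agent header.
BLOCKED_UA_TOKENS = ("gptbot", "claudebot")


def is_blocked(remote_addr: str, user_agent: str) -> bool:
    """Return True if the request comes from a listed network or
    announces a listed crawler in its User-Agent."""
    addr = ipaddress.ip_address(remote_addr)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)
```

This only catches crawlers that identify themselves or come from known ranges; a crawler using a browser-like user agent from an unlisted address passes straight through.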
[-]
- mnmalst 11 hours ago ago
  Because that solution simply does not work for all. People tried and the crawlers started using proxies with residential IPs.
- hooverd 11 hours ago ago
  less savory crawlers use residential proxies and are indistinguishable from malware traffic
- busterarm 6 hours ago ago
  Lots of companies run these kinds of crawlers now as part of their products.
  They buy proxies and rotate through proxy lists constantly. It's all residential IPs, so blocking IPs actually hurts end users. Often it's the real IPs of VPN service customers, etc.
  There are lots of companies around that you can buy this type of proxy service from.
- WesolyKubeczek 11 hours ago ago
  You should read more. AI companies use residential proxies and mask their user agents with legitimate browser ones, so good luck blocking that.
  [-]
  - rnhmjoj 10 hours ago ago
    Which companies are we talking about here? In my case the traffic was similar to what was reported here[1]: these are crawlers from Google, OpenAI, Amazon, etc. They are really idiotic in behaviour, but at least report themselves correctly.
    [1]: https://pod.geraspora.de/posts/17342163
    [-]
    - nemothekid 5 hours ago ago
      OpenAI/Anthropic/Perplexity aren't the bad actors here. If they are, they are relatively simple to block - why would you implement an Anubis PoW MITM Proxy, when you could just simply block on UA?
      I get the sense many of the bad actors are simply poor copycats that are poorly building LLMs and are scraping the entire web without a care in the world.
  - majorchord 8 hours ago ago
    > AI companies use residential proxies
    Source:
    [-]
    - Macha 7 hours ago ago
      Source: Cloudflare
      https://blog.cloudflare.com/perplexity-is-using-stealth-unde...
      Perplexity's defense is that they're not doing it for training/KB building crawls but to answer dynamic queries, and this is apparently better.
      [-]
      - Dylan16807 4 hours ago ago
        Well yes it is better. It's a page load triggered by a user for their own processing.
        If web security worked a little differently, the requests would likely come from the user's browser.
      - ranger_danger 6 hours ago ago
        I do not see the words "residential" or "proxy" anywhere in that article... or any other text that might imply they are using those things. And personally... I don't trust crimeflare at all. I think they and their MITM-as-a-service have done even more/lasting damage to the global Internet and user privacy in general than all AI/LLMs combined.
        However, if this information is accurate... perhaps site owners should allow AI/bot user agents but respond with different content (or maybe a 404?) instead, to try to prevent it from making multiple requests with different UAs.
        [-]
          - Symbiote 3 hours ago ago
            I had 500,000 residential IPs make 1-4 requests each in the past couple of days.
            These had the same user agent (latest Safari), but previously the agent has been varied.
            Blocking this shit is much more complicated than any blocking necessary before 2024.
            The data is available for free download in bulk (it's a university) and this is advertised in several places, including the 429 response, the HTML source and the API documentation, but the AI people ignore this.
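To illustrate why traffic shaped like this is hard to block, here is a rough sketch of a per-IP request counter. The limit of 10 requests per window, the synthetic `ip-N` identifiers, and the function name `blocked_requests` are assumptions for the example; the 500,000 addresses and up to 4 requests each come from the comment above.

```python
from collections import Counter


def blocked_requests(request_ips, per_ip_limit: int) -> int:
    """Count how many requests a simple per-IP limiter would reject
    within a single window. Hypothetical limiter, for illustration only."""
    seen = Counter()
    blocked = 0
    for ip in request_ips:
        seen[ip] += 1
        if seen[ip] > per_ip_limit:
            blocked += 1
    return blocked


if __name__ == "__main__":
    # 500,000 distinct addresses, 4 requests each: 2,000,000 requests total.
    distributed = (f"ip-{i}" for i in range(500_000) for _ in range(4))
    # No address ever exceeds the limit, so nothing is rejected.
    print(blocked_requests(distributed, per_ip_limit=10))
```

Because no single address crosses the threshold, the limiter never fires even though the aggregate load is in the millions of requests.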
jayrwren 11 hours ago ago
Literally the top link when I search for his exact text "why are anime catgirls blocking my access to the Linux kernel?" https://lock.cmpxchg8b.com/anubis.html Maybe Tavis needs more google-fu. Maybe that includes using DuckDuckGo?
[-]
- Macha 7 hours ago ago
  The top link when you search the title of the article is the article itself?
  I am shocked, shocked I say.