I think I actually saw a question on SO way back during the Windows Vista era when some guy asked if Windows supported machines with odd number of cores/processors, and the answer was "well, 1 is an odd number, you know".
Another joke from the same era: Having a 2 core processor means that you can now e.g. watch a film at the same time. At the same time with what? At the same time with running Windows Vista!
The fun thing about those is they were physically quad cores with one core disabled, which may or may not have been defective, so if you were lucky you could unlock it and get a bonus core for free.
It’s also frequently wrong when running in Docker. Some of that is libuv’s fault, and some of it is that /proc values aren’t masked to reflect the cgroup’s limits, so they’re simply wrong inside the container.
I might be wrong but I think I heard that Firefox and maybe Safari bin that to a couple common values. (Or maybe that's just in the tracking-prevention mode that Tor uses?)
I always found it annoying that CPU information was widely available and precise while memory information was not - it's clamped to 0.25, 0.5, 1, 2, 4 or 8 GB. If you're running something memory-bound in the browser you have to be really conservative to avoid locking up the user's device (or ask them to manually specify how much memory to use).
https://developer.mozilla.org/en-US/docs/Web/API/Device_Memo...
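For reference, the coarseness is deliberate: per the Device Memory spec, the reported value is the actual RAM rounded to the nearest power of two, then clamped. A sketch of that rounding (my own illustration, not browser source):

```javascript
// Approximates how navigator.deviceMemory coarsens real RAM (in GiB):
// round to the nearest power of two, then clamp to [0.25, 8].
function clampDeviceMemory(gib) {
  const nearestPow2 = 2 ** Math.round(Math.log2(gib));
  return Math.min(8, Math.max(0.25, nearestPow2));
}

console.log(clampDeviceMemory(12));  // 8 -- a 12 GiB phone reports 8
console.log(clampDeviceMemory(3));   // 4
console.log(clampDeviceMemory(0.1)); // 0.25
```

So anything with 12 GiB or 64 GiB of RAM looks identical from the page's point of view, which is why memory-bound browser apps have to be conservative or ask the user.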
> In retrospect implementing the proof of work challenge may have been a mistake and it's likely to be supplanted by things like Proof of React or other methods that have yet to be developed.
> ... a challenge method that requires the client to do a single round of SHA-256 hashing deeply nested into a Preact hook in order to prove that the client is running JavaScript.
Why a single round? Doing the whole proof of work challenge inside the proof of react would be even more effective, right?
The whole Anubis thing is a really interesting predicament for me.
I have Chrome on mobile configured such that JS and cookies are disabled by default, and I then enable them per site based on my judgement. You might be surprised to learn that normally, this actually works fine, and sites are usually better for it. They stop nagging and load faster. This makes some sense in retrospect, as this is what allows search engine crawlers to do their thing and get that SEO score going.
Anubis (and Cloudflare, for that matter) forces me to temporarily enable JS and cookies at least once anyway, completely defeating the purpose of my paranoid settings. I basically never bother to, but I do admit it is annoying. It's up there with sites that have no content by default, only with JS on (high-profile example: AWS docs). At least Cloudflare only spoils the fun every now and then. With Anubis, it's always.
It's definitely my fault, but at the same time, I don't feel this is right. Simple static pages now require allowing arbitrary code execution and statefulness. (Although I do recognize that SVGs and fonts also kind of do so anyhow, much to my further annoyance).
> I have Chrome on mobile configured such that JS and cookies are disabled by default
My God, there's two of us!
(Though … you're being privacy conscious on Chrome? Come to Firefox. Ignore the pesky "it's funded by Google" problems, nothing to see, nothing to see, the water is fiiiine.)
> You might be surprised to learn that normally, this actually works fine
I guess I have a different experience there. A huge number of sites just outright crash (e.g., the HN search). JavaScript devs, I've learned, do not handle error cases, and the exceptions tend to just propagate out and ruin the rendering. There seems to be some popular framework out there that even destroys the whole DOM to show just the error. (I forget the text, but it's the same text, always. Always centered. Flash of page, then crash.)
I have a custom extension that fakes the cookie storage for those JS pages: it just lies and says "yeah, cookies are enabled" and then blackholes the writes. But it fails for anything that needs a real cookie … like Anubis.
I'm empathetic towards where Anubis is coming from though. But the "I passed the challenge" cookie is indistinguishable from a tracker … although probably most people running Anubis are inherently trustworthy by a sort of cultural association so long as Anubis remains non-mainstream. I think I might modify it to have the ability to store cookies for a short time frame (like 1h) in some cases, such as Anubis; that's enough to pass the challenge, but weighed against tracking. I'm usually only blocked by Anubis for something like a blog post, so that should suffice.
and that will load maybe 20MB of stuff (always the same thing) and eventually after the JS boots up a useEffect() gets called that reads '88841' out of the URL and does a GET to
which gets you nicely formatted JSON. On top of that the public id(s) are sequential integers so you could easily enumerate all the items if you just thought a little bit.
We've had more than one obnoxious crawler, which we had reason to believe was targeted specifically at us, that would go to the /web/ URL and, without a cache, download all the HTML, Javascript, and CSS, then run the JS and download the JSON for each page -- at which point they are either saving the generated HTML or scraping the DOM. If they'd spent 10 minutes playing with the browser dev tools they would have seen the /item/ request and could probably have figured out pretty quickly how to interpret the results. As is, they're going to have to figure out how to parse that HTML and turn it into something like the JSON. Doing it right could have saved them 95% of the bandwidth, 95% of the CPU, and whatever time they spent writing parsing code and managing their Rube Goldberg machine -- but I'd take 50% odds any day that they never actually did anything with the data they captured, because crawlers usually don't.
I know because I've done more than my share of web crawling and I have crawlers that: capture plain http data, can run Javascript in a limited way, and can run React apps. The last one would blast right past Anubis without any trouble except for the rate limiting which is not a lot of problem because when I crawl I hit fast, I hit hard, and I crawl once. [1] (There's a running gag in my pod that I can't visit the state of Delaware because of my webcrawling)
[1] Ok, sometimes the way you avoid trouble is hit slow, hit soft, but still hit once. It's a judgement call if you can hit them before they knew what hit them or if you can blend in with the rest of the traffic.
> I know because I've done more than my share of web crawling and I have crawlers that: capture plain http data, can run Javascript in a limited way, and can run React apps. The last one would blast right past Anubis without any trouble except for the rate limiting which is not a lot of problem because when I crawl I hit fast, I hit hard, and I crawl once.
I have no problem with bots scraping all my data, I have a problem with poorly-coded bots overloading my server, making it unusable for anybody else. I'm using Anubis on the web interface to an SVN server, so if the bots actually wanted the data, they could just run "svn co" instead of trying to scrape the history pages for 300k files.
> It seems like a whole lot of crap to me. Hostile webcrawlers, not to mention Google, frequently run Javascript these days.
I'm also rather unhappy that I had to deploy Anubis, but it's unfortunately the only thing that seemed to work, and the server load was getting so bad that the alternative was just disabling the SVN web interface altogether.
Stock mobile Chrome doesn't support extensions and doesn't have user agent manipulation capabilities unfortunately, so that's a no-go.
My options are using custom Chrome, migrating to Firefox, or proxying my traffic and making edits that way (e.g. doing the Anubis PoW there and injecting the cookie required).
Not stoked about any of these, although Firefox is a lot on my mind these days, and option #3 would be a good excuse to dust off my RPi.
Firefox's sandboxing still wasn't anywhere near as robust as Chrome's from a security perspective last time I checked, but fwiw, FF mobile has full FF extension support these days, including full-fat uBlock Origin.
I refuse to be boiled slowly by Google. With MV3, it was full-fat ad blockers that we lost. With MV4 it could very well be ALL ad blockers.
And yeah, I concede that sounds conspiratorial - as conspiratorial as Google cracking down on your ability to run the ad-blocker of your choice would've sounded a decade ago.
If you are doing JS whitelisting then in terms of security, you're already far ahead of everyone else who just has it on by default (and isn't blocking anything either.)
You say paranoid, I say sensible. My browsers are configured almost the same way. (I'm fine with temporarily enabling cookies, but scripts are unwelcome.)
Anubis has become an annoying denial-of-service layer in front of sites that I would otherwise use. I hope its no-script mode gets enabled by default soon.
We have nothing to protect sites against scrapers except making access more expensive for everyone, unless privacy-compromising or authority-trusting methods are on the table.
Making you pay time, power, bandwidth, or money to access content does not significantly impede your browsing, so long as the cost is appropriately small. For the user above reporting thirty seconds of maxed-out CPU, that’s excessive for a typical person (and we hackers are not typical).
If giving your unique burned-in crypto-attested device ID is acceptable, there’s an entire standard for that, and when your device is found to misbehave, your device can be banned. Nintendo, Sony, Xbox call this a “console ban”; it’s quite effective because it’s stunningly expensive to replace a device.
If submitting proof of citizenship through whatever attestation protocol is palatable, then Anubis could simply add the digital ID web standard and let users skip the proof of work in exchange for affirming that they have a valid digital ID. But this only works if your specific identity can be banned, or else AI crawlers will just send a valid anonymized digital ID header.
This problem repeats in every suggested outcome: either you make it more difficult for users to access a site, or you require users to waste energy to access a site, or you require identifiable information signed by a dependable third-party authority to be presented such that a ban is possible based on it. IP addresses don’t satisfy this; Apple IDs, trusted-vendor HSM-protected device identifiers, and digital passports do satisfy this.
If you have a solution that only presents barriers to excessive use and allows abusive traffic to be revoked without depending on IP address, browser fingerprint, or paid/state credentials, then you can make billions of dollars in twelve months.
Ideas welcome! This has been a problem since bots started scraping RSS feeds and republishing them as SEO blogs, and we still don’t have a solution besides Cloudflare and/or CPU-burning interstitials.
(ps. I do have a solution for this, but it would require physical builds, be mildly unprofitable over time with no growth potential, and incite government hostility towards privacy-preserving identity systems. A billionaire philanthropist could build it in a year and completely solve this problem. Sigh.)
I actually do not have a problem with digital IDs, as long as my personal identity isn't being shared alongside it. Not to the site operator, not to the government.
This might seem contradictory, but I believe it's technically possible. What I don't believe is that this is how current solutions actually work. The idea would be to prove that I am indeed a unique visitor who is a real person according to the govt, without revealing my personal info to the site, and without revealing the site to the govt, even if they collude.
Same with the whole +18 goof. I'd actually quite like to try age gated communities, like +-5 years my age. I feel a lot of conflict stems from people coming from a bit too different walks of life sometimes. Could even do high confidence location based gating this way, which could also be cool (as well as the exact opposite of cool, because of course).
The core of the identity problem is, “how can someone ban you personally without being able to identify you?”, and so far as I know, “a trusted third party checks your identity and issues you an identifier” is the best we’ve got. But of course those identifiers can be enriched, so you end up needing a single centralized third party that can issue identities and also honor site-requested bans on “the individual behind identity X”, while being audit-proven to actually enforce those bans no matter how many anonymous identities that individual generates and uses. (If you don’t run this as a monopoly, either all the providers pool their banned-identity lists to prevent ban evasion, which will eventually leak, or they’ll be compelled to do so down the road when they lose the inevitable monopoly lawsuits.)
It’s not difficult to solve this problem — the database schema and queries are dead simple! — it’s just exceedingly difficult to succeed if you're not a passport-issuing entity or an authorized monopoly of such.
I wrestle with this by just separating this concern.
In the model I described, the trust anchor would be the govt, so basically a centralized model like domain certs. This resolves the issues you list off, but brings others: what if the trust anchor isn't trustworthy and starts forging identities?
The alternative to that would then be web of trust stuff. But this is why I consider this to be a separate problem. If the core protocol could be laid out and standardized at least, then layering on another that makes this choice between centralized vs web of trust could be done separately.
Person W is welcome to have thousands of unique IDs if they want to, so long as when site X bans identity Y, that ban is applied to all of Person W’s present and future identities. Whether W has a single Y or a thousand Ys makes no difference to me. I suppose some sites will care to restrict participation to a single Y per W, but e.g. when generally browsing a site with crawler/bot/AI shielding such as Anubis today, it’s completely irrelevant to them what your Y is, so long as rate limits and bans apply to all Ys of W rather than to the presented Y alone.
So the user generates the ID for each site he visits? What prevents them from generating arbitrary IDs?
The only way I can imagine this working is:
1. You go to the government and request to have a digital ID generated.
2. The government generates a random number.
3. The government issues a request to an NGO to generate a new cryptographic object based on the random number, and receives back a retrieval number.
4. The government gives you the retrieval number, which you can use to get your digital ID from the NGO.
This way, the government only has the mapping between your identity and a random number, and the NGO only has the mapping between the random number and the generated object, with no possibility to deanonymize it because you don't present any ID to get it. Obviously, there must be no information exchange between the government and the NGO.
> So the user generates the ID for each site he visits? What prevents them from generating arbitrary IDs?
The construction would go basically like this:
pseudonym = VRF(secret_key + site_id)
The expectation is that you would have only one valid secret_key at any time, and it would be unknown to the government. This kind of scheme is called an anonymous credential in the literature, I believe. It can be established that the secret_key is govt-backed, but nothing more.
The site_id would be e.g. domain cert public key or similar (domain ownership is a moving target, so just the domain name imo is not sound).
VRF is a verifiable random function. This is the magic ZK part.
Pseudonym is what you present to the site, i.e. the identity you go by.
This way the site can verify that this pseudonym was specifically issued for it (making it site unique), and that it belongs to a govt certified identity (of which there should be only one issued at a time per person). The VRF is deterministic, guaranteeing that it's the same person every time.
Revocation is annoying so I didn't bother thinking that through but should be fairly okay I think?
I believe this is robust to people forging arbitrary IDs, to sites colluding with each other in deanonymization, and colluding with the govt in the same. The only kickers I can think of are secret_key misuse (e.g. via duress) / theft / loss / sharing, and the trust anchor (the govt) being untrustworthy (forging invalid or duplicate identities). Would also need to handle people dying, but that would be pretty much just revocation.
I consider trust anchor issues out of scope. The remainder doesn't sound too bad to try defending against, and I think it's also basically out of scope.
Potentially important edit: I'm not accounting for timing side channels here, which might be relevant during revocation or else.
Another: didn't mention but in my humble opinion cryptographically attesting people is unsound. People can't calculate crypto in their head, and can't recall long arbitrary strings of hex. What is appropriate to attest (if anything) is their devices instead. But that's a layer of complication I didn't want to deal with here.
It's an interactively generated thing, so the govt can ensure you can only complete it once, while being ignorant of its content. Or at least that's the claim of these protocols (e.g. Camenisch–Lysyanskaya (CL) signatures) afaik. I'm not sure how they work in detail.
Free credentials temporarily work, but only until someone teaches crawlers to AI-automate signups. Forum spammers figured this out a long time ago.
A while back someone from Financial Times commented on HN about how it was confusing that we were all so hostile to their free registration required article paywall. The HN community view on their registration requirement suggests that free authentication does not work and has not for some time; has that viewpoint changed in recent years?
You put stuff on the public Internet, expect it to be read by everyone.
Don't like that? Put it behind a login.
How did the propaganda persuade people into accepting mass surveillance and normalising the invasion of privacy for something that was never really a problem?
I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
What a coincidence that "identity verification" became a hot topic recently.
Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
> What a coincidence that "identity verification" became a hot topic recently.
Crying “Conspiracy” in reply to a career Chicken Little is comedic. I’ve been raising warnings about identity verification looming on the horizon for perhaps fifteen years now; thanks to DejaNews for that early realization, I suppose.
> Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
I would celebrate and tell all my friends if someone on this thread, on any thread, would explain how we solve this without bankrupting non-business site operators and without a third-party authority. Anubis is a band-aid at best, yet no better solution — not even an idea — is presented alongside your objections.
> You put stuff on the public Internet, expect it to be read by everyone.
My hobbyist forum can barely stay online eight hours a day due to crawler traffic. Someone scraped the entire site last year by spawning one request per page with no fork limit. It was down for a solid week after that, and now has very severe limits in place. I don’t know how its operators can afford to keep it running, but certainly “static only” isn’t going to solve the CPU and bandwidth costs incurred by incompetent and redundant AI crawlers. So, by making the site public on today’s infested internet, they’ve ended up with content that is no longer accessible.
> Don't like that? Put it behind a login.
As I noted above, one solution is payment, since free credential registration is no obstacle to AI bots, after all. For some reason people don’t like to charge money for hobbyist content if they can avoid it. I recognize why, and am trying my best to discover a non-monetary solution on their behalf.
> I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
I do not and have not run crawlers, AI agents, trainers, or other such shit at any point in the past thirty years, and I will continue to abstain from the entire category. That should be quite easy, as I’m a retired sysop now attending full-time accounting school and giving the finger to the entire industry to pursue work that benefits humanity. Same reason I bother telling HN “the anonymity sky is falling” every so often: I’d much prefer it if we didn’t have to sacrifice anonymity online to defeat scraper bots.
> No, no, no, hell fucking no!!
Please find a way to turn your vehemence and passion into a productive contribution, before it’s too late for all of us. As presented, your argument is neither supported nor persuasive, and your hostility only gives opponents of anonymity more arrows in their quiver to shoot at us.
And, fundamentally, it can't be open source. Bot detection (like anti-fraud more generally) is an adversarial game that relies on hidden techniques. Open-sourcing it means you lose that advantage and make life much easier for anyone trying to get around it.
The schemes large players use to increase the cost of e.g. creating new accounts on their services do in fact rely on obscurity. They target developer cost, not compute cost.
You're computing a hash in order to block automated web requests. The hash isn't the point --- the hash cost is the part of this system that isn't going to work long term.
I think there's probably a platform for it that you can open source --- the virtual machine, or the core of the virtual machine or something, but yeah, you're right, this is something Anubis will have to contend with long term; the effective solutions for this all benefit from obscurity.
I have a S24+ and Anubis often runs poorly for me and fails. I tend to frequent tech related sites so browsing on my phone has been miserable the last couple months.
I checked the value of navigator.hardwareConcurrency on my phone and it returns 9... I guess that explains it.
It looks like setting light performance mode in device optimisations (I don't game on my phone) turns off the S24's sole Cortex-X4.
Can I ask what hardware you’re using? I’ve heard similar things on the internet generally, but I’m on a several-years-old phone and it took under a second. Is the interstitial really that slow on some setups?
An AMD Ryzen 9 7950X3D has 16 physical cores and 32 logical cores. The diminishing return for thread counts above cores / 2 is likely due to using the logical core count, not the physical core count, as SMT doesn't improve every type of performance. It's not the fault of Firefox, but an aspect of the CPU design.
Is it just me, or does dividing an integer always set off some alarm bells in my head?
I'd immediately look into what happens for odd numbers, rounding, implicit type conversions etc. Or at least that's what I was taught when I first started programming.
Also, relying on "well, we know that X is always Y" is almost always a mistake; maybe not at first, but definitely in the future, because X will almost certainly stop being Y at some point. Defensive coding would catch such issues (with at the very least an assert somewhere to ensure X is indeed Y before continuing, so we get a nice error when that assumption proves wrong).
In their testing, even with odd numbers of physical cores, SMT caused an even number of logical cores. Some phones didn't have SMT, and also had an odd number of physical cores, but this was genuinely rare.
Also, they still might not (but probably learned). In this article they imply that each type of CPU core (what they call a "tier" in the article) will still be a power of two, and one just happened to be 2^0. I'm not sure they were around when the AMD Athlon II X3 was hot.
>>> Today I learned this was possible. I didn't actually think that hardware vendors shipped processors with an odd number of cores, however if you look at the core geometry of the Pixel 8 Pro, it has three tiers of processor cores. I guess every assumption that developers have about CPU design is probably wrong.
The Wii U and Xbox 360 were also triple core machines... both triple core powerpc processors with ATI graphics... Was IBM having a sale on 3 core ppc hardware that year?
I never thought about it before, but I actually had to look up die shots to make sure they were not the same processor, and if I can trust the internet, they are not. Hell, I had to confirm that yes, the PlayStation 3 (also PPC, cue X-Files theme) only had the one core and its screwball subprocessors, like I remembered.
> each type of CPU core (what they call a "tier" in the article) will still be a power of two
Yeah that's obviously not true, and believing it shows a marked lack of experience in the field. Of the current Xeon workstation lineup, only 3 of 14 SKUs have power-of-2 core counts. And there are consumer lines of CPUs with 6 cores and that sort of thing.
The line of code in the article is `Math.max(nproc / 2, 1)`. So 1 core yields 1 thread. Only CPUs with an odd number of cores, no SMT, and more than 1 core will hit this bug. Not very common.
In theory a CPU with SMT could still trigger this bug, because not every core necessarily has to have SMT. Intel made some chips that combined performance cores with SMT and efficiency cores without SMT, so if they had an odd number of E-cores they'd have ended up with an odd number of threads regardless.
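A quick illustration of the failure mode and the obvious fix, assuming the article's `Math.max(nproc / 2, 1)` expression:

```javascript
// JS division is floating point, so an odd hardwareConcurrency
// (e.g. 7 cores, no SMT) yields a fractional "thread count".
const buggy = (nproc) => Math.max(nproc / 2, 1);
const fixed = (nproc) => Math.max(Math.floor(nproc / 2), 1);

console.log(buggy(7)); // 3.5 -- not a valid number of workers
console.log(fixed(7)); // 3
console.log(fixed(1)); // 1 -- single-core still gets one worker
```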
SMT generally caused single-core CPUs to appear as 2 logical cores.
I realize Anubis was probably never tested on a true single-core machine. They are actually somewhat difficult to find these days outside of microcontrollers.
Even in microcontrollers it is starting to become increasingly rare! We've progressed to a point where sub-$1 hobbyist chips like the RP2040 are multicore these days.
How come "let's use some cool cryptography to encrypt error messages" is being considered before "let's use a strongly typed language that even web developers are starting to become fond of" as a way to prevent issues in the future?
It does two things: Force everyone (including scrapers) to run a real JS engine, and force everyone to solve the challenge.
The first effect is great, because it's a lot more annoying to bring up a full browser environment in your scraper than just run a curl command.
But the actual proof of work only takes about 10ms on a server in native code, while it can take multiple seconds on a low-end phone. Given the companies in question are building entire data centers to house all their GPUs, an extra 10ms per web page is not a problem for them. They're going to spend orders of magnitude more compute actually training on the content they scraped than solving the challenge.
It's mostly the inconvenience of adapting to Anubis's JS requirements that held them back for a while, but the PoW difficulty mostly slowed down real users.
Without getting into the alternatives: scraper defense isn't a viable proof of work setting, because there's no asymmetry to exploit. You're imposing exactly the same cost on legit users as you are on scrapers. Economies of scale mean that the marginal cost for your adversary is actually significantly lower than for your real users.
What the Anubis POW system is doing right now is exploiting the fact that there's been no need for crawlers to be anything but naive. But the cost to make them sophisticated enough to defeat the POW system is quite low, and when that happens, the POW will just be annoying legit users for no benefit.
I don't know if "mistake" is the word I'd use for it. It's not a whole lot of code! It's a reasonable first step to force crawlers to emulate a tiny fraction of a real browser. But as it evolves, it should evolve away from burning compute, because that's playing to lose.
Wait, but there is an asymmetry. Legitimate user spends at least a dozen seconds on a page, they don't care about 10ms overhead. For a scraper, however, 10ms overhead can easily be 10x the time it spends on a page overall - the scraper is now ten times slower.
However the exact PoW implementation (hash) chosen by Anubis might significantly reduce this asymmetry, because the calculation speed is highly dependent on hardware.
> Legitimate user spends at least a dozen seconds on a page, they don't care about 10ms overhead.
Unfortunately for the user on a low-end phone, the overhead can be several seconds. For the scraper it's only ever 10ms because that's running on a (relatively) powerful server CPU.
No, I don't think this is accurate. You have to look at both the cost and the benefit. If you're an AI scraper, it's literally just "what does the marginal next token of training data cost me" --- the answer is: the same as the marginal next token of content costs a reader.
Tavis Ormandy went into more detail on the math here, but it's not great!
The next word is worth less to AI scrapers than to human readers - AIs need to read thousands of articles to get as much value as a human gets from one good article. If you make it cost, say, 5c-equivalent to read an article (but without the overhead of micropayments and authorisations), human readers will happily pay that whereas AI scrapers can't afford even 1c-equivalent.
I don’t understand what you mean. Training an LLM requires orders of magnitude more tokens than any one human will ever read. Perhaps an AI company can amortize across all their users, but it would still represent a substantial cost. And I’m pretty sure the big AI companies don’t rely on abusive scraping (i.e. ignoring robots.txt), so the companies doing the scraping may not have a lot of users anyway.
Tavis Ormandy's post goes into more detail about why this isn't a substantial cost for AI vendors. For my part: we've seen POWs deployed successfully in cases where:
(1) there's a sharp asymmetry between adversaries and legitimate users (as with password hashes and KDFs, or antiabuse systems where the marginal adversarial request has value ~reciprocal to what a legit users gets, as with brute-forcing IDs)
(2) the POW serves as a kind of synchronization clock in a distributed system (as with blockchains)
An unavoidable aspect of abuse problems is that there is no perfect solution. As the defender, you’re always making a precision vs. recall tradeoff. After you’ve picked off the low hanging fruit, most of the time the only way to increase recall (i.e. catch more abuse) is by reducing the precision (i.e. having more false positives, where a good user is falsely considered an abuser).
In an adversarial engineering domain neither the problems or solutions are static. If by some miracle you have a perfect solution at one point in time, the adversaries will quickly adapt, and your solution stops being perfect.
So you’ll mostly be playing the game in this shifting gray area of maybe legit, maybe abusive cases. Since you can’t perfectly classify them (if you could, they wouldn’t be in the gray area), the options are basically to either block all of them, allow all of them, or issue them a challenge that the user must pass to be allowed. The first two options tend to be unacceptable in the gray area, so issuing a challenge that the client must pass is usually the preferred option.
A good counter-abuse challenge is something that has at least one of the following properties:
1. It costs more to pass than the economic value that the adversary can extract from the service, but not so much that the legitimate users won’t be willing to pay it.
2. It proves control of a scarce resource without necessarily having to spend that resource, but at least in such a way that the same scarce resource can’t be used to pass unlimited challenges.
3. It produces additional signals that can be used to meaningfully improve the precision/recall tradeoff.
And proof of work does none of those. The last two fail by construction, since compute is about the most fungible resource in the world. The first doesn't work because it's impossible to balance the difficulty factor such that it imposes a cost the attacker would notice while remaining acceptable to legitimate users.
If you add 10s to the latency for your worst-case real users (already too long), it'll cost about $0.01/1k solves. That's not a deterrent to any kind of abuse.
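The arithmetic behind that claim can be sketched roughly like this (all rates and prices here are illustrative assumptions, not measurements):

```javascript
// Back-of-envelope sketch of the PoW economics argument above.
// Every number below is an assumption chosen for illustration.

const slowPhoneHashRate = 1e5;  // hashes/sec on a weak client (assumed)
const worstCaseLatency = 10;    // seconds the slowest real user must wait
const difficulty = slowPhoneHashRate * worstCaseLatency; // expected hashes per solve

// Assumed cloud pricing: a vCPU at ~$0.02/hour doing ~10 MH/s of SHA-256.
const cloudHashRate = 1e7;              // hashes/sec per vCPU (assumed)
const cloudCostPerSecond = 0.02 / 3600; // dollars per vCPU-second

const secondsPerSolve = difficulty / cloudHashRate;
const costPer1kSolves = 1000 * secondsPerSolve * cloudCostPerSecond;

// Lands at fractions of a cent per thousand solves: not a deterrent.
console.log(costPer1kSolves.toFixed(4));
```

Even with generous assumptions against the attacker, the cost per thousand solves stays far below anything an organization scraping the whole web would notice.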
So proof of work is just a really bad fit for this specific use case. Its only advantage is that it's easy to implement, but that's a very short-term benefit.
In practice, any automated work that a real user is willing to wait through will be trivial to accomplish for an organization which scrapes the entire Internet. The real weight behind Anubis is the Javascript gate, not the PoW. It might as well just fetch() into browser.cookies.set().
I think Anubis bases its approach on some flawed assumptions:
- that most scrapers aren't headless browsers
- that they don't have access to millions of different IPs across the world from big/shady proxy companies
- that this can help with a real network-level DDoS
- that scrapers will give up if the requests become 'too expensive'
- that they aren't contributing to warming the planet
I'm sure there do exist some older bots that aren't smart and don't use headless browsers, but especially with newer tech/AI crawlers/etc., I don't think this is a realistic majority assumption anymore.
In part because this particular proof of work is absolutely trivial at scale: commercial hardware can do 390 TH/s, while your typical phone can only be budgeted around a million hashes and still have acceptable latency.
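Taking those two figures at face value, the asymmetry is easy to put a number on:

```javascript
// Scale comparison using the rates cited in the comment above,
// assumed accurate for illustration.
const asicHashRate = 390e12; // 390 TH/s, commercial SHA-256 mining hardware
const phoneBudget = 1e6;     // ~1 MH/s keeps interactive latency acceptable

// One dedicated miner does as much proof-of-work per second as this many phones:
const equivalentPhones = asicHashRate / phoneBudget;
console.log(equivalentPhones); // 390000000
```

A difficulty tuned so a phone finishes in a second or two is, by the same math, hundreds of millions of solves per second for dedicated hardware.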
I think I actually saw a question on SO way back during the Windows Vista era when some guy asked if Windows supported machines with odd number of cores/processors, and the answer was "well, 1 is an odd number, you know".
Another joke from the same era: Having a 2 core processor means that you can now e.g. watch a film at the same time. At the same time with what? At the same time with running Windows Vista!
Sure, but 1 is also a power of 2:
2^0 = 1
So the logic might make sense in people's heads if they never encountered the 6- or 12-core CPUs that are common these days.
Even long ago we had the AMD Phenom X3 chips which were 3 cores.
The fun thing about those is they were physically quad cores with one core disabled, which may or may not have been defective, so if you were lucky you could unlock it and get a bonus core for free.
Got a 4 core machine that way dirt cheap. Bought (a phenom II BE I think) 2 core cpu which unlocked into a quad core.
Binning made the world weird.
Xbox 360 (which ran a modified version of Win 2000) had 3 PowerPC cores.
TIL the CPU count is exposed to JS. I guess that's fine? It feels nasty, but it's not really worse than all the other fingerprinting data we expose...
Also fonts you have installed, the type of connection you're using, GPU parameters, keyboard languages on your system and so much more [1]
[1] https://abrahamjuliot.github.io/creepjs/
It’s also frequently wrong when running in Docker. Some of that is libuv’s fault, some of it is cgroups deciding not to mask off /proc values that are wrong in the cgroup.
I might be wrong but I think I heard that Firefox and maybe Safari bin that to a couple common values. (Or maybe that's just in the tracking-prevention mode that Tor uses?)
According to MDN Safari clamps it to 4 or 8, but Firefox does not: https://developer.mozilla.org/en-US/docs/Web/API/Navigator/h...
I always found it annoying that CPU information was widely available and precise while memory information was not - it's clamped to 0.25, 0.5, 1, 2, 4 or 8 GB. If you're running something memory-bound in the browser you have to be really conservative to avoid locking up the user's device (or ask them to manually specify how much memory to use). https://developer.mozilla.org/en-US/docs/Web/API/Device_Memo...
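A sketch of the kind of conservative budgeting this clamping forces (the half-of-reported budget fraction and the 1 GiB fallback are arbitrary assumptions for illustration):

```javascript
// navigator.deviceMemory only reports 0.25/0.5/1/2/4/8 GiB, and the real
// amount can be anywhere in or above the reported bucket, so a memory-bound
// web app has to treat the reported value as a rough floor.
function memoryBudgetMiB(deviceMemoryGiB) {
  // Fall back to a small value if the API is unavailable (assumed default).
  const reported = deviceMemoryGiB ?? 1;
  // Spend at most half of the reported floor to avoid locking up the device.
  return Math.floor(reported * 1024 * 0.5);
}

// In a browser this would be called as: memoryBudgetMiB(navigator.deviceMemory)
console.log(memoryBudgetMiB(8));         // 4096
console.log(memoryBudgetMiB(0.25));      // 128
console.log(memoryBudgetMiB(undefined)); // 512
```

Anything beyond this kind of guesswork requires asking the user directly, as the comment notes.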
I wonder why he doesn't just set up a virtual machine with an odd number of vCPUs for testing.
> In retrospect implementing the proof of work challenge may have been a mistake and it's likely to be supplanted by things like Proof of React or other methods that have yet to be developed.
> ... a challenge method that requires the client to do a single round of SHA-256 hashing deeply nested into a Preact hook in order to prove that the client is running JavaScript.
Why a single round? Doing the whole proof of work challenge inside the proof of react would be even more effective, right?
It wouldn't be any more effective at determining that javascript is executing, and it would use a lot more resources on the client
The whole Anubis thing is a really interesting predicament for me.
I have Chrome on mobile configured as such that JS and cookies are disabled by default, and then I enable them per site based on my judgement. You might be surprised to learn that normally, this actually works fine, and sites are usually better for it. They stop nagging, and load faster. This makes some sense in retrospect, as this is what allows search engine crawlers to do their thing and get that SEO score going.
Anubis (and Cloudflare, for that matter) nevertheless forces me to temporarily enable JS and cookies at least once, completely defeating the purpose of my paranoid settings. I basically never bother to, but I do admit it is annoying. It's up there with sites that have no content at all without JS (high-profile example: AWS docs). At least Cloudflare only spoils the fun every now and then. With Anubis, it's always.
It's definitely my fault, but at the same time, I don't feel this is right. Simple static pages now require allowing arbitrary code execution and statefulness. (Although I do recognize that SVGs and fonts also kind of do so anyhow, much to my further annoyance).
> I have Chrome on mobile configured as such that JS and cookies are disabled by default
My God, there's two of us!
(Though … you're being privacy conscious on Chrome? Come to Firefox. Ignore the pesky "it's funded by Google" problems, nothing to see, nothing to see, the water is fiiiine.)
> You might be surprised to learn that normally, this actually works fine
I guess I have a different experience there. A huge number of sites just outright crash. (E.g., the HN search.) JavaScript devs, I've learned, do not handle error cases, and the exceptions tend to just propagate out and ruin the rendering. There seems to be some popular framework out there that even destroys the whole DOM to emit just the error. (I forget the text, but it's the same text, always. Always centered. Flash of page, then crash.)
I have a custom extension that fakes the cookie storage for those JS pages: it just lies and says "yeah, cookies are enabled" and then blackholes the writes. But it fails for anything that needs a real cookie … like Anubis.
I'm empathetic towards where Anubis is coming from though. But the "I passed the challenge" cookie is indistinguishable from a tracker … although probably most people running Anubis are inherently trustworthy by a sort of cultural association so long as Anubis remains non-mainstream. I think I might modify it to have the ability to store cookies for a short time frame (like 1h) in some cases, such as Anubis; that's enough to pass the challenge, but weighed against tracking. I'm usually only blocked by Anubis for something like a blog post, so that should suffice.
It seems like a whole lot of crap to me. Hostile webcrawlers, not to mention Google, frequently run Javascript these days.
Where I work our main product is a React-based web site with a JSON back end, you might go to
http://example.com/web/item/88841
and that will load maybe 20MB of stuff (always the same thing) and eventually after the JS boots up a useEffect() gets called that reads '88841' out of the URL and does a GET to
http://example.com/api/item/88841
which gets you nicely formatted JSON. On top of that the public id(s) are sequential integers so you could easily enumerate all the items if you just thought a little bit.
We've had more than one obnoxious crawler that we had reason to believe was targeted specifically at us. It would go to the /web/ URL and, without a cache, download all the HTML, Javascript, and CSS, then run the JS and download the JSON for each page -- at which point they are either saving the generated HTML or looking at the DOM. If they'd spent 10 minutes playing with the browser dev tools they would have seen the /item/ request and could probably have figured out pretty quickly how to interpret the results. As it is, they have to figure out how to parse that HTML back into something like the JSON. The direct API would have saved them 95% of the bandwidth, 95% of the CPU, and whatever time they spent writing parsing code and managing their Rube Goldberg machine -- but I'd take 50% odds any day that they never actually did anything with the data they captured, because crawlers usually don't.
I know because I've done more than my share of web crawling and I have crawlers that: capture plain http data, can run Javascript in a limited way, and can run React apps. The last one would blast right past Anubis without any trouble except for the rate limiting which is not a lot of problem because when I crawl I hit fast, I hit hard, and I crawl once. [1] (There's a running gag in my pod that I can't visit the state of Delaware because of my webcrawling)
[1] Ok, sometimes the way you avoid trouble is hit slow, hit soft, but still hit once. It's a judgement call if you can hit them before they knew what hit them or if you can blend in with the rest of the traffic.
> I know because I've done more than my share of web crawling and I have crawlers that: capture plain http data, can run Javascript in a limited way, and can run React apps. The last one would blast right past Anubis without any trouble except for the rate limiting which is not a lot of problem because when I crawl I hit fast, I hit hard, and I crawl once.
I have no problem with bots scraping all my data, I have a problem with poorly-coded bots overloading my server, making it unusable for anybody else. I'm using Anubis on the web interface to an SVN server, so if the bots actually wanted the data, they could just run "svn co" instead of trying to scrape the history pages for 300k files.
> It seems like a whole lot of crap to me. Hostile webcrawlers, not to mention Google, frequently run Javascript these days.
I'm also rather unhappy that I had to deploy Anubis, but it's unfortunately the only thing that seemed to work, and the server load was getting so bad that the alternative was just disabling the SVN web interface altogether.
It's definitely not your fault. Don't give in to the monopoly of Big Browser and mass surveillance.
Incidentally, I read a short while ago that not having "Mozilla" in your user-agent will bypass Anubis, so give that a try.
Stock mobile Chrome doesn't support extensions and doesn't have user agent manipulation capabilities unfortunately, so that's a no-go.
My options are using custom Chrome, migrating to Firefox, or proxying my traffic and making edits that way (e.g. doing the Anubis PoW there and injecting the cookie required).
Not stoked about any of these, although Firefox is a lot on my mind these days, and option #3 would be a good excuse to dust off my RPi.
Firefox's sandboxing still wasn't anywhere near as robust as Chrome's from a security perspective last time I checked, but fwiw, FF mobile has full FF extension support these days, including full-fat uBlock Origin.
I refuse to be boiled slowly by Google. With MV3, it was full-fat ad blockers. With MV4 it could very well be ALL ad-blockers.
And yeah, I concede that sounds conspiratorial - as conspiratorial as Google cracking down on your ability to run the ad-blocker of your choice would've sounded a decade ago.
If you are doing JS whitelisting then in terms of security, you're already far ahead of everyone else who just has it on by default (and isn't blocking anything either.)
anubis-bypass uses the User-Agent modification approach:
https://gitlab.com/zipdox/anubis-bypass
You say paranoid, I say sensible. My browsers are configured almost the same way. (I'm fine with temporarily enabling cookies, but scripts are unwelcome.)
Anubis has become an annoying denial-of-service layer in front of sites that I would otherwise use. I hope its no-script mode gets enabled by default soon.
I just bounce off those sites most of the time. Whatever, there’s still a lot of open internet.
We have nothing to protect sites against scrapers except making access more expensive for everyone, unless privacy-compromising or authority-trusting methods are on the table.
Making you pay time, power, bandwidth, or money to access content does not significantly impede your browsing, so long as the cost is appropriately small. For the user above reporting thirty seconds of maxed-out CPU, that's excessive for a median normal person (and we hackers are not that).
If giving your unique burned-in crypto-attested device ID is acceptable, there’s an entire standard for that, and when your device is found to misbehave, your device can be banned. Nintendo, Sony, Xbox call this a “console ban”; it’s quite effective because it’s stunningly expensive to replace a device.
If submitting proof of citizenship through whatever attestation protocol is palatable, then Anubis could simply add the digital ID web standard and let users skip the proof of work in exchange for affirming that they have a valid digital ID. But this only works if your specific identity can be banned, or else AI crawlers will just send a valid anonymized digital ID header.
This problem repeats in every suggested outcome: either you make it more difficult for users to access a site, or you require users to waste energy to access a site, or you require identifiable information signed by a dependable third-party authority to be presented such that a ban is possible based on it. IP addresses don’t satisfy this; Apple IDs, trusted-vendor HSM-protected device identifiers, and digital passports do satisfy this.
If you have a solution that only presents barriers to excessive use and allows abusive traffic to be revoked without depending on IP address, browser fingerprint, or paid/state credentials, then you can make billions of dollars in twelve months.
Ideas welcome! This has been a problem since bots started scraping RSS feeds and republishing them as SEO blogs, and we still don’t have a solution besides Cloudflare and/or CPU-burning interstitials.
(ps. I do have a solution for this, but it would require physical builds, be mildly unprofitable over time with no growth potential, and incite government hostility towards privacy-preserving identity systems. A billionaire philanthropist could build it in a year and completely solve this problem. Sigh.)
I actually do not have a problem with digital IDs, as long as my personal identity isn't being shared alongside it. Not to the site operator, not to the government.
This might seem contradictory, but I believe it is technically possible? What I don't think is that this is how current solutions actually work. The idea would be to prove that I am indeed a unique visitor who is a person according to the govt, without revealing my personal info to the site, and without revealing the site to the govt, even if they collude.
Same with the whole +18 goof. I'd actually quite like to try age gated communities, like +-5 years my age. I feel a lot of conflict stems from people coming from a bit too different walks of life sometimes. Could even do high confidence location based gating this way, which could also be cool (as well as the exact opposite of cool, because of course).
The core of the identity problem is: how can someone ban you personally without being able to identify you? So far as I know, "a trusted third party checks your identity and issues you an identifier" is the best we've got. But of course those identifiers can be enriched, so you end up needing a single centralized third party that can issue identities and also honor bans of "the individual behind identity X" at a site's request, while being audit-proven to actually enforce those bans on me regardless of how many anonymous identities I choose to generate and use. (If you don't run this as a monopoly, either the providers pool their banned-identity lists to prevent ban evasion, which will eventually leak, or they'll be compelled to someday down the road when they lose the inevitable monopoly lawsuits.)
It’s not difficult to solve this problem — the database schema and queries are dead simple! — it’s just exceedingly difficult to succeed if you're not a passport-issuing entity or an authorized monopoly of such.
I wrestle with this by just separating this concern.
In the model I described, the trust anchor would be the govt, so basically a centralized model like domain certs. This resolves the issues you list off, but brings others: what if the trust anchor isn't trustworthy and starts forging identities?
The alternative to that would then be web of trust stuff. But this is why I consider this to be a separate problem. If the core protocol could be laid out and standardized at least, then layering on another that makes this choice between centralized vs web of trust could be done separately.
https://www.w3.org/TR/vc-overview/
Specifically, only something government sized is trustworthy enough to not plainly sell your data later... Or perhaps get it stolen.
Scratch that, this happens all the time. With a third party there's no way to revoke, government you can usually physically handle this.
Assuming a person can only have a single ID, how would that be enforced without a unique party having a 1-to-1 mapping between person and ID?
Why assume that?
Person W is welcome to have thousands of unique IDs if they want to, so long as when site X bans identity Y, that ban is applied to all of Person W’s present and future identities. Whether W has a single Y or a thousand Y makes no difference to me. I suppose some sites will care to restrict participation to a single Y per W, but e.g. in the general browsing a site with crawler/bot/AI shielding such as Anubis today, it’s completely irrelevant to them what your Y is so long as rate limits and bans apply to all Y of W rather than to the presented Y alone.
The party holding the govt-ID-to-site-ID mapping would be you. The rest can then be facilitated using zero-knowledge proofs, I believe.
I'm not super well-versed in crypto though, so I confess this is a lot more conjecture than knowledge.
So the user generates the ID for each site he visits? What prevents them from generating arbitrary IDs?
The only way I can imagine this working is:
1. You go to the government and request to have a digital ID generated.
2. The government generates a random number.
3. The government issues a request to an NGO to generate a new cryptographic object based on the random number, and receives back a retrieval number.
4. The government gives you the retrieval number, which you can use to get your digital ID from the NGO.
This way, the government only has the mapping between your identity and a random number, and the NGO only has the mapping between the random number and the generated object, with no possibility to deanonymize it because you don't present any ID to get it. Obviously, there must be no information exchange between the government and the NGO.
> So the user generates the ID for each site he visits? What prevents them from generating arbitrary IDs?
The construction would go basically like this:
pseudonym = VRF(secret_key + site_id)
The expectation is that you would have only one valid secret_key at any time, and it would be unknown to the government. This kind of scheme is called an anonymous credential in the literature, I believe. It can be established that the secret_key is govt-backed, but that's it.
The site_id would be e.g. domain cert public key or similar (domain ownership is a moving target, so just the domain name imo is not sound).
VRF is a verifiable random function. This is the magic ZK part.
Pseudonym is what you present to the site, i.e. the identity you go by.
This way the site can verify that this pseudonym was specifically issued for it (making it site unique), and that it belongs to a govt certified identity (of which there should be only one issued at a time per person). The VRF is deterministic, guaranteeing that it's the same person every time.
Revocation is annoying so I didn't bother thinking that through but should be fairly okay I think?
I believe this is robust to people forging arbitrary IDs, to sites colluding with each other in deanonymization, and colluding with the govt in the same. The only kickers I can think of are secret_key misuse (e.g. via duress) / theft / loss / sharing, and the trust anchor (the govt) being untrustworthy (forging invalid or duplicate identities). Would also need to handle people dying, but that would be pretty much just revocation.
I consider trust anchor issues out of scope. The remainder doesn't sound too bad to try defending for, and I think is also basically out of scope.
Potentially important edit: I'm not accounting for timing side channels here, which might be relevant during revocation or else.
Another: didn't mention but in my humble opinion cryptographically attesting people is unsound. People can't calculate crypto in their head, and can't recall long arbitrary strings of hex. What is appropriate to attest (if anything) is their devices instead. But that's a layer of complication I didn't want to deal with here.
>The expectation is that you would have only one valid secret_key at any time
Why, though? If you're the only one who knows it, nothing prevents you from creating as many identities for the same site as you wish.
It's an interactively generated thing, so the govt can ensure you can only complete it once, while being ignorant of its content. Or at least that's the claim of these protocols (e.g. Camenisch–Lysyanskaya (CL) signatures) afaik. I'm not sure how they work in detail.
> We have nothing to protect sites against scrapers except to make it more expensive for everyone’s (...)
Authentication works, doesn't it?
Paid credentials work, yes.
Free credentials temporarily do, but only until someone teaches crawlers to AI-automate signups. Forum spammers figured this out a long time ago.
A while back someone from Financial Times commented on HN about how it was confusing that we were all so hostile to their free registration required article paywall. The HN community view on their registration requirement suggests that free authentication does not work and has not for some time; has that viewpoint changed in recent years?
> so long as the cost is appropriately small.
there are different metrics for cost, however. Based on cpu utilization and/or time, it's hard to argue that Anubis is a high price.
But if it is important to you to not run javascript for whatever reason, the price of access to a site using Anubis is rather high.
No, no, no, hell fucking no!!
You put stuff on the public Internet, expect it to be read by everyone.
Don't like that? Put it behind a login.
How did the propaganda persuade people into accepting mass surveillance and normalising the invasion of privacy for something that was never really a problem?
I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
What a coincidence that "identity verification" became a hot topic recently.
Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
> What a coincidence that "identity verification" became a hot topic recently.
Crying “Conspiracy” in reply to a career Chicken Little is comedic. I’ve been raising warnings about identity verification looming on the horizon for perhaps fifteen years now; thanks to DejaNews for that early realization, I suppose.
> Don't even try to solve this nonexistent "problem", because that will only drive us further into authoritarian technocracy.
I would celebrate and tell all my friends if someone on this thread, on any thread, would explain how we solve this without bankrupting non-business site operators and without a third-party authority. Anubis is a band-aid at best, yet no better solution — not even an idea — is presented alongside your objections.
> You put stuff on the public Internet, expect it to be read by everyone.
My hobbyist forum can barely stay online eight hours a day due to crawler traffic. Someone scraped the entire site by spawning one request per page with no fork limit last year. It was down for a solid week after that, and now has very severe limits in place. I don’t know how they can afford to stay running, but certainly “static only” isn’t going to solve the CPU and bandwidth costs incurred by incompetent and redundant AI crawlers. So, by making their site public in today’s infested internet, their content is no longer accessible.
> Don't like that? Put it behind a login.
As I noted above, one solution is payment — since free credentials registration is not an obstacle to AI bots, after all. For some reason people don’t like to charge money for hobbyist content if they can avoid it. I recognize why and am trying my best to discover a non-monetary solution on their behalf.
> I suspect the ones pushing this are the ones running these "crawlers" themselves, much like Cloudflare hosts providers of DDoS services.
I do not, have not, and will not run crawlers or AI agents, trainers, or other such shit at any time in the past thirty years and will continue to abstain from the entire category, which should be quite easy as I’m a retired sysop now attending full-time accounting school and giving a finger to the entire industry to pursue work that benefits humanity. Same reason I bother telling HN “the anonymity sky is falling” every so often: I’d much prefer it if we didn’t have to sacrifice anonymity online to defeat scraper bots.
> No, no, no, hell fucking no!!
Please find a way to turn your vehemence and passion into a productive contribution, before it’s too late for all of us. As presented, your argument is neither supported nor persuasive, and your hostility only gives opponents of anonymity more arrows in their quiver to shoot at us.
We very definitely do have stuff to protect sites that doesn't make access more expensive for everyone! It's just that none of it is open source.
And, fundamentally it can't be opensource. Bot detection (like anti-fraud more generally) is an adversarial game that relies on hidden techniques. Open-sourcing it means you lose that advantage and make life much easier for anyone trying to get around it.
There's zero reason it cannot be opensource. Proof-of-XXXXX schemes do not rely on obscurity to be functional.
The schemes large players use to increase the cost of e.g. creating new accounts on their services do in fact rely on obscurity. They target developer cost, not compute cost.
But that's not what Anubis (the subject of TFA, and most of the comment threads here) is/are about.
I think it's exactly what Anubis is about? I'm pretty familiar with what Xe is doing.
...You're computing a hash, not making accounts or anything else that relies on obscurity though?
You're computing a hash in order to block automated web requests. The hash isn't the point --- the hash cost is the part of this system that isn't going to work long term.
I think there's probably a platform for it that you can open source --- the virtual machine, or the core of the virtual machine or something, but yeah, you're right, this is something Anubis will have to contend with long term; the effective solutions for this all benefit from obscurity.
SQRL solves at least part of the ID problem.
I have an S24+ and Anubis often runs poorly for me and fails. I tend to frequent tech-related sites, so browsing on my phone has been miserable the last couple months.
I checked the value of navigator.hardwareConcurrency on my phone and it returns 9... I guess that explains it.
It looks like setting light performance mode in device optimisations (I don't game on my phone) turns off the S24's sole Cortex-X4.
Sometimes cores are fractional. Particularly thanks to Docker. I’m currently trying to get this fixed in several NodeJS situations.
If I must enable JS for your site to do cryptographic work, well, maybe I'll live without your site. Seriously, defend yourself some other way.
Ironically, this sat on the intermission page for a good half-minute while my fans spun up. Then I gave up; it was eating the battery.
Can I ask what hardware you’re using? I’ve heard similar things on the internet generally, but I’m on a several-years-old phone and it took under a second. Is the interstitial really that slow on some setups?
I do a lot of random browsing on an old iPad. Which doesn't have fans, I know, that was short for "it got really hot".
I'm not sure what generation it is, but I bought it around a decade ago I think.
Old browsers without crypto support would fall back to pure js sha256 implementation, which I imagine would be slow on an old iPad.
An AMD Ryzen 9 7950X3D has 16 physical cores and 32 logical cores. The diminishing return for thread counts above cores / 2 is likely due to using the logical core count, not the physical core count, as SMT doesn't improve every type of performance. It's not the fault of Firefox, but an aspect of the CPU design.
Assuming core count to be even seems pretty oblivious to me.
Is it just me, or does dividing an integer always set off some alarm bells?
I'd immediately look into what happens for odd numbers, rounding, implicit type conversions etc. Or at least that's what I was taught when I first started programming.
Also, relying on "well, we know that X is always Y" is almost always a mistake; maybe not at first, but definitely in the future, because X will almost certainly stop being Y at some point. Defensive coding would catch such issues (with, at the very least, an assert somewhere to ensure X is indeed Y before continuing, so we get a nice error when that assumption proves to be wrong).
Wait, the Anubis people _didn't know_ 3 core machines were sold for years? AMD was famous for it!
In their testing, even with odd numbers of physical cores, SMT caused an even number of logical cores. Some phones didn't have SMT, and also had an odd number of physical cores, but this was genuinely rare.
Also, they still might not (but have probably learned by now). In this article they imply that each type of CPU core (what they call a "tier" in the article) will still be a power of two, and one tier just happened to be 2^0. I'm not sure they were around when the AMD Athlon II X3 was hot.
>>> Today I learned this was possible. This was a total "today I learned" moment. I didn't actually think that hardware vendors shipped processors with an odd number of cores, however if you look at the core geometry of the Pixel 8 Pro, it has three tiers of processor cores. I guess every assumption that developers have about CPU design is probably wrong.
The Wii U and Xbox 360 were also triple core machines... both triple core powerpc processors with ATI graphics... Was IBM having a sale on 3 core ppc hardware that year?
I never thought about it before, but I actually had to look up die shots to make sure they were not the same processor, and if I can trust the internet, they are not. Hell, I had to confirm that yes, the PlayStation 3 (also PPC, cue X-Files theme) only had the one core plus its screwball subprocessors, like I remembered.
Interestingly, the single PPE core in the PS3 and the 3 cores in the Xbox 360, are pretty much the same core.
When AMD shipped their X3 CPUs I'm pretty sure they didn't support Hyper Threading either.
The current Apple TV has 5 cores. No web browser though.
The 11th-gen iPad uses a 5-core A16. The 14” M3 Pro MBP exists in an 11-core version (MRX33 / MRX63). There's also a 9-core M4, used in the iPad Pro.
> each type of CPU core (what they call a "tier" in the article) will still be a power of two
Yeah that's obviously not true, and believing it shows a marked lack of experience in the field. Of the current Xeon workstation lineup, only 3 of 14 SKUs have power-of-2 core counts. And there are consumer lines of CPUs with 6 cores and that sort of thing.
I believe that the assumption was multiple of two, not power of two.
Yeah I both terribly mistyped and misrepresented Anubis' assumption. I'm sorry for that error.
What about... single-core machines?
The line of code in the article is `Math.max(nproc / 2, 1)`. So 1 core yields 1 thread. Only CPUs with an odd number of cores, no SMT, and more than 1 core will hit this bug. Not very common.
In theory a CPU with SMT could still trigger this bug, because not every core necessarily has to have SMT. Intel made some chips that combined performance cores with SMT and efficiency cores without SMT, so if they had an odd number of E-cores they'd have ended up with an odd number of threads regardless.
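The failure mode is easy to reproduce: `navigator.hardwareConcurrency` is just a number with no evenness guarantee, and JavaScript's `/` is floating-point division. Simulating the quoted line on a 3-core device:

```javascript
// Stand-in for navigator.hardwareConcurrency on a 3-core, no-SMT device.
const nproc = 3;
// The line quoted from the article: JS division is floating-point,
// so this produces a fractional "thread count".
const threads = Math.max(nproc / 2, 1); // 1.5
```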
You can also just boot linux with maxcpus=5 or any other number. Believing things about the parity of the number of CPUs is just nuts.
SMT generally causes single-core CPUs to appear as 2 logical cores.
I realize Anubis was probably never tested on a true single-core machine. They are actually somewhat difficult to find these days outside of microcontrollers.
Even in microcontrollers it is starting to become increasingly rare! We've progressed to a point where sub-$1 hobbyist chips like the RP2040 are multicore these days.
How come "let's use some cool cryptography to encrypt error messages" is being considered before "let's use a strongly typed language that even web developers are starting to become fond of" as a way to prevent issues in the future?
> I guess every assumption that developers have about CPU design is probably wrong.
JavaScripters, perhaps. Those who work on schedulers, or on kernels in general, would find this completely normal.
The only bug is the use of a cursed language such as JavaScript!
> In retrospect implementing the proof of work challenge may have been a mistake
Why?
What would the alternative have been?
It does two things: Force everyone (including scrapers) to run a real JS engine, and force everyone to solve the challenge.
The first effect is great, because it's a lot more annoying to bring up a full browser environment in your scraper than just run a curl command.
But the actual proof of work only takes about 10ms on a server in native code, while it can take multiple seconds on a low-end phone. Given that the companies in question are building entire data centers to house all their GPUs, an extra 10ms per web page is not a problem for them. They're going to spend orders of magnitude more compute actually training on the content they scraped than solving the challenge.
It's mostly the inconvenience of adapting to Anubis's JS requirements that held them back for a while, but the PoW difficulty mostly slowed down real users.
I'd have thought that anyone doing serious scraping has been driving a browser for years at this point.
you can even get a curl fork that drives a browser under the hood
It does 2.5 things actually - it counters the aggressive use of proxies by scrapers.
Without getting into the alternatives: scraper defense isn't a viable proof of work setting, because there's no asymmetry to exploit. You're imposing exactly the same cost on legit users as you are on scrapers. Economies of scale mean that the marginal cost for your adversary is actually significantly lower than for your real users.
What the Anubis POW system is doing right now is exploiting the fact that there's been no need for crawlers to be anything but naive. But the cost to make them sophisticated enough to defeat the POW system is quite low, and when that happens, the POW will just be annoying legit users for no benefit.
I don't know if "mistake" is the word I'd use for it. It's not a whole lot of code! It's a reasonable first step to force crawlers to emulate a tiny fraction of a real browser. But as it evolves, it should evolve away from burning compute, because that's playing to lose.
Wait, but there is an asymmetry. Legitimate user spends at least a dozen seconds on a page, they don't care about 10ms overhead. For a scraper, however, 10ms overhead can easily be 10x the time it spends on a page overall - the scraper is now ten times slower.
However the exact PoW implementation (hash) chosen by Anubis might significantly reduce this asymmetry, because the calculation speed is highly dependent on hardware.
> Legitimate user spends at least a dozen seconds on a page, they don't care about 10ms overhead.
Unfortunately for the user on a low-end phone, the overhead can be several seconds. For the scraper it's only ever 10ms because that's running on a (relatively) powerful server CPU.
I don't know of any network latency <=1ms over the public internet, so 10ms overhead might be 2x at best.
No, I don't think this is accurate. You have to look at both the cost and the benefit. If you're an AI scraper, it's literally just "what does the marginal next token of training data cost me" --- the answer is: the same as the marginal next token of content costs a reader.
Tavis Ormandy went into more detail on the math here, but it's not great!
The next word is worth less to AI scrapers than to human readers - AIs need to read thousands of articles to get as much value as a human gets from one good article. If you make it cost, say, 5c-equivalent to read an article (but without the overhead of micropayments and authorisations), human readers will happily pay that whereas AI scrapers can't afford even 1c-equivalent.
I don’t understand what you mean. Training an LLM requires orders of magnitude more tokens than any one human will ever read. Perhaps an AI company can amortize across all their users, but it would still represent a substantial cost. And I’m pretty sure the big AI companies don’t rely on abusive scraping (i.e. ignoring robots.txt), so the companies doing the scraping may not have a lot of users anyway.
Tavis Ormandy's post goes into more detail about why this isn't a substantial cost for AI vendors. For my part: we've seen POWs deployed successfully in cases where:
(1) there's a sharp asymmetry between adversaries and legitimate users (as with password hashes and KDFs, or antiabuse systems where the marginal adversarial request has value ~reciprocal to what a legit users gets, as with brute-forcing IDs)
(2) the POW serves as a kind of synchronization clock in a distributed system (as with blockchains)
What's case (3) here?
An unavoidable aspect of abuse problems is that there is no perfect solution. As the defender, you’re always making a precision vs. recall tradeoff. After you’ve picked off the low hanging fruit, most of the time the only way to increase recall (i.e. catch more abuse) is by reducing the precision (i.e. having more false positives, where a good user is falsely considered an abuser).
In an adversarial engineering domain, neither the problems nor the solutions are static. If by some miracle you have a perfect solution at one point in time, the adversaries will quickly adapt, and your solution stops being perfect.
So you’ll mostly be playing the game in this shifting gray area of maybe legit, maybe abusive cases. Since you can’t perfectly classify them (if you could, they wouldn’t be in the gray area), the options are basically to either block all of them, allow all of them, or issue them a challenge that the user must pass to be allowed. The first two options tend to be unacceptable in the gray area, so issuing a challenge that the client must pass is usually the preferred option.
A good counter-abuse challenge is something that has at least one of the following properties:
1. It costs more to pass than the economic value that the adversary can extract from the service, but not so much that the legitimate users won’t be willing to pay it.
2. It proves control of a scarce resource without necessarily having to spend that resource, but at least in such a way that the same scarce resource can’t be used to pass unlimited challenges.
3. It produces additional signals that can be used to meaningfully improve the precision/recall tradeoff.
And proof of work has none of those properties. The last two fail by construction, since compute is about the most fungible resource in the world. The first doesn't work since it's impossible to balance the difficulty factor such that it imposes a cost the attacker would notice while remaining acceptable to the defender's real users.
If you add 10s to the latency for your worst-case real users (already too long), it'll cost about $0.01/1k solves. That's not a deterrent to any kind of abuse.
So proof of work just is a really bad fit for this specific use case. The only advantage is that it is easy to implement, but that's a very short term benefit.
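The arithmetic behind that kind of figure, with loudly assumed numbers (phone hash rate, vCPU speed, vCPU price are all my guesses, not measured), lands in the same ballpark:

```javascript
// Back-of-envelope cost of solving at scale. All inputs are assumptions.
const phoneHashRate = 1e6;     // hashes/sec on a low-end phone (assumed)
const solveTimePhone = 10;     // seconds of worst-case user latency
const hashesPerSolve = phoneHashRate * solveTimePhone;   // 1e7 hashes

const serverHashRate = 100e6;  // hashes/sec on one cloud vCPU (assumed)
const vcpuCostPerHour = 0.04;  // USD per vCPU-hour (assumed)
const secondsPerSolve = hashesPerSolve / serverHashRate; // 0.1 s
const costPer1kSolves = 1000 * secondsPerSolve * vcpuCostPerHour / 3600;
// ~$0.001 per 1k solves: same order as above, and no deterrent either way.
```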
In practice, any automated work that a real user is willing to wait through will be trivial to accomplish for an organization which scrapes the entire Internet. The real weight behind Anubis is the Javascript gate, not the PoW. It might as well just fetch() into browser.cookies.set().
They also suggest maybe “proof of React” would be better with a link to this rough proof of concept:
https://github.com/TecharoHQ/anubis/pull/1038
Could someone explain how this would help stop scrapers? If you’re just running the page JS wouldn’t this run too and let you through?
Low-effort scrapers don't run JS, they just fetch static content.
But then they couldn't get past the current Anubis. So is the idea that it would just be cheaper for clients?
That's the idea. Impose software requirements on the client instead of computational requirements.
They admitted that this was a 'shitpost'.
> how this would help stop scrapers
I think anubis bases its purpose on some flawed assumptions:
- that most scrapers aren't headless browsers
- that they don't have access to millions of different IPs across the world from big/shady proxy companies
- that this can help with a real network-level DDoS
- that scrapers will give up if the requests become 'too expensive'
- that they aren't contributing to warming the planet
I'm sure there does exist some older bots that are not smart and don't use headless browsers, but especially with newer tech/AI crawlers/etc., I don't think this is a realistic majority assumption anymore.
In part because this particular proof of work is absolutely trivial at scale: commercial SHA-256 hardware can do 390 TH/s, while your typical phone can only manage on the order of a million hashes per second and still keep acceptable latency.
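Taking those figures at face value (both are rounded, and the phone number is a rough estimate), the gap is stark:

```javascript
// Hardware asymmetry using the figures above (assumed/rounded).
const asicRate = 390e12;  // 390 TH/s for commercial SHA-256 hardware
const phoneRate = 1e6;    // ~1 MH/s on a phone with acceptable latency
const advantage = asicRate / phoneRate; // 3.9e8: nearly nine orders of magnitude
```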