Thousands of systems, from Google to script kiddies to OpenAI to Nigerian call scammers to cybersecurity firms, actively watch the certificate transparency logs for exactly this reason. Yawn.
For those that never looked at the CT logs: https://crt.sh/?q=ycombinator.com
(the site may occasionally fail to load)
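If you'd rather poke at this programmatically, crt.sh also has a JSON output mode. A minimal Python sketch (assuming the requests library; the field names are the ones I've seen in its responses, so double-check against a live response):

    import requests

    # Ask crt.sh for all logged certificates matching a domain, as JSON.
    # Like the HTML page, this can be slow or time out when the site is under load.
    resp = requests.get(
        "https://crt.sh/",
        params={"q": "ycombinator.com", "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()

    for entry in resp.json():
        # name_value may hold several SANs separated by newlines.
        for name in entry.get("name_value", "").splitlines():
            print(entry.get("not_before"), name)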
Considering how it must be getting hammered by all the "AI" nonsense, it's impressive that crt.sh remains usable, particularly the (limited) direct PostgreSQL db access.
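For reference, that direct access looks roughly like this (the connection parameters below are the commonly cited ones for the public read-only mirror, so treat them as an assumption and verify on crt.sh itself):

    # Connect to crt.sh's public read-only PostgreSQL endpoint; no password is needed.
    psql -h crt.sh -p 5432 -U guest certwatch

    # Once connected, \d lists the available tables and views, which is the easiest
    # way to explore the schema before writing any real queries.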
To me, this is evidence that SQL databases with high traffic can be made directly accessible on the public internet
crt.sh seems to be more accessible at certain times of the day. I can remember when it had no such accessibility issues
Shameless plug :)
https://www.merklemap.com/search?query=ycombinator.com&page=...
Entries are indexed by subdomain instead of by certificate (click an entry to see all certificates for that subdomain).
Also, you can search for any substring (that was quite the journey to implement so it's fast enough across almost 5B entries):
https://www.merklemap.com/search?query=ycombi&page=0
Any insights you can share on how you made search so fast? What kind of resources does it take to implement it?
With that said, given that (1) pre-certificates in the log are big and (2) lifetimes are shortening and so there will be a lot of duplicates, it seems like it would be good for someone to make a feed that was just new domain names.
There's an extension to static-ct-api, currently implemented by Sunlight logs, that provides a feed of just SANs and CNs. For example: https://github.com/FiloSottile/sunlight/blob/main/names-tile...
(It doesn't deduplicate if the same domain name appears in multiple certificates, but it's still a substantial reduction in bandwidth compared to serving the entire (pre)certificate.)

These exist for apex domains; the real use-case is subdomains.
Sure, but the subdomains will be duplicated for the same reasons.
Merklemap offers that: https://www.merklemap.com/documentation/live-tail
"... for exacty this reason."
Needs clarification: what reason?
The CT log tells you about new websites as soon as they come online. Good if you're intending to scrape the web.
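To make that concrete, here's a rough sketch of what watching a log looks like against the classic RFC 6962 HTTP API. The log URL is a placeholder (pick a currently active log from a public log list), and a real monitor would also decode each entry to pull out the SANs, which is omitted here:

    import time
    import requests

    # Placeholder URL: substitute a currently active log from a public CT log list.
    LOG = "https://ct.example-log.org"

    def tree_size() -> int:
        # get-sth returns the signed tree head, which includes the current entry count.
        return requests.get(f"{LOG}/ct/v1/get-sth", timeout=10).json()["tree_size"]

    seen = tree_size()
    while True:
        time.sleep(30)
        now = tree_size()
        if now > seen:
            # Fetch the raw new entries; logs cap how many they return per request,
            # so a production monitor would page through this range in chunks.
            entries = requests.get(
                f"{LOG}/ct/v1/get-entries",
                params={"start": seen, "end": now - 1},
                timeout=30,
            ).json()["entries"]
            # Each entry's leaf_input/extra_data is base64-encoded DER; extracting
            # the domain names still requires parsing the MerkleTreeLeaf and the
            # (pre)certificate, which is left out of this sketch.
            print(f"{len(entries)} new entries (tree size {now})")
            seen = now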
What's the yawn for?
It implies that this is boring and not article/post-worthy (which I agree with).
Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting.
> It implies that this is boring and not article/post-worthy (which I agree with).
It's certainly news to me, and presumably some others, that this exists.
Which part is news?
If certificate transparency is new to you, I feel like there are significantly more interesting articles and conversations that could/should have been submitted instead of "A public log intended for consumption exists, and a company is consuming that log". This post would do literally nothing to enlighten you about CT logs.
If the fact that OpenAI is scraping certificate transparency logs is new and interesting to you, I'd love to know why it is interesting. Perhaps I'm missing something.
In my opinion, these are way more interesting reads for people unfamiliar with certificate transparency than this "OpenAI read my CT log" post:
https://googlechrome.github.io/CertificateTransparency/log_s...
https://certificate.transparency.dev/
> Certificate transparency logs are intended to be consumed by others. That is indeed what is happening. Not interesting
Oh, I read this as indicating OpenAI may make a move into the security space.
Even if it's just for their internal security initiatives it would make sense given how massive they are. Threat hunting via cert monitoring is very effective.
Because it's hardly news in its context.
Presumably this is well-known among people that already know about this.
P.S. In the hopes of making this more than just a sarcastic comment, the question of "How do people bootstrap knowledge?" is kind of interesting. [1]
> To tackle a hard problem, it is often wise to reuse and recombine existing knowledge. Such an ability to bootstrap enables us to grow rich mental concepts despite limited cognitive resources. Here we present a computational model of conceptual bootstrapping. This model uses a dynamic conceptual repertoire that can cache and later reuse elements of earlier insights in principled ways, modelling learning as a series of compositional generalizations. This model predicts systematically different learned concepts when the same evidence is processed in different orders, without any extra assumptions about previous beliefs or background knowledge. Across four behavioural experiments (total n = 570), we demonstrate strong curriculum-order and conceptual garden-pathing effects that closely resemble our model predictions and differ from those of alternative accounts. Taken together, this work offers a computational account of how past experiences shape future conceptual discoveries and showcases the importance of curriculum design in human inductive concept inferences.
[1]: https://www.nature.com/articles/s41562-023-01719-1
Everyone does it, it’s no big deal. “Yes officer I was speeding, so was everyone else!”
Gross.
The intended purpose of certificate transparency logs is to be viewed by others!
Perhaps you should save your "gross" judgement for when you better understand what's happening?
You are implying that a law is being broken, but isn't this the equivalent of going to city hall to pull public land records?
The whole point of the CT logs is to be a public list of all domains which have TLS certs issued by the Web PKI. People are reading this list. I really don't see what is either surprising or in any way problematic in doing so.
The whole point of CT logs is to make issuance of certificates in the public WebPKI… public.
I don't understand the outrage in some of the comments. The certificate transparency logs are literally meant to be read by absolutely whoever wants to read them. The clue is right in the name. It's transparency logs! Transparency!
I just don't understand how people with no clue whatsoever about what's going on feel so confident to express outrage over something they don't even understand! I don't mind someone not knowing something. Everybody has to learn things somewhere for the first time. But couldn't they just keep their outrage to themselves and take some time to educate themselves, to find out whether that outrage is actually well placed?
Some of the comments in the OP are also misinformed or illogical. But there's one guy there correcting them so that's good. I mean I'd say that https://en.wikipedia.org/wiki/Certificate_Transparency or literally any other post about CT is going to be far more informative than this OP!
People are just raging and want an outlet. They aren’t thinking logically.
It’s been going on forever (remember how companies were reading files off your computer aka cookies in 1999?)
This seems like a total non-issue, and I expect that any public files are scraped by OpenAI and tons of others. If I don't want something scraped, I don't make it public.
This could be OpenAI, or it could be another company using their header pattern.
It has long been common for scrapers to adopt the header patterns of search engine crawlers to hide in logs and bypass simple filters. The logical next step is for smaller AI players to present themselves as the largest players in the space.
Some search engines provide a list of their scraper IP ranges specifically so you can verify if scraper activity is really them or an imitator.
EDIT: Thanks to the comment below for looking this up and confirming this IP matches OpenAI’s range.
In this case it is actually OpenAI, the IP (74.7.175.182) is in one of their published ranges (74.7.175.128/25).
https://openai.com/searchbot.json
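A quick way to do that check yourself, sketched in Python (I'm assuming the JSON has a top-level "prefixes" array with ipv4Prefix/ipv6Prefix entries, similar to Google's crawler lists; adjust the keys if the real format differs):

    import ipaddress
    import requests

    # OpenAI's published OAI-SearchBot ranges.
    doc = requests.get("https://openai.com/searchbot.json", timeout=10).json()

    # Assumed shape: {"prefixes": [{"ipv4Prefix": "..."}, {"ipv6Prefix": "..."}]}
    networks = [
        ipaddress.ip_network(p[key])
        for p in doc.get("prefixes", [])
        for key in ("ipv4Prefix", "ipv6Prefix")
        if key in p
    ]

    def is_openai_searchbot(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    print(is_openai_searchbot("74.7.175.182"))  # the address from the logs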
I don't know if imitating a major crawler is really worth it. It may work against very naive filters, but it's easy to definitively check whether you're faking it, so it's just handing ammo to more advanced filters which do check.
I don't have a statistic here, but I'm always surprised how many websites I come across that do limited user-agent and origin/referrer checks, but don't maintain any kind of active IP based tracking. If you're trying to build a site-specific scraper and are getting blocked, mimicking headers is an easy and often sufficient step.
Thanks for looking it up!
If you want to somewhat avoid this, get a wildcard certificate (LE supports them: https://community.letsencrypt.org/t/acme-v2-production-envir...
Then all they know is the main domain, and you can somewhat hide in obscurity.
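If anyone wants to try this, here's a rough certbot invocation for a one-off wildcard (example.com is a placeholder; --manual means you paste the TXT record by hand, so for unattended renewals you'd use a DNS plugin or acme-dns instead):

    # Wildcards require the DNS-01 challenge; certbot will print a TXT record
    # to create under _acme-challenge.example.com before it finishes issuance.
    certbot certonly --manual --preferred-challenges dns \
        -d "example.com" -d "*.example.com"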
Yep, but this comes with a tradeoff: all of your services now have a valid key/cert for your whole domain, significantly increasing the blast radius if one service is compromised.
Correct, that's what I did with caddy, which is now periodically renewing my wildcard certificate through a DNS-01 challenge.
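A minimal Caddyfile sketch of that kind of setup (purely illustrative: it assumes a Caddy binary built with the Cloudflare DNS module via xcaddy and a CF_API_TOKEN environment variable; swap in whichever DNS provider module you actually use):

    # Wildcard site served with a cert obtained via the DNS-01 challenge.
    *.example.com, example.com {
        tls {
            dns cloudflare {env.CF_API_TOKEN}
        }
        respond "hello from behind a wildcard cert"
    }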
May I ask: does Caddy still update automatically through apt if you built a custom Caddy binary for the DNS provider plugin?

Also, which DNS provider did you go with? The GitHub issues pages for some of the DNS provider plugins seem to suggest that some are maintained more actively than others.
Unfortunately they are a bit more bothersome to automate (depending on your DNS provider/setup) because of the DNS-based validation requirement, typically handled via CNAME delegation.
Yep, but next year they intend to launch an alternative DNS challenge which doesn't require changing DNS records with every renewal. Instead you'll create a persistent TXT record containing a public key, and then any ACME client which has the private key can keep requesting new certs forever.
https://letsencrypt.org/2025/12/02/from-90-to-45#making-auto...
Oh, sweet! I didn't know about this. I have no need of wildcard certs, but this will greatly simplify the process of issuing certificates for internal services behind my local firewall. No need to maintain an acme-dns server; just configure the ACME client, set the DNS record and you're done? Very nice.
Great to hear, one less API key needed for the DNS records.
Also, you can use https://github.com/krtab/agnos if you don't have any API access.
I hadn't heard of Agnos before, interesting alternative to ACME-DNS.
Looking at the README, is the idea that the certificates get generated on the DNS server itself? Not by the ACME client on each machine that needs a certificate? That seems like a confusing design choice to me. How do you get the certificate back to the web server that actually needs it? Or is the idea that you'd have a single server which acts as both the DNS server and the web server?
If you are using a non-standard DNS provider that doesn’t have integration with certbot or cert-manager or whatever you are using, it is pretty easy to set up an acme-dns server to handle it
https://github.com/joohoi/acme-dns
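For anyone who hasn't seen it, the moving parts are small. A sketch of the flow, with placeholder hostnames (auth.example.org is your acme-dns instance, example.com the domain you want certs for):

    # 1. Register an account with the acme-dns server; the JSON response contains
    #    credentials plus a "fulldomain" to delegate to.
    curl -s -X POST https://auth.example.org/register

    # 2. Create a one-time CNAME in your real zone:
    #    _acme-challenge.example.com  CNAME  <fulldomain from the register response>

    # 3. Point your ACME client's acme-dns integration at those credentials; from
    #    then on, renewals only touch the acme-dns server, never your real zone.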
For many years now. The crawlers, scanners, and bots start hammering a website within a minute of a certificate being issued. Remember to get your garbage WCM installed and secured before installing the real certificate, as you have about a 15-second window before they're poking around for fresh WordPress installs. Granted, you people are all smart enough to have all that automated with a CI/CD pipeline, so that you just commit a single file with the domain name to a git repo and all that magic happens.
Is it still “scraping” when the purpose of these transparency logs is to be used for this purpose?
The ostensible purpose of the certificate transparency logs is to allow validation of a certificate you're looking at - I browse to https://poormathskills.com and want to figure out the details of when its cert was issued.
The (presumably) unintended, unexpected purpose of the logs is to provide public notification of a website coming online for scrapers, search engines, and script kiddies to attack it: I could register https://verylongrandomdomainnameyoucantguess7184058382940052... and unwisely expect it to be unguessable, but as it turns out OpenAI is going to scrape it seconds after the certificate is issued.
If you want to learn more about Certificate Transparency Logs, how to pull and search them, we just did a 3 part series about how we did this at CertKit: https://www.certkit.io/blog/searching-ct-logs
OpenAI is scraping everything that is publicly accessible. Everything.
Yet they provide the user agents and IP address ranges which they scrape from, and say they respect robots.txt
I run a web server and so see a lot of scrapers, but OpenAI is one of the ones that appear to respect the limits you set. A lot of (if not most) others don't even meet that ethical standard, so I wouldn't say "OpenAI scrapes everything they can access. Everything" without qualification; that doesn't seem to be true, at least not until someone puts a file behind a robots deny rule and finds that ChatGPT (or another of OpenAI's products) has knowledge of it.
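For anyone who'd rather opt out than test that, the documented mechanism is plain robots.txt. The user-agent tokens below are the ones OpenAI publishes (GPTBot for training, OAI-SearchBot for search); verify the current names in their docs before relying on them:

    User-agent: GPTBot
    Disallow: /

    User-agent: OAI-SearchBot
    Disallow: /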
So do Google, Microsoft/Bing, Yandex, etc. How else would they make sure their search/chatbot/q&a products are up to date?
Has anyone gone with wildcard certificates to avoid disclosing subdomains in certificate transparency logs?
Given these are trivially forged, presumably they aren't really using a Mac for scraping, right? Just to elicit a 'standard' end user response from the server?
> useragent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.3; robots.txt;
Right. Crawler user agent strings in general tend to include all sorts of legacy stuff for compatibility.
This actually is a well-behaved crawler user agent because it identifies itself at the end.
Yes, it is very common to change your user agent for web scraping, mainly because there are websites which will block you based on that alone.
The IP address this comes from is in an OpenAI search bot range:
> "ipv4Prefix": "74.7.175.128/25"
from https://openai.com/searchbot.json
They definitely do. Before this comment, CT logs – aside from DNS queries – were the only way to know about https://onion.basilikum.monster, and you have to send the hostname in the SNI, otherwise you get a different certificate back.
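You can see that SNI behavior for yourself with openssl (assuming OpenSSL 1.1.1+ for the -noservername flag):

    # Without SNI: the server hands back its default certificate.
    openssl s_client -connect onion.basilikum.monster:443 -noservername </dev/null 2>/dev/null \
        | openssl x509 -noout -subject

    # With the hostname in the SNI: the real certificate for this site.
    openssl s_client -connect onion.basilikum.monster:443 \
        -servername onion.basilikum.monster </dev/null 2>/dev/null \
        | openssl x509 -noout -subject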
Of course I get some bot traffic including the OpenAI bot – although I just trust the user agent there, I have not confirmed that the origin IP address of the requests actually belongs to OpenAI.
That's just the internet. I like the bots, especially the wp-admin or .env ones. It's fun watching them doing their thing like little ants.
I wonder if this can be used to contaminate OpenAI search indexes?
This has long been the case! I think their whole business model is based on scraping lol
Looking around at the comments, I have a bird's-eye view. People are quite skilled at jumping to conclusions or assuming their POV is the only one. Consider this simplified scenario to illustrate:

- X happened.
- Person P says "Ah, X happened."
- Person Q interprets this in a particular way and says "Stop saying X is BAD!"
- Person R, who already knows about X (and is indifferent to what others notice or might know or be interested in), says "(yawn)".
- Person S narrowly looks at Person R and says "Oh, so you think Repugnant-X is ok?"

What a train wreck. Such failure modes are incredibly common. And preventable.* What a waste of the collective hours of attention and thinking we are spending here that we could be using somewhere else.

See also: the difference between positive and normative; charitable interpretations; not jumping to conclusions; not yucking someone else's yum
* So preventable that I am questioning the wisdom of spending time with any communication technology that doesn't actively address these failures. There is no point at blaming individuals when such failures are a near statistical certainty.
I agree with your analysis but try not to agree with your conclusion, purely for my own mental hygiene: I believe one can retrain the pattern matching of one's brain for happier outcomes. If I let my brain judge this as a "failure" (judgment: "it is wrong"), I will either get sad about it (judgment: "... and I can't change it") or angry (judgment: "... and I can do something about it"). In cases such as this I prefer to accept it as is, so I try to rewrite my brain's rule to consider it a necessary part of life (judgment: "true/good/correct").
Ah, in case it didn't come across clearly, my conclusion isn't to blame the individuals. My assessment is to seek out better communication patterns, which is partly about "technology" and partly about culture (expectations). People could indeed learn not to act this way with a bit of subtle nudging, feedback, and mechanism design.
I'm also pretty content accepting the unpleasant parts of reality without spin or optimism. Sometimes the better choice is still crappy, after all ;) I think Oliver Burkeman makes a fun and thoughtful case in "The Antidote: Happiness for People Who Can't Stand Positive Thinking" https://www.goodreads.com/book/show/13721709-the-antidote
Let's prompt inject it
Your content is stolen for training the moment you put it up
If I give my content away for free, it can’t be stolen.
The point of putting up a public web site is so the public can view it (including OpenAI/google/etc).
If I don’t want people viewing it, then I don’t make it public.
Saying that things are stolen when they aren’t clouds the issue.
It is an _incredible_ stretch to frame certificate transparency logs as "content" in the creative sense.
The whole purpose of this data is to be consumed by 3rd-parties.
I don't see issue with OAI scraping public logs.
But what GP probably meant is that OAI definitely uses this log to get a list of new websites in order to scrape them later. This is a pretty standard way to use CT logs - you get a list of domains to scrape instead of relying solely on hyperlinks.
matt3210 clearly means that the content of the website (revealed by the CT log) is what is being stolen, not the data in the CT log
It would be funny if your content disappeared when it was stolen.
So? It’s public information and a somewhat easily consumable stream of websites to scrape, if my job was to scrape the entire internet I’d probably start there, too.
Yawn, I've seen this more than 1000 times.
Privacy doesn't exist in this world.
Of course it doesn't exist if you keep handing it away.