Show HN: Ocrbase – pdf → .md/.json document OCR and structured extraction API

(github.com)

64 points | by adammajcher 6 hours ago

21 comments

sync 3 hours ago
This is essentially a (vibe-coded?) wrapper around PaddleOCR: https://github.com/PaddlePaddle/PaddleOCR
The "guts" are here: https://github.com/majcheradam/ocrbase/blob/7706ef79493c47e8...
- M4R5H4LL an hour ago
  Most production software is wrappers around existing libraries. The relevant question is whether this wrapper adds operational or usability value, not whether it reimplements OCR. If there are architectural or reliability concerns, it’d be more useful to call those out directly.
  - tuwtuwtuwtuw 12 minutes ago
    Sure. The self-hosting guide tells me to enter my GitHub secret, in plain text, in an env file. But it doesn't tell me why I should do that.
    Do people actually store their secrets in plain text on the file system in production environments? Just seems a bit wild to me.
- Oras 3 hours ago
  Claude is included in the contributors, so the OP didn’t hide it
- Tiberium 2 hours ago
  At this point it feels like HN is becoming more like Reddit: most people upvote before actually checking the repo.
v3ss0n 4 hours ago
How is this better than Surya/Marker or kreuzberg? https://github.com/kreuzberg-dev/kreuzberg
- jadbox 4 hours ago
  Sounds like someone needs to run their own test cases and report back on which solution does a better job...
  - kspacewalk2 an hour ago
    Let me fire up Claude Code.
    - sixtyj an hour ago
      Let me fire up Tesseract.
      https://github.com/tesseract-ocr
      - Jimmc414 16 minutes ago
        I fought with Tesseract for quite a while. It's good if high accuracy doesn't matter. For transcribing a book from clean, consistent, non-skewed scans it's fine, and an LLM might even be able to clean up the output. But for legal or accounting data from hand-scanned documents, the error rate made it untenable. Even clean, scanned documents of the same category have all sorts of density and skew anomalies that get misinterpreted. You'll pull your hair out trying to account for edge cases and never get the results you need, even with numerous adjustments and model retraining on errors.
        Flash 2.5 or 3 with thinking gave the best results.
hersko 5 hours ago
I have a flow where I extract text from a PDF with pdf-parse and then feed that to an AI for data extraction. If that fails, I convert it to a PNG and send the image for data extraction. This works very well and would presumably be far cheaper, as I'm generally sending text to the model instead of relying on images. Isn't just sending the images for OCR significantly more expensive?
- unrahul an hour ago
  I have seen this flow in what people at some startups call "Agentic OCR". It's essentially a hard-coded control flow that tries pdf-parse or a similarly inexpensive approach first and, if the result fails a threshold, falls back to screenshot-to-text extraction.
- saaaaaam 4 hours ago
  There was an interesting discussion on here a couple of months back about images vs text, driven by this article: https://www.seangoedecke.com/text-tokens-as-image-tokens/
  Discussion is here: https://news.ycombinator.com/item?id=45652952
- trollbridge 4 hours ago
  I always render an image and OCR that, so I don’t get odd problems from invisible text. It also avoids being affected by anything added for SEO.
- mimim1mi 5 hours ago
  By definition, OCR means optical character recognition. Which extraction method can work depends on the contents of the PDF. Many PDFs are just scans of printed documents or handwritten notes. If machine-readable text is available, your approach is great.
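
The text-first, image-fallback flow discussed in this subthread can be sketched roughly as below. This is a minimal illustration of the threshold variant, not code from Ocrbase or from any commenter. The type names, the function names, and both threshold values are assumptions made up for the example; the thresholds are illustrative guesses, not measured values. The two extractors are injected as plain functions, so something like pdf-parse or a PNG-plus-vision-model step would be plugged in by the caller.

```typescript
// Hypothetical signatures: the caller supplies the real extractors
// (for example a pdf-parse wrapper and a render-to-PNG + vision model step).
type TextLayerExtractor = (pdf: Uint8Array) => Promise<{ text: string; pageCount: number }>;
type ImageExtractor = (pdf: Uint8Array) => Promise<string>;

interface QualityThresholds {
  minCharsPerPage: number; // below this, the pages are probably scans without a text layer
  minCleanRatio: number;   // below this, the text layer is probably garbled
}

// Illustrative defaults only; tune them against your own documents.
const DEFAULT_THRESHOLDS: QualityThresholds = { minCharsPerPage: 200, minCleanRatio: 0.9 };

// Control characters (except tab, LF, CR) and U+FFFD are treated as signs of a bad text layer.
const SUSPICIOUS_CHARS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F\uFFFD]/g;

function textLayerLooksUsable(
  text: string,
  pageCount: number,
  thresholds: QualityThresholds = DEFAULT_THRESHOLDS
): boolean {
  const trimmed = text.trim();
  if (pageCount <= 0 || trimmed.length === 0) return false;
  if (trimmed.length / pageCount < thresholds.minCharsPerPage) return false;
  const suspicious = (trimmed.match(SUSPICIOUS_CHARS) || []).length;
  return 1 - suspicious / trimmed.length >= thresholds.minCleanRatio;
}

async function extractWithFallback(
  pdf: Uint8Array,
  viaTextLayer: TextLayerExtractor,
  viaImage: ImageExtractor,
  thresholds: QualityThresholds = DEFAULT_THRESHOLDS
): Promise<{ source: "text-layer" | "image"; text: string }> {
  try {
    const { text, pageCount } = await viaTextLayer(pdf);
    if (textLayerLooksUsable(text, pageCount, thresholds)) {
      return { source: "text-layer", text };
    }
  } catch {
    // A parser error is treated the same as an unusable text layer.
  }
  // Cheap path failed: fall back to the more expensive image route.
  return { source: "image", text: await viaImage(pdf) };
}
```

The check here runs on the extracted text before any model call, which is the "fails a threshold" variant. Falling back only after the model's data extraction fails, as in the original flow, would move the decision one step later but keep the same shape. Note that this check does not catch the invisible-text problem raised above, since hidden text can look perfectly clean.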
sgc 5 hours ago
How does this compare to dots.ocr? I got fantastic results when I tested dots.
https://github.com/rednote-hilab/dots.ocr
- mjrpes 4 hours ago
  Ocrbase is CUDA-only while dots.ocr uses vLLM, so it should support ROCm/AMD cards?
  - actionfromafar 3 hours ago
    How about CPU?
cess11 17 minutes ago
Why is 12GB+ VRAM a requirement? The OCR model looks kind of small, https://huggingface.co/PaddlePaddle/PaddleOCR-VL/tree/main, so I'm assuming the VRAM is needed for some processing afterwards.
constantinum 3 hours ago
What matters most is how well OCR and structured data extraction tools handle documents with high variation at production scale. In real workflows like accounting, every invoice, purchase order, or contract can look different. The extraction system must still work reliably across these variations with minimal ongoing tweaks.
Equally important is how easily you can build a human-in-the-loop review layer on top of the tool. This is needed not only to improve accuracy, but also for compliance—especially in regulated industries like insurance.
Other tools in this space:
LLMWhisperer/Unstract (AGPL)
Reducto
Extend AI
LlamaParse
Docling
mechazawa 5 hours ago
Is only Bun supported, or also regular Node?