HOW is there not a better, almost 100% accurate OCR tool?
I routinely (daily) need to OCR PDF files. The PDF files are not scans. They are PDF files created from a Word file. The text is 100% clear, the lines are 100% straight, and the type is 100% uniform.
And yet, Microsoft's and Google's OCR tools spit out gibberish that is full of critical errors.
From a problem-solving perspective, this seems like an incredibly easy problem to solve in this exact use case: PDFs generated from text files. Identify a uniform font size (to prevent o-to-O and o-to-0 errors), identify a font family (serif vs. sans-serif, then narrow to particular fonts), and OCR the damn thing. And yet, the output is useless in my field.
The main advantage of ABBYY is that, in my opinion, it is the best consumer-level OCR package, and it does a pretty good job of OCR and conversion to Excel. Here's a GitHub repo that demonstrates some results:
But to reemphasize, the above repo demonstrates ABBYY maintaining table structure with PDFs that are scanned images, which is considerably harder than the situation you're in.
As much as your response addresses GP's particular problem (OCR being the wrong tool for PDF-to-text), I 100% agree with the extreme annoyance it expresses about the state of free OCR.
In principle, text-PDF-to-text is just a matter of parsing the PDF (and/or Word) format and extracting the text embedded in the file. (I know it's a lot of work, but still.)
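To make that concrete: for PDFs that actually contain a text layer, here's a minimal sketch of the no-OCR path, assuming the pypdf library (any PDF library with text extraction would do; the filename is a placeholder):

    # Minimal sketch, assuming pypdf is installed and the PDF has a
    # text layer. "report.pdf" is a placeholder filename.
    from pypdf import PdfReader

    reader = PdfReader("report.pdf")
    for page in reader.pages:
        # extract_text() pulls characters straight from the content
        # streams; no OCR involved.
        print(page.extract_text())

If that prints nothing, the PDF is image-only and extraction alone won't help.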
Even if you set aside what GP said about the sources being text PDFs, and assume all the sources are PNG images: as long as those PNGs were generated from text documents (Word, PDF, etc.) without any scanner or camera involved, it is unacceptable that today's free OCR tools don't get the job done. By 2016, machine learning had produced systems that surpassed human accuracy on much harder tasks like object detection and speech recognition.
I know it's not an unsolved problem. It's just a matter of some knowledgeable machine-learning researcher taking a break from working on the cutting edge for a few months and putting together a package that gets the image-to-text job done. Once such a base tool is available on GitHub, the community will take over and add features and fix bugs as needed. (I'm extremely busy with my own degree work ATM, otherwise I would probably do something like that.)
EDIT 1: As for Tesseract, I hate it with the passion of a thousand fiery suns. It's a kludge, a black box of traditional-programming karate chops and overly complicated bloat that spits out text the way it likes, and there is, largely, nothing you can do about it. Compared to machine learning and modern computer vision, Tesseract belongs to the dark ages. If there is going to be a quality OCR tool, it has to be written from scratch, based on deep learning from the ground up.
There's a brute-force solution to the "extract text from a 'digital-native' image" problem that you can write in an afternoon (a rough sketch in code follows the list):
1. Use an existing OCR library to give you the positions of the words, plus a first-cut guess of their content.
2. Take the first word from the OCRed guess, and loop through a set of {font, size, leading} tuples, rendering out the same word at that {font, size, leading} and overlaying it on the image, and measuring error-distance.
3. If your best match isn't within some minimum error-distance, then assume that the OCR misrecognized the first word, and try again with the second, third, etc.
Once you've got a font-settings match:
4. render the rest of the words onto their respective detected bounding boxes;
5. notice which words have a higher error-distance than the rest;
6. for each word, generate candidate mutations of the word (e.g. everything at a Levenshtein distance of 1 from the OCRed guess), pick the one that lowers the error-distance, and repeat until the distance for that word won't go any lower.
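Here's a minimal sketch of steps 2 and 6, assuming Pillow for rendering and some upstream OCR pass that already produced word strings plus bounding boxes. The font paths are placeholders, the error metric is crude, and the mutation set covers substitutions only (leading, baseline alignment, insertions, and deletions are omitted for brevity), so treat it as a shape of the idea, not the full scheme:

    # Rough sketch of steps 2 and 6. Assumes Pillow is installed and
    # that upstream OCR supplied (word, bounding_box) pairs.
    import string
    from PIL import Image, ImageDraw, ImageFont

    CANDIDATE_FONTS = ["DejaVuSerif.ttf", "DejaVuSans.ttf"]  # assumed installed
    CANDIDATE_SIZES = range(8, 25)

    def render_word(word, font_path, size, box_size):
        # Render the word black-on-white into a canvas the size of the crop.
        img = Image.new("L", box_size, 255)
        ImageDraw.Draw(img).text((0, 0), word, fill=0,
                                 font=ImageFont.truetype(font_path, size))
        return img

    def error_distance(rendered, crop):
        # Mean absolute per-pixel difference; crude, but good enough to rank.
        a = rendered.tobytes()
        b = crop.convert("L").resize(rendered.size).tobytes()
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    def match_font_settings(page, word, box):
        # Step 2: exhaustively try {font, size} pairs against one word's crop.
        crop = page.crop(box)
        return min(((f, s) for f in CANDIDATE_FONTS for s in CANDIDATE_SIZES),
                   key=lambda fs: error_distance(
                       render_word(word, fs[0], fs[1], crop.size), crop))

    def mutations(word):
        # Step 6: candidates at Levenshtein distance 1 (substitutions only).
        for i in range(len(word)):
            for c in string.ascii_letters + string.digits:
                if c != word[i]:
                    yield word[:i] + c + word[i + 1:]

    def hill_climb(page, word, box, font, size):
        # Step 6 continued: accept any mutation that lowers the
        # error-distance, and stop when none does.
        crop = page.crop(box)
        best = word
        best_err = error_distance(render_word(best, font, size, crop.size), crop)
        improved = True
        while improved:
            improved = False
            for cand in mutations(best):
                err = error_distance(render_word(cand, font, size, crop.size), crop)
                if err < best_err:
                    best, best_err, improved = cand, err, True
        return best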
Actually, come to think of it, it'd be a lot easier to detect and unify "identical" sub-regions of the image first (using e.g. https://en.wikipedia.org/wiki/JBIG2 on a lossless setting). Then you could, in parallel to the above, also try to do frequency-analysis to discover which of your image "tiles" would likely form a basic "alphabet" of character-glyphs—and then hill-climb toward aligning that "alphabet" by attempting to produce the most runs of character-glyphs that translate to known dictionary words in whatever language the OCR thinks the text is in.
The font-matching would still be necessary, though, for the rest of the image samples that don't fall into the easily-frequency-analyzed part. (And for languages that aren't alphabetic, like Chinese, where there are no super-common character-glyphs.)
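For the frequency-analysis half, here's a hedged sketch of the tile-extraction step, using connected components as a crude stand-in for real JBIG2 symbol matching (exact bitmap keys are only plausible because the image is digital-native; actual JBIG2 matching is fuzzy). Assumes numpy and scipy:

    # Sketch of the "alphabet of tiles" idea: label connected ink blobs,
    # key each blob by its exact bitmap, and count repeats. Caveats:
    # touching letters merge into one blob, and dotted letters like
    # "i" split into two, so this is only a first cut.
    import numpy as np
    from scipy import ndimage

    def glyph_frequencies(page):
        # page: 2-D uint8 array, 0 = ink, 255 = paper
        ink = page < 128
        labels, n = ndimage.label(ink)
        counts = {}
        for i, sl in enumerate(ndimage.find_objects(labels)):
            tile = labels[sl] == i + 1            # mask out neighboring blobs
            key = tile.tobytes() + str(tile.shape).encode()
            counts[key] = counts.get(key, 0) + 1
        # The most frequent tiles are candidates for the glyph "alphabet".
        return counts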
A partner and I came up with a similar solution. It hinged on detecting the typeface and then using a bitmapped (or otherwise rendered) font package to OCR letter by letter.
The PDF files that we are dealing with do not have embedded text and are not searchable, but are "digital-native," to use the term that you suggested.
Does this not exist? If not, why does it not exist?!
Right. It's an image PDF generated from a text file, so there are no digital-to-analog-to-digital errors introduced. These files should be perfect OCR candidates, but everything that I've found is full of errors, missing portions of sentences, rearranged fragments, etc.
Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.
I don't even care about perfect formatting; that's easy to fix. I do care about perfect OCR. That's crucial.
Debian has a command-line tool, 'pdftotext', which extracts the text from a PDF. It is not OCR; it pulls the characters from the file itself. It's in the package called poppler-utils.
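A quick way to test whether a given PDF has an extractable text layer, assuming poppler-utils is installed (the filenames are placeholders; -layout just asks pdftotext to preserve the physical layout):

    # Calls the pdftotext CLI; if out.txt comes back empty, the PDF has
    # no text layer and OCR really is needed.
    import subprocess
    subprocess.run(["pdftotext", "-layout", "input.pdf", "out.txt"], check=True)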
Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF at full resolution, there are no artifacts, blurry characters, or alignment or uneven-scale issues of the kind that make OCRing a scan or photograph troublesome. It looks exactly like a Word document, but without selectable or editable text.
There are a couple of ways a PDF document could contain actual text that is nevertheless not selectable or searchable. One is that the originator could have protected the document; another (more common) cause is that the originator didn't embed the proper font maps when exporting the document. I see the latter a lot with documents produced from LaTeX originals.
As the parent mentioned, pdftotext can often extract text from such documents without the need for OCR. (Although if the document contains ligatures, those sometimes don't get converted.)
Maybe not. The PDF probably has embedded text (since it doesn't blur when zooming in), but it could be either converted into vector curves or protected from copying (check the document properties). The easiest fix is to change the PDF export settings in Word/Ghostscript/Distiller.