HOW is there not a better, almost 100% accurate OCR tool?
I routinely (daily) need to OCR PDF files. The PDF files are not scans. They are PDF files created from a Word file. The text is 100% clear, the lines are 100% straight, and the type is 100% uniform.
And yet, Microsoft's and Google's OCR tools spit out gibberish that is full of critical errors.
From a problem-solving perspective, this seems like an incredibly easy problem to solve in this exact use case: PDFs generated from text files. Identify a uniform font size (to prevent o-to-O and o-to-0 errors), identify a font family (serif vs. sans-serif, then narrow to particular fonts), and OCR the damn thing. And yet, the output is useless in my field.
The main advantage of ABBYY is that, in my opinion, it is the best consumer-level OCR package, and it does a pretty good job of OCR and conversion to Excel. Here's a GitHub repo that demonstrates some results:
But to reemphasize, the above repo demonstrates ABBYY maintaining table structure with PDFs that are scanned images, which is considerably harder than the situation you're in.
As much as your response addresses GP's particular problem (OCR being the wrong tool for PDF-to-text), I 100% agree with the extreme annoyance it expresses about the state of free OCR.
In principle, text-PDF-to-text is just a matter of parsing the PDF (and/or Word) format and extracting the text embedded in the file. (I know it's a lot of work, but still.)
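To make that concrete: for PDFs that actually contain a text layer, here's a minimal sketch of the no-OCR path, assuming the pypdf library (any PDF library with text extraction would do; the filename is a placeholder):

    # Minimal sketch, assuming pypdf is installed and the PDF has a
    # text layer. "report.pdf" is a placeholder filename.
    from pypdf import PdfReader

    reader = PdfReader("report.pdf")
    for page in reader.pages:
        # extract_text() pulls characters straight from the content
        # streams; no OCR involved.
        print(page.extract_text())

If that prints nothing, the PDF is image-only and extraction alone won't help.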
Even if you set aside what GP said about the sources being text PDFs, and assume all the sources are PNG images: as long as those PNGs were generated from text documents (Word, PDF, etc.) without any scanner or camera involved, it is unacceptable that today's free OCR tools don't get the job done. By 2016, machine learning had produced systems that surpassed human accuracy on much harder tasks like object detection and speech recognition.
I know it's not an unsolved problem. It's just a matter of some knowledgeable machine-learning researcher taking a break from working on the cutting edge for a few months and putting together a package that gets the image-to-text job done. Once such a base tool is available on GitHub, the community will take over and add features and fix bugs as needed. (I'm extremely busy with my own degree work ATM, otherwise I would probably do something like that.)
EDIT 1: As for Tesseract, I hate it with the passion of a thousand fiery suns. It's a kludge, a black box of traditional-programming karate chops and overly complicated bloat that spits out text the way it likes, and there is, largely, nothing you can do about it. Compared to machine learning and modern computer vision, Tesseract belongs to the dark ages. If there is going to be a quality OCR tool, it has to be written from scratch, based on deep learning from the ground up.
There's a brute-force solution to the "extract text from a 'digital-native' image" problem that you can write in an afternoon (a rough sketch in code follows the list):
1. Use an existing OCR library to give you the positions of the words, plus a first-cut guess of their content.
2. Take the first word from the OCRed guess, and loop through a set of {font, size, leading} tuples, rendering out the same word at that {font, size, leading} and overlaying it on the image, and measuring error-distance.
3. If your best match isn't within some minimum error-distance, then assume that the OCR misrecognized the first word, and try again with the second, third, etc.
Once you've got a font-settings match:
4. render the rest of the words onto their respective detected bounding boxes;
5. notice which words have a higher error-distance than the rest;
6. for each word, generate candidate mutations of the word (e.g. everything at a Levenshtein distance of 1 from the OCRed guess), pick the one that lowers the error-distance, and repeat until the distance for that word won't go any lower.
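Here's a minimal sketch of steps 2 and 6, assuming Pillow for rendering and some upstream OCR pass that already produced word strings plus bounding boxes. The font paths are placeholders, the error metric is crude, and the mutation set covers substitutions only (leading, baseline alignment, insertions, and deletions are omitted for brevity), so treat it as a shape of the idea, not the full scheme:

    # Rough sketch of steps 2 and 6. Assumes Pillow is installed and
    # that upstream OCR supplied (word, bounding_box) pairs.
    import string
    from PIL import Image, ImageDraw, ImageFont

    CANDIDATE_FONTS = ["DejaVuSerif.ttf", "DejaVuSans.ttf"]  # assumed installed
    CANDIDATE_SIZES = range(8, 25)

    def render_word(word, font_path, size, box_size):
        # Render the word black-on-white into a canvas the size of the crop.
        img = Image.new("L", box_size, 255)
        ImageDraw.Draw(img).text((0, 0), word, fill=0,
                                 font=ImageFont.truetype(font_path, size))
        return img

    def error_distance(rendered, crop):
        # Mean absolute per-pixel difference; crude, but good enough to rank.
        a = rendered.tobytes()
        b = crop.convert("L").resize(rendered.size).tobytes()
        return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

    def match_font_settings(page, word, box):
        # Step 2: exhaustively try {font, size} pairs against one word's crop.
        crop = page.crop(box)
        return min(((f, s) for f in CANDIDATE_FONTS for s in CANDIDATE_SIZES),
                   key=lambda fs: error_distance(
                       render_word(word, fs[0], fs[1], crop.size), crop))

    def mutations(word):
        # Step 6: candidates at Levenshtein distance 1 (substitutions only).
        for i in range(len(word)):
            for c in string.ascii_letters + string.digits:
                if c != word[i]:
                    yield word[:i] + c + word[i + 1:]

    def hill_climb(page, word, box, font, size):
        # Step 6 continued: accept any mutation that lowers the
        # error-distance, and stop when none does.
        crop = page.crop(box)
        best = word
        best_err = error_distance(render_word(best, font, size, crop.size), crop)
        improved = True
        while improved:
            improved = False
            for cand in mutations(best):
                err = error_distance(render_word(cand, font, size, crop.size), crop)
                if err < best_err:
                    best, best_err, improved = cand, err, True
        return best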
Actually, come to think of it, it'd be a lot easier to detect and unify "identical" sub-regions of the image first (using e.g. https://en.wikipedia.org/wiki/JBIG2 on a lossless setting). Then you could, in parallel to the above, also try to do frequency-analysis to discover which of your image "tiles" would likely form a basic "alphabet" of character-glyphs—and then hill-climb toward aligning that "alphabet" by attempting to produce the most runs of character-glyphs that translate to known dictionary words in whatever language the OCR thinks the text is in.
The font-matching would still be necessary, though, for the rest of the image samples that don't fall into the easily-frequency-analyzed part. (And for languages that aren't alphabetic, like Chinese, where there are no super-common character-glyphs.)
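For the frequency-analysis half, here's a hedged sketch of the tile-extraction step, using connected components as a crude stand-in for real JBIG2 symbol matching (exact bitmap keys are only plausible because the image is digital-native; actual JBIG2 matching is fuzzy). Assumes numpy and scipy:

    # Sketch of the "alphabet of tiles" idea: label connected ink blobs,
    # key each blob by its exact bitmap, and count repeats. Caveats:
    # touching letters merge into one blob, and dotted letters like
    # "i" split into two, so this is only a first cut.
    import numpy as np
    from scipy import ndimage

    def glyph_frequencies(page):
        # page: 2-D uint8 array, 0 = ink, 255 = paper
        ink = page < 128
        labels, n = ndimage.label(ink)
        counts = {}
        for i, sl in enumerate(ndimage.find_objects(labels)):
            tile = labels[sl] == i + 1            # mask out neighboring blobs
            key = tile.tobytes() + str(tile.shape).encode()
            counts[key] = counts.get(key, 0) + 1
        # The most frequent tiles are candidates for the glyph "alphabet".
        return counts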
A partner and I came up with a similar solution. It hinged on detecting the typeface and then using a bitmapped (or otherwise rendered) font package to OCR letter by letter.
The PDF files that we are dealing with do not have embedded text and are not searchable, but are "digital-native," to use the term that you suggested.
Does this not exist? If not, why does it not exist?!
Right. It's an image PDF generated from a text file, so there are no digital-to-analog-to-digital errors introduced. These files should be perfect OCR candidates, but everything that I've found is full of errors, missing portions of sentences, rearranged fragments, etc.
Apologies. The PDFs that we deal with are digital-native, but do not have embedded text and are not searchable. I simply want to OCR the PDF and spit the text into a Word/text file.
I don't even care about perfect formatting; that's easy to fix. I do care about perfect OCR. That's crucial.
Debian has a command-line tool, 'pdftotext', which extracts the text from a PDF. It is not OCR; it pulls the characters from the file itself. It's in the package called poppler-utils.
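A quick way to test whether a given PDF has an extractable text layer, assuming poppler-utils is installed (the filenames are placeholders; -layout just asks pdftotext to preserve the physical layout):

    # Calls the pdftotext CLI; if out.txt comes back empty, the PDF has
    # no text layer and OCR really is needed.
    import subprocess
    subprocess.run(["pdftotext", "-layout", "input.pdf", "out.txt"], check=True)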
Apologies for not being clear in my OP. The PDF is a digital-native image produced from a text document, but without embedded or searchable text. Looking at the PDF at full resolution, there are no artifacts, blurry characters, or alignment or uneven-scale issues of the kind that make OCRing a scan or photograph troublesome. It looks exactly like a Word document, but without selectable or editable text.
There are a couple of ways a PDF document could contain actual text that is nevertheless not selectable or searchable. One is that the originator could have protected the document; another (more common) cause is that the originator didn't embed the proper font maps when exporting the document. I see the latter a lot with documents produced from LaTeX originals.
As the parent mentioned, pdftotext can often extract text from such documents without the need for OCR. (Although if the document contains ligatures, those sometimes don't get converted.)
Maybe not. The PDF probably has embedded text (since it doesn't blur when zooming in), but it could be either converted into vector curves or protected from copying (check the document properties). The easiest fix is to change the PDF export settings in Word/Ghostscript/Distiller.