Secrets of the paperless office: optimizing OCR

Since I started using a document scanner about seven years ago, I’ve scanned many thousands of pages and used OCR (optical character recognition) software to convert those scans into searchable PDFs. I’ve also written extensively about the paperless office. But when you try to reduce the amount of paper you use, you inevitably increase the amount of hard-drive space you use. I began to wonder what combinations of scanner settings and software would get the best quality scan results while using the least hard-disk space.

What sparked my investigation was a claim that some OCR apps increase the file sizes of scanned images dramatically, whereas others (Acrobat Pro in particular) shrink them. When you plan to store and read scanned documents on an iOS device, compactness is especially important. Unfortunately, Adobe’s $499 Acrobat Pro XI can no longer be driven externally by AppleScript, which means it requires tedious manual clicking to perform OCR. Were other OCR apps really inflating file sizes, and was there any way around this problem without resorting to Acrobat?

Hundreds of experiments later, I came up with some surprising results. Read on for all the details or skip to the “So, where’s the sweet spot?” section for the bottom line.

The ins and outs of OCR

When you initially save a scanned document as a PDF file, you get nothing more than a bitmapped image in a PDF wrapper. Your scanner’s software most likely has settings to determine the resolution of the scans in dpi (dots per inch), the color mode (black and white, grayscale, or color), and the amount of compression applied to the scanned image. All those settings affect not only the appearance of the scan but also the quality of information the OCR engine has to work with. Once OCR software recognizes the text in a PDF, it saves that text in an invisible layer along with the image so you can see what the document originally looked like, but can also search, select, and copy its text.
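To get a feel for how much resolution and color mode matter before any compression is applied, here is a rough back-of-the-envelope sketch. It is not taken from any scanner's software: the function name raw_page_bytes is hypothetical, the page is assumed to be US Letter (8.5 by 11 inches), and the bit depths (1 bit for black and white, 8 for grayscale, 24 for color) are common conventions that a given scanner may not follow exactly.

```python
# Rough estimate of the uncompressed bitmap size of one scanned page.
# Assumptions (not from the article): US Letter page, and typical bit
# depths of 1 (black and white), 8 (grayscale), and 24 (color).
BITS_PER_PIXEL = {"bw": 1, "gray": 8, "color": 24}


def raw_page_bytes(dpi, mode, width_in=8.5, height_in=11.0):
    """Return the approximate uncompressed size of one page in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * BITS_PER_PIXEL[mode]
    # Round up to whole bytes.
    return (total_bits + 7) // 8


if __name__ == "__main__":
    for dpi in (150, 300, 600):
        for mode in ("bw", "gray", "color"):
            size_mb = raw_page_bytes(dpi, mode) / 1_000_000
            print(f"{dpi} dpi, {mode}: {size_mb:.1f} MB")
```

Under these assumptions, doubling the resolution quadruples the raw size, and moving from black and white to color multiplies it by 24. The files you actually end up with will be much smaller, because the scanner software compresses the image, and the invisible text layer added by OCR is typically tiny by comparison.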
