Apple Mac OS X tips and tweaks
Date: 9th June 2011
Turn paper documents into editable files with OCR
Spotlight is a fantastic utility for finding things on the Apple Mac. Click the magnifying glass icon in the top right corner of the screen and just type in a few characters. As you type, it instantly displays search results, and sometimes it comes up with the app, document or file you are looking for before you have even finished entering the whole search term. If only searching for paper documents were so easy. Well, it can be if you scan them into your Mac.
Using OCR (Optical Character Recognition) software you can extract the text from scanned paper documents and then Spotlight will index it, so you can instantly find any document. OCR software will also enable you to recreate the paper document on the Mac. This enables you to copy and paste the text into another application, document or report without having to type it in, or to edit the original document and then reprint it on paper. There are lots of uses for OCR, so let's see how it is done on the Mac.
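Once the extracted text is on disk, the same Spotlight index can also be searched from a script. The following is a minimal Python sketch, not part of any OCR app mentioned here. It assumes the OCR output has been saved in a folder that Spotlight indexes, and it calls mdfind, the command line front end to Spotlight that ships with Mac OS X. The function names are made up for illustration.

```python
import subprocess


def build_spotlight_command(query, folder=None):
    """Build the argument list for mdfind, Spotlight's command line tool."""
    command = ["mdfind"]
    if folder:
        # -onlyin restricts the search to one folder and its subfolders.
        command += ["-onlyin", folder]
    command.append(query)
    return command


def search_spotlight(query, folder=None):
    """Run the search and return the matching file paths (Mac only)."""
    result = subprocess.run(
        build_spotlight_command(query, folder),
        capture_output=True,
        text=True,
        check=True,
    )
    # mdfind prints one matching path per line.
    return [line for line in result.stdout.splitlines() if line]
```

On a Mac, calling search_spotlight("gas bill", folder="/Users/me/Scans") would return the paths of indexed files under that (hypothetical) folder that match the phrase, much as typing it into the Spotlight menu would.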
You need some means of getting the paper document into the Mac and there are a few ways to do this. The best and easiest method is to use a scanner. This device plugs into the USB port on the Mac and enables you to turn printed material into images saved to disk. Instead of a dedicated scanner, though, you could use one of those multifunction devices that are so popular these days. They scan, print, and photocopy. What's more, they are very cheap.
Many OCR applications allow you to load images and these may be scans, but they could come from other sources. For example, it is possible to photograph documents, magazines and other printed material with a digital camera and then load the images into the OCR app. You can even take a screenshot and load it into an OCR program to turn it into text.
You don't even need any hardware. Some PDFs contain images of pages rather than editable, indexable text, and OCR apps can often load these PDFs and extract the text.
Free OCR apps
There aren't many free OCR applications available for the Mac and some are awkward command line tools or are very primitive. One that is worth trying, though, is PDF OCR X. The PDF in the name is there because it can convert the image PDFs created by the scan-to-PDF function of many scanners and scanning applications into text. It can't control a scanner and you can't directly scan documents into it, but you can use the software bundled with the scanner to scan documents and save the images to disk. These images can then be loaded into the program and turned into an editable text file using OCR.
There are two versions of PDF OCR X, one of which is a free community version that converts a single page at a time. It's not useful for converting multipage PDFs, but you can load any number of scans and convert them a page at a time.
Drop a .tiff file (a scanned document) on it and it creates a PDF containing the image and a text file containing the text. You can view the document scan in Preview and load the text into TextEdit. TextEdit will then spellcheck it and highlight any mistakes made by the OCR software, which you can easily correct. The accuracy is quite good, but it depends on how good the scan is and the fonts used.
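The idea of using a spellcheck to catch OCR mistakes can be sketched in a few lines of Python. This is only an illustration of the principle and not how TextEdit's spellchecker works: it assumes a tiny made-up word list in place of a real dictionary, and it simply flags any word that is not in that list.

```python
import re


def flag_suspect_words(text, known_words):
    """Return words not found in known_words, in order of first appearance."""
    suspects = []
    seen = set()
    for word in re.findall(r"[A-Za-z0-9']+", text):
        lowered = word.lower()
        if lowered not in known_words and lowered not in seen:
            seen.add(lowered)
            suspects.append(word)
    return suspects


# Tiny stand-in dictionary; a real check would use a full word list.
KNOWN = {"the", "invoice", "total", "is", "due", "on", "monday"}

# Typical OCR slips: the digit 1 read for the letter l, 0 read for o.
sample = "The invoice tota1 is due on M0nday"
print(flag_suspect_words(sample, KNOWN))  # ['tota1', 'M0nday']
```

Words flagged this way still need a human to decide what the correct text should be, just as with the highlighted words in TextEdit.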
Other OCR apps
OmniPage Pro X for Macintosh is expensive at $499 and is aimed at businesses. It is good, but it is a shame there isn't a cheaper version for personal use as there is with the Windows version.
Abbyy FineReader Express is much more affordable for personal use and it copes with complex document layouts, such as multicolumn text with images embedded within it. It is straightforward to use and it produces good results.
You end up with a document that contains all the text of the scanned image and the accuracy is excellent. There is little work to do afterwards to produce a finished document you can edit. There is a free trial version if you want to try it out before you buy it. It is available on the Mac App Store too, so you can see screenshots and read user comments and ratings.
If you want an OCR program, but have a limited budget then take a look at VelOCRaptor. This app reads image files and performs OCR on them to extract the text. It then saves the image and the text in a PDF file. The resulting PDF can be used to view the original document and all the text is there too, so you can search it, copy it and so on. VelOCRaptor uses the Google-sponsored OCRopus OCR engine.
Paperless is more of a document manager that enables you to do away with paper altogether. You can scan all your documents and manage them on the computer. It has OCR capabilities built in, which extract the text from printed material and help with the document management.