Got a PDF document that you want to extract all the text out of? What about image files of a scanned document that you want to convert to editable text? These are some of the most common problems I’ve run into at work when dealing with files.
In this article, I will talk about some of the different things you can do when trying to extract text from a PDF or from an image. Your extraction results will vary depending on the type and quality of the text in the PDF or image. Also, your results will vary depending on the tool you use, so it’s best to try out as many of the options below as possible to get the best results.
Extract text from images or PDF
The simplest and fastest way to get started is to try an online PDF text extraction service. They are usually free and can give you exactly what you are looking for without having to install anything on your computer. Here are two that I’ve used with very good to excellent results:
ExtractPDF
ExtractPDF is a free tool to get images, text and fonts out of a PDF file. The only limitation is that the maximum size for a PDF file is 10 MB. That’s a little small, so if you have a larger file, try one of the other methods below. Select your file and then click the Send files button. Results are usually very quick, and you’ll see a preview of the text when you click the Text tab.
It’s a nice added benefit that it also extracts images out of PDFs, just in case you need them! Overall, the online tool works great, but I’ve run into a few PDF documents that give me odd results. The text is extracted fine, but for some reason it gets a line break after every word! Not a big deal for a short PDF, but definitely a problem for files with a lot of text. If that happens to you, try the next tool.
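If you already have a text file with a line break after every word and would rather repair it than convert again, a few lines of Python can rejoin it. This is only a minimal sketch, not something ExtractPDF provides: the function name `rejoin_words` is my own, and it assumes that a blank line marks a real paragraph break while every other line break is unwanted.

```python
import re


def rejoin_words(text: str) -> str:
    """Rejoin text that was split with a line break after every word.

    Assumption: blank lines separate paragraphs; any other line break
    is treated as unwanted and replaced with a single space.
    """
    # Split on blank lines so real paragraph breaks are preserved.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Inside each paragraph, collapse all whitespace runs into one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)


if __name__ == "__main__":
    broken = "The\ntext\nis\nextracted\nfine\n\nNext\nparagraph"
    print(rejoin_words(broken))
```

If the extracted text has no blank lines at all, everything ends up in one long paragraph, so check the result by eye before relying on it.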
Online OCR
Online OCR tends to work for documents that aren’t properly converted with ExtractPDF, so you should try both services to see which gives you better results. Online OCR also has some nicer features that can prove handy if you have a large PDF file and only need to convert the text on a few pages, not the entire document.
The first thing you want to do is go ahead and create a free account. It’s a bit annoying, but if you don’t create a free account, it will only convert part of your PDF, not the entire document. Plus, instead of just being able to upload a single 5 MB document, you can upload up to 100 MB per file using the account.
First, select a language and then choose the type of output format you want for the converted file. You have several options and you can choose more than one if you like. Below Multi-page document you can choose Number of pages and then select only the pages that you want to convert. Then you select the file and click Convert!
After converting, you will be taken to the Documents section (if you are logged in), where you can see how many free pages you have left and links to download your converted files. It looks like you only get 25 free pages per day, so if you need more than that, you’ll have to wait a bit or buy more.
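With several upload limits in play (10 MB for ExtractPDF, 5 MB for Online OCR without an account, 100 MB with a free account), it can help to check a file’s size before you try to upload it. The sketch below simply encodes the limits mentioned in this article; the names `LIMITS` and `services_accepting` are made up for illustration, and I’m assuming a megabyte means 1024 × 1024 bytes, which the sites may count differently.

```python
import os

MB = 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes

# Upload limits as described in this article; they may change over time.
LIMITS = {
    "ExtractPDF": 10 * MB,
    "Online OCR (no account)": 5 * MB,
    "Online OCR (free account)": 100 * MB,
}


def services_accepting(size_bytes: int) -> list[str]:
    """Return the services whose upload limit fits a file of this size."""
    return [name for name, limit in LIMITS.items() if size_bytes <= limit]


if __name__ == "__main__":
    path = "document.pdf"  # hypothetical file name
    if os.path.exists(path):
        print(services_accepting(os.path.getsize(path)))
```

An empty list means the file is too big for all three options, and a locally installed program is the way to go.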
Online OCR did an excellent job at converting my PDFs, as it was able to maintain the actual layout of the text. In my testing, I took a Word document that used bullets, different font sizes, etc. and converted it to a PDF. Then I used Online OCR to convert it back to Word format, and the result was about 95% like the original. That was pretty impressive to me.
Also, if you are looking to convert an image to text, Online OCR makes that just as easy as extracting text from PDF files.
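My 95% figure is an eyeball estimate, not a measurement. If you want a rough number for your own files, one option is to save both the original and the converted document as plain text and compare them word by word with Python’s standard `difflib` module. This is only a sketch under that assumption: the `similarity` function name is mine, and it compares the words only, so it says nothing about bullets, fonts or layout.

```python
from difflib import SequenceMatcher


def similarity(original: str, converted: str) -> float:
    """Return a 0.0 to 1.0 score for how closely two texts match, word by word."""
    original_words = original.split()
    converted_words = converted.split()
    # autojunk=False stops difflib from ignoring very common words
    # in long documents, which would skew the score.
    matcher = SequenceMatcher(None, original_words, converted_words, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    before = "Online OCR kept the layout of my test document"
    after = "Online OCR kept the layout of my text document"
    print(f"{similarity(before, after):.0%} similar")
```

A score near 1.0 means nearly every word survived the round trip in the same order; a much lower score is a hint to try a different tool.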
Free Online OCR
Since we’re talking about image-to-text OCR, let me mention another good site that works great on images. Free Online OCR was very good and very accurate when extracting text from my test images. I took a few photos with my iPhone of pages from books, flyers, etc., and I was amazed at its text conversion capabilities.
Select your file and then click the Upload button. On the next screen, there are several options and a preview of the image. You can crop it if you don’t want to OCR the whole thing. Then just click the OCR button and your converted text will appear below the image preview. It also doesn’t have any limits, which is really good.
In addition to online services, there are two freeware PDF converters that I would like to mention in case you need software running locally on your computer to do the conversion. With online services, you will always need an Internet connection, and that may not be possible for everyone. However, I’ve noticed that the conversion quality from the freeware programs is significantly worse than what the online services produce.
A-PDF Text Extractor
A-PDF Text Extractor is free software that does a pretty good job of extracting text from PDF files. After you download and install it, click the Open button to select your PDF file. Then click Extract Text to start the process.
It will ask you for a location to save the text output file and then it will start extracting. You can also click Option, which allows you to select only certain pages to extract and the type of extraction. The latter option is interesting because it extracts text in different layouts, and it’s worth trying all three to see which one gives you the best results.
Pilot PDF2Text
Pilot PDF2Text extracts text with no configuration at all. It doesn’t have any options; you just add files or folders, convert and hope for the best. It works fine on some PDF files, but for the majority of them there are a lot of problems.
Just click Add file and then click Convert. When the conversion is complete, click Browse to open the file. Your mileage will vary using this program, so don’t expect much.
Also, it’s worth mentioning that if you’re in a corporate environment or can get a copy of Adobe Acrobat from work, you can actually achieve much better results. Acrobat is obviously not free, but it has options to convert PDF to Word, Excel and HTML formats. It also does the best job maintaining the structure of the original document and converting complex text.