Sunday, 15 January 2012

Converting PDF Files To Text Or HTML From Linux Terminal

Earlier, we saw how to merge PDF files from the terminal. This time, I am sharing two command-line tools that convert PDF files to text or HTML.

Poppler Utils is a great package of PDF rendering and conversion tools, and it must be installed before we can convert PDF files to text or HTML. On a Debian-based distro, you can install poppler-utils with the following command. On other distros, use the corresponding package manager.

sudo apt-get install poppler-utils
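
Before going further, it can help to confirm that both tools actually ended up on your PATH. The following is only a minimal sketch; the function name check_poppler_tools is my own and is not part of poppler-utils.

```shell
# Report whether pdftotext and pdftohtml can be found.
# check_poppler_tools is a hypothetical helper name, not a poppler command.
check_poppler_tools() {
    missing=0
    for tool in pdftotext pdftohtml; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
            missing=1
        fi
    done
    # Exit status is 0 only when both tools were found.
    return "$missing"
}
```

Paste the function into your shell and run check_poppler_tools. If either tool is reported as missing, install poppler-utils again and re-check.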

Now that poppler-utils is installed, we can convert PDF files to text and HTML using the pdftotext and pdftohtml command-line tools.

PDF to Text


To convert a PDF file to text, use the pdftotext command. The following is the simplest form of the command for converting a PDF file to a text file.

pdftotext file.pdf file.txt

This command can also preserve the original layout of the PDF file, as closely as it can, with the -layout switch:

pdftotext -layout file.pdf file.txt
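
If you have a whole folder of PDF files, the same -layout conversion can be wrapped in a small loop. This is only a sketch under a few assumptions: convert_all_pdfs is a hypothetical helper name, the files are assumed to end in lowercase .pdf, and each .txt file is written next to its PDF.

```shell
# Convert every *.pdf in a directory to a .txt file beside it, keeping layout.
# convert_all_pdfs is a made-up helper name; the directory defaults to ".".
convert_all_pdfs() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If no PDF matched, the glob stays literal; skip it.
        [ -e "$pdf" ] || continue
        txt="${pdf%.pdf}.txt"
        pdftotext -layout "$pdf" "$txt" && echo "wrote $txt"
    done
}
```

For example, convert_all_pdfs ~/Documents would try to convert each PDF in that folder and print one line per file it wrote.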

Similarly, if you wish to convert only a specific range of pages, you can use the -f and -l switches to specify the first and last page to convert. The example below converts pages 4 to 8 into text.

pdftotext -f 4 -l 8 file.pdf file.txt
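
pdftotext also accepts - in place of the output file name, which sends the text to standard output instead of a file. That makes it easy to search a page range without creating a text file first. The following is a rough sketch; grep_pdf_pages is a made-up helper name, and the arguments are passed straight through to pdftotext and grep.

```shell
# Search pages first..last of a PDF for a pattern, printing matching lines
# with their line numbers. grep_pdf_pages is a hypothetical helper name.
grep_pdf_pages() {
    first=$1
    last=$2
    pdf=$3
    pattern=$4
    # "-" as the output name makes pdftotext write to standard output.
    pdftotext -f "$first" -l "$last" "$pdf" - | grep -n -- "$pattern"
}
```

For example, grep_pdf_pages 4 8 file.pdf Linux would look for the word Linux on pages 4 to 8 only.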

Check the man page of pdftotext (man pdftotext) or run pdftotext -h to explore the other options.

PDF to HTML


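To convert a PDF file to an HTML file, you can use the pdftohtml tool available in the poppler-utils package. Before that, I will show how to use the pdftotext command to produce an HTML file. With the -htmlmeta switch, pdftotext generates a simple HTML file that wraps the extracted text and includes the document's meta information.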

pdftotext -f 4 -l 8 -htmlmeta file.pdf file.html

Using the pdftohtml tool is not much different from using pdftotext. The simplest form is:

pdftohtml file.pdf file.html

You can use the same -f and -l arguments as in pdftotext to specify the page range. However, -htmlmeta and -layout are only available in pdftotext. I will let you explore the rest of pdftohtml's options on your own.
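
To make the page-range usage concrete, here is a small sketch that passes -f and -l to pdftohtml and builds the output name from the page numbers. The helper name pdf_range_to_html and the output naming scheme are my own assumptions, not something pdftohtml requires.

```shell
# Convert pages first..last of a PDF to HTML with pdftohtml.
# pdf_range_to_html and the "-pFIRST-LAST" suffix are hypothetical choices.
pdf_range_to_html() {
    first=$1
    last=$2
    pdf=$3
    out="${pdf%.pdf}-p${first}-${last}.html"
    pdftohtml -f "$first" -l "$last" "$pdf" "$out"
}
```

For example, pdf_range_to_html 4 8 file.pdf runs pdftohtml on pages 4 to 8 of file.pdf. pdftohtml may write more than one file for a single conversion, so have a look at what appears in the directory afterwards.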

I hope this information is useful for you. :)