Zend Lucene And PDF Documents Part 2: PDF Data Extraction

26th October 2009 - 10 minutes read time

Last time we looked at viewing and saving metadata in PDF documents using Zend Framework. The next step, before we try to index them with Zend Lucene, is to extract the data from the documents themselves. I should note here that we can't extract the data perfectly from every PDF document, and we certainly can't turn any images or tables in the PDF into recognisable text. Extracting the text is a little awkward because we are essentially looking at compressed data. The text isn't saved into the document as plain text; it is rendered into the document using a font. So what we need to do is extract this data into a format that Lucene can tokenize. Because we are only getting the text out of the document for our search index, we can take a few short-cuts in order to get as much textual data out of the document as possible. Not all of this data will be fully readable, and we will definitely lose any formatting and images, but for our purposes we don't really need them. The idea is to retrieve as much relevant, indexable content as possible for Zend Lucene to tokenize. Also, it is not possible to extract the data from encrypted PDF documents.
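To make the idea of "compressed data" concrete, here is a minimal sketch of the kind of short-cut described above. It is not the Zend Framework API and not necessarily the approach the full article takes: extractPdfText() is a hypothetical helper name, and the sketch assumes an unencrypted PDF whose content streams use the FlateDecode (zlib) filter and whose text is stored as literal strings in parentheses. It ignores font encodings, so some output may not be readable.

```php
<?php
// Hypothetical helper (not part of Zend Framework): a rough text extractor.
function extractPdfText(string $pdf): string
{
    $words = [];

    // Stream bodies sit between the "stream" and "endstream" keywords.
    preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $streams);

    foreach ($streams[1] as $raw) {
        // Assumption: the stream uses the FlateDecode filter (zlib data).
        // Images, other filters and encrypted data fail here and are skipped.
        $content = @gzuncompress($raw);
        if ($content === false) {
            continue;
        }

        // Text is drawn between the BT (begin text) and ET (end text) operators.
        preg_match_all('/\bBT\b(.*?)\bET\b/s', $content, $blocks);

        foreach ($blocks[1] as $block) {
            // Literal strings are written in parentheses, e.g. (Hello) Tj.
            preg_match_all('/\(((?:\\\\.|[^\\\\()])*)\)/s', $block, $strings);
            foreach ($strings[1] as $string) {
                // Turn escapes such as \( and \) back into plain characters.
                $words[] = stripcslashes($string);
            }
        }
    }

    // Formatting is lost; the pieces are simply joined with spaces.
    return implode(' ', $words);
}
```

Calling extractPdfText(file_get_contents('document.pdf')), where document.pdf is a placeholder file name, would return whatever literal strings the sketch can find. Joining the pieces with spaces can split words that the PDF stores in several fragments, which matters less for a search index than it would for display.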

Install PHPDocumentor

21st January 2009 - 2 minutes read time

PHPDocumentor is a fast and convenient way of creating API documentation for your PHP programs and classes. If you are familiar with the world of Java, it works in much the same way as the JavaDoc program; indeed, it is based on that program.

PHPDocumentor can be run in a number of different ways, but I have found that the easiest way is, again, to use PEAR to install everything you need. To install PHPDocumentor using PEAR, use the following command.

pear install phpdocumentor

To run PHPDocumentor and see a list of commands just type in the following:

phpdoc -h

To run PHPDocumentor you need to provide a couple of options. These are: