Zend Lucene And PDF Documents Part 3: Indexing The Documents

Note: This post is over two years old and so the information contained here might be out of date. If you do spot something please leave a comment and we will endeavour to correct.

5th November 2009 - 22 minutes read time

Last time we reached the stage where the PDF meta data and the extracted contents of our PDF documents were ready to be fed into our search indexing classes so that we can search them.

The first thing needed is a couple of configuration options. These control where our Lucene index and the PDF files to be indexed are kept. Add the following options to your configuration file (called application.ini if you used Zend Tool to create your application).

luceneIndex = \path\to\lucene\index
filesDirectory = \path\to\pdf\files\

You can load these options into the Zend_Registry by using the following method in your Bootstrap.php file.

protected function _initConfigLoad()
{
    $config = new Zend_Config_Ini(APPLICATION_PATH . '/configs/application.ini', APPLICATION_ENV);
    Zend_Registry::set('config', $config);
}

The APPLICATION_ENV constant should be defined in the index file.
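
For reference, a Zend Tool project normally defines this constant near the top of public/index.php. The following is a sketch of that pattern, assuming the environment name is supplied through an APPLICATION_ENV environment variable and falls back to 'production' when it is not set. Whatever value you end up with needs a matching section in application.ini.

```php
<?php
// Define the application environment unless something else (for example
// a test bootstrap) has already defined it.
defined('APPLICATION_ENV')
    || define('APPLICATION_ENV', (getenv('APPLICATION_ENV') ? getenv('APPLICATION_ENV') : 'production'));

echo APPLICATION_ENV . PHP_EOL;
```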

Next we need some way of getting our application to index all of our PDF files. To do this I created an action in the Index controller called indexpdfsAction(). This function runs through every file in our PDF folder and adds it to our Lucene index. When it has finished indexing, the action sends the number of documents in the index and the index size to the view so that we can see how many files were indexed. The following code block contains the full source code for that action. Note that this code won't run yet because most of the code behind it doesn't exist.

public function indexpdfsAction()
{
    $config    = Zend_Registry::get('config');
    $appLucene = App_Search_Lucene::open($config->luceneIndex);

    // glob() can return false on error, so fall back to an empty array.
    $globOut = glob($config->filesDirectory . '*.pdf');
    if ($globOut === false) {
        $globOut = array();
    }

    // Add (or re-add) every PDF file found to the index.
    foreach ($globOut as $filename) {
        App_Search_Lucene_Index_Pdfs::index($filename, $appLucene);
    }
    $appLucene->commit();

    // count() is the index size (deleted documents included), while
    // numDocs() is the number of documents that have not been deleted.
    $this->view->indexSize = $appLucene->count();
    $this->view->documents = $appLucene->numDocs();
}

The first line of this code deals with getting our config object from the registry so that we can use it to find out where our Lucene index and PDF documents are on the file system.
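
One detail worth noting is that the action joins filesDirectory and '*.pdf' directly, so the configured path has to end with a directory separator. A small helper can make that more forgiving. This is only a sketch: findPdfFiles() is a hypothetical function name and is not part of the application code in this series.

```php
<?php
/**
 * List the PDF files in a directory, whether or not the configured path
 * ends with a separator. Hypothetical helper, not part of the original code.
 *
 * @param string $directory The directory holding the PDF files.
 * @return array            The matching file paths (empty if none).
 */
function findPdfFiles($directory)
{
    $pattern = rtrim($directory, '/\\') . DIRECTORY_SEPARATOR . '*.pdf';
    $files   = glob($pattern);

    // glob() returns false on error, so normalise that to an empty array.
    return ($files === false) ? array() : $files;
}

// Build a throwaway directory to try the helper against.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'pdfs_' . uniqid();
mkdir($dir);
touch($dir . DIRECTORY_SEPARATOR . 'one.pdf');
touch($dir . DIRECTORY_SEPARATOR . 'two.pdf');
touch($dir . DIRECTORY_SEPARATOR . 'notes.txt');

// Both spellings of the path find the same two PDF files.
echo count(findPdfFiles($dir)) . PHP_EOL;
echo count(findPdfFiles($dir . DIRECTORY_SEPARATOR)) . PHP_EOL;
```

In the action, the glob() call could then be swapped for findPdfFiles($config->filesDirectory).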

The next line calls the static open() method of a class called App_Search_Lucene and passes the location of our Lucene index. This class extends the Zend_Search_Lucene object so that we can have an extra level of control when adding and deleting documents from our index, as well as creating the index in the first place. We are using an extension to control how documents are added to our index because it is not possible to update a document through Zend_Search_Lucene; we must first delete the document before re-adding it to our index.

The open() method tries to open the Lucene index via a class called Zend_Search_Lucene_Proxy. If the index doesn't exist yet then it is created using the create() method. The Zend_Search_Lucene_Proxy object is used to provide a gateway between our application and the methods available in the Zend_Search_Lucene object, but still allow us to control how files are added to the index.

class App_Search_Lucene extends Zend_Search_Lucene
{
    /**
     * Create a new index.
     *
     * @param string $directory         The location of the Lucene index.
     * @return Zend_Search_Lucene_Proxy The Lucene index.
     */
    public static function create($directory)
    {
        return new Zend_Search_Lucene_Proxy(new App_Search_Lucene($directory, true));
    }
 
    /**
     * Open the index. If the index does not exist then one will be created and
     * returned.
     *
     * @param string $directory         The location of the Lucene index.
     * @return Zend_Search_Lucene_Proxy The Lucene index.
     */
    public static function open($directory)
    {
        try {
            // Attempt to open the index.
            return new Zend_Search_Lucene_Proxy(new App_Search_Lucene($directory, false));
        } catch (Exception $e) {
            // Return a newly created index using the create method of this class.
            return self::create($directory);
        }
    }
 
    /**
     * Add a document to the index.
     *
     * @param Zend_Search_Lucene_Document $document The document to be added.
     * @return Zend_Search_Lucene                   
     */
    public function addDocument(Zend_Search_Lucene_Document $document)
    {
        // Search for documents with the same Key value.
        $term = new Zend_Search_Lucene_Index_Term($document->Key, 'Key');
        $docIds = $this->termDocs($term);
 
        // Delete any documents found.
        foreach ($docIds as $id) {
            $this->delete($id);
        }
 
        return parent::addDocument($document);
    }
}
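
To show the delete-then-add idea without needing Zend Framework installed, here is a minimal sketch built around a toy in-memory index. ToyIndex is a hypothetical class written purely for illustration. Its method names mirror the Zend_Search_Lucene ones used above, but it is not the library and it makes no attempt to copy how the library stores anything. It also assumes, as I understand the library's behaviour, that deleted documents still count towards the index size until the index is optimised.

```php
<?php
/**
 * Toy in-memory index, used only to illustrate delete-then-add.
 * It is NOT Zend_Search_Lucene; the method names merely mirror it.
 */
class ToyIndex
{
    /** @var array Every document ever added, keyed by internal id. */
    private $docs = array();

    /** @var array Ids that have been marked as deleted. */
    private $deleted = array();

    /** Return the ids of live documents whose Key field matches. */
    public function termDocs($key)
    {
        $ids = array();
        foreach ($this->docs as $id => $doc) {
            if ($doc['Key'] === $key && !isset($this->deleted[$id])) {
                $ids[] = $id;
            }
        }
        return $ids;
    }

    /** Mark a document as deleted; the slot itself is kept. */
    public function delete($id)
    {
        $this->deleted[$id] = true;
    }

    /** Delete any live document sharing the Key, then append the new one. */
    public function addDocument(array $doc)
    {
        foreach ($this->termDocs($doc['Key']) as $id) {
            $this->delete($id);
        }
        $this->docs[] = $doc;
    }

    /** Total slots, including deleted documents. */
    public function count()
    {
        return count($this->docs);
    }

    /** Live documents only. */
    public function numDocs()
    {
        return count($this->docs) - count($this->deleted);
    }
}

$index = new ToyIndex();
$index->addDocument(array('Key' => md5('/files/a.pdf'), 'Title' => 'First version'));
$index->addDocument(array('Key' => md5('/files/a.pdf'), 'Title' => 'Second version'));
$index->addDocument(array('Key' => md5('/files/b.pdf'), 'Title' => 'Another file'));

echo $index->count() . ' slots, ' . $index->numDocs() . ' live documents' . PHP_EOL;
```

Indexing the same file twice leaves only one live copy of it, which is the behaviour the overridden addDocument() method is aiming for.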

The second object we use in this application is called App_Search_Lucene_Index_Pdfs. This has a single method called index() which takes two parameters. The first is the path to the PDF document and the second is the Lucene index. This method finds out as much information about the PDF document as possible and then adds the document to the Lucene index. It does this last part by creating an instance of an object called App_Search_Lucene_Document (explained later on in the post) and sending this to the addDocument() method of the App_Search_Lucene object.

This class is where all of the code from the previous two posts comes into play. After opening the PDF document the object reads the meta data into an array before extracting the textual content of the PDF and adding this to the array. This class is fairly self-explanatory and uses code that you have already seen in this series so I won't go into it in great detail. I have put comments within the code at key points to explain what is going on.

class App_Search_Lucene_Index_Pdfs
{
    /**
     * Extract data from a PDF document and add this to the Lucene index.
     *
     * @param string $pdfPath                       The path to the PDF document.
     * @param Zend_Search_Lucene_Proxy $luceneIndex The Lucene index object.
     * @return Zend_Search_Lucene_Proxy
     */
    public static function index($pdfPath, $luceneIndex)
    {
        // Load the PDF document.
        $pdf = Zend_Pdf::load($pdfPath);
        $key = md5($pdfPath);
 
        /**
         * Set up array to contain the document index data.
         * The Filename will be used to retrieve the document if it is found in
         * the search results.
         * The Key will be used to uniquely identify the document so we can
         * delete it from the search index when adding it.
         */
        $indexValues = array(
            'Filename'     => $pdfPath,
            'Key'          => $key,
            'Title'        => '',
            'Author'       => '',
            'Subject'      => '',
            'Keywords'     => '',
            'Creator'      => '',
            'Producer'     => '',
            'CreationDate' => '',
            'ModDate'      => '',
            'Contents'     => '',
        );
 
        // Go through each meta data item and add to index array.
        foreach ($pdf->properties as $meta => $metaValue) {
            switch ($meta) {
                case 'Title':
                    $indexValues['Title'] = $pdf->properties['Title'];
                    break;
                case 'Subject':
                    $indexValues['Subject'] = $pdf->properties['Subject'];
                    break;
                case 'Author':
                    $indexValues['Author'] = $pdf->properties['Author'];
                    break;
                case 'Keywords':
                    $indexValues['Keywords'] = $pdf->properties['Keywords'];
                    break;
                case 'CreationDate':
                    $dateCreated = $pdf->properties['CreationDate'];

                    // Convert date from the PDF format of D:20090731160351+01'00'.
                    // The timezone offset at the end is ignored, so the timestamp
                    // is built using the server's default timezone.
                    $dateCreated = mktime(
                        (int) substr($dateCreated, 10, 2), // hour
                        (int) substr($dateCreated, 12, 2), // minute
                        (int) substr($dateCreated, 14, 2), // second
                        (int) substr($dateCreated,  6, 2), // month
                        (int) substr($dateCreated,  8, 2), // day
                        (int) substr($dateCreated,  2, 4)  // year
                    );
                    $indexValues['CreationDate'] = $dateCreated;
                    break;
                case 'ModDate':
                    $indexValues['ModDate'] = $pdf->properties['ModDate'];
                    break;
            }
        }
 
        /**
         * Parse the contents of the PDF document and pass the text to the
         * contents item in the $indexValues array.
         */
        $pdfParse                = new App_Search_Helper_PdfParser();
        $indexValues['Contents'] = $pdfParse->pdf2txt($pdf->render());
 
        // Create the document using the values
        $doc = new App_Search_Lucene_Document($indexValues);
        if ($doc !== false) {
            // If the document creation was sucessful then add it to our index.
            $luceneIndex->addDocument($doc);
        }
 
        // Return the Lucene index object.
        return $luceneIndex;
    }
}

The Key attribute is used to uniquely identify the file in the index quickly and easily. For this class I have made it an MD5 hash of the file path, but this can be changed to something different.
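
The CreationDate conversion in index() reads fixed character positions and does not end up applying the timezone offset at the end of the string. If the offset matters to you, the following sketch shows one way of parsing the whole value. pdfDateToTimestamp() is a hypothetical helper that is not part of the classes in this series, and it assumes the date contains every part down to the seconds, which a PDF is not obliged to provide.

```php
<?php
/**
 * Convert a PDF date string such as D:20090731160351+01'00' into a Unix
 * timestamp. Hypothetical helper, not part of the original classes.
 *
 * @param string $pdfDate The raw date string from the PDF properties.
 * @return int|null       The timestamp, or null if the string does not match.
 */
function pdfDateToTimestamp($pdfDate)
{
    $pattern = "/^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(?:([+-])(\d{2})'?(\d{2})?'?|Z)?/";
    if (!preg_match($pattern, $pdfDate, $m)) {
        return null;
    }

    // Treat the date parts as if they were UTC first...
    $timestamp = gmmktime((int) $m[4], (int) $m[5], (int) $m[6], (int) $m[2], (int) $m[3], (int) $m[1]);

    // ...then take the offset back out so that local time becomes UTC.
    if (!empty($m[7])) {
        $offset = ((int) $m[8] * 3600) + ((isset($m[9]) ? (int) $m[9] : 0) * 60);
        $timestamp += ($m[7] === '+') ? -$offset : $offset;
    }

    return $timestamp;
}

// 16:03:51 at +01:00 is 15:03:51 in UTC.
echo gmdate('Y-m-d H:i:s', pdfDateToTimestamp("D:20090731160351+01'00'")) . PHP_EOL;
```

Returning null for anything that doesn't match leaves it up to the caller to decide whether to skip the field or store the raw string instead.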

The App_Search_Lucene_Document class is the class that needs the most explanation, mainly due to the decisions made when creating it. The class extends the Zend_Search_Lucene_Document class and so acts just like a normal Lucene document. The constructor is passed a single parameter containing an associative array of values that are to be written to the document. The rest of this class then deals with adding the items of the array to the document as field objects. The following code block contains the source code for this class.

class App_Search_Lucene_Document extends Zend_Search_Lucene_Document
{
 
    /**
     * Constructor.
     *
     * @param array $values An associative array of values to be used
     *                      in the document.
     */
    public function __construct($values)
    {
        // If either the Filename or the Key value is missing then reject the document.
        if (!isset($values['Filename']) || !isset($values['Key'])) {
            return false;
        }
 
        // Add the Filename field to the document as a Keyword field.
        $this->addField(Zend_Search_Lucene_Field::Keyword('Filename', $values['Filename']));
        // Add the Key field to the document as a Keyword.
        $this->addField(Zend_Search_Lucene_Field::Keyword('Key', $values['Key']));
 
        if (isset($values['Title']) && $values['Title'] != '') {
            // Add the Title field to the document as a Text field.
            $this->addField(Zend_Search_Lucene_Field::Text('Title', $values['Title']));
        }
 
        if (isset($values['Subject']) && $values['Subject'] != '') {
            // Add the Subject field to the document as a Text field.
            $this->addField(Zend_Search_Lucene_Field::Text('Subject', $values['Subject']));
        }
 
        if (isset($values['Author']) && $values['Author'] != '') {
            // Add the Author field to the document as a Text field.
            $this->addField(Zend_Search_Lucene_Field::Text('Author', $values['Author']));
        }
 
        if (isset($values['Keywords']) && $values['Keywords'] != '') {
            // Add the Keywords field to the document as a Keyword field.
            $this->addField(Zend_Search_Lucene_Field::Keyword('Keywords', $values['Keywords']));
        }
 
        if (isset($values['CreationDate']) && $values['CreationDate'] != '') {
            // Add the CreationDate field to the document as a Text field.
            $this->addField(Zend_Search_Lucene_Field::Text('CreationDate', $values['CreationDate']));
        }
 
        if (isset($values['ModDate']) && $values['ModDate'] != '') {
            // Add the ModDate field to the document as a Text field.
            $this->addField(Zend_Search_Lucene_Field::Text('ModDate', $values['ModDate']));
        }
 
        if (isset($values['Contents']) && $values['Contents'] != '') {
            // Add the Contents field to the document as an UnStored field.
            $this->addField(Zend_Search_Lucene_Field::UnStored('Contents', $values['Contents']));
        }
    }
}

There are five different types of field objects available (Keyword, UnIndexed, Binary, Text and UnStored), and each acts in a different way. The three field types I have used in this class are Keyword, Text and UnStored. Here is a brief explanation of each and why they were chosen.

Keyword - These fields are stored and indexed and so are available when searching. However, no processing or tokenizing is done on the string so the entire string is stored as is. I selected this for the Filename and Key fields because it is best for strings that are searched for in full.
Text - Text fields are stored, indexed and tokenized. The majority of fields in this class are stored as text fields as they shouldn't be too long and can be displayed in the search results if we need them to.
UnStored - UnStored fields are treated like Text fields except that they are not stored within the index. This type of field is ideal when dealing with potentially large amounts of text and as a result the Contents are dealt with in this way. Rather than take up disk space and store the entire document in the index it is best to allow Lucene to tokenize the text so that it can be searched. Using the UnStored field means that we can't print out the field in our search results, but there isn't any need to here as we will have all the information needed to describe and provide a link to our PDF documents.

See the Zend Search Lucene field types documentation for more information about the different types of fields available in Zend_Search_Lucene.

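The App_Search_Lucene_Document constructor checks that the Filename and Key values are present and tries to return false if they are not. Be aware that PHP ignores a value returned from a constructor, so new always produces an object and the $doc !== false check in App_Search_Lucene_Index_Pdfs will always pass; throwing an exception would be a more reliable way to reject a document. It would be possible to create the Key within this class but it made sense to keep the processing of the data in the App_Search_Lucene_Index_Pdfs class and the document creation in the App_Search_Lucene_Document class completely separate.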
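
To see the constructor behaviour in isolation, here is a minimal sketch using two hypothetical stand-in classes. Neither ReturnsFalse nor ThrowsInstead is part of the application or of Zend Framework; they only demonstrate what PHP does with a value returned from a constructor, and how an exception could be used to reject bad input instead.

```php
<?php
class ReturnsFalse
{
    public function __construct(array $values)
    {
        if (!isset($values['Key'])) {
            return false; // Discarded: "new" still yields an object.
        }
    }
}

class ThrowsInstead
{
    public function __construct(array $values)
    {
        if (!isset($values['Key'])) {
            throw new InvalidArgumentException('A Key value is required.');
        }
    }
}

// The object is created even though the constructor returned false.
$doc = new ReturnsFalse(array());
var_dump($doc !== false);

// With an exception the caller can actually detect the rejection.
try {
    new ThrowsInstead(array());
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
var_dump($rejected);
```

If you go down the exception route, the call to addDocument() in App_Search_Lucene_Index_Pdfs would sit inside a try block rather than behind a comparison with false.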

Remember that all class names are dependent on their location within the library folder of our application. So the class App_Search_Lucene_Index_Pdfs lives in a file called Pdfs.php, which would be located in \App\Search\Lucene\Index. For clarity I have written out where each file should be located.

--application
--library
----App 
------Search
--------Helper
----------PdfParser.php
--------Lucene
----------Index
------------Pdfs.php
----------Document.php
--------Lucene.php
----Zend

This directory structure allows for future change if we want to add different file types to our indexing service, or even change to a different search engine like Xapian.

In the next instalment of Zend Lucene and PDF documents I will be showing you how to add a search form to the application, so that we can search for the documents we have indexed. I will be making all of the source code available in the final episode so keep posted if you want to get hold of it.

Comments

Hello, Thank you very much for this tutorials, i have a question: In your class App_Search_Lucene_Index_Pdfs you wrote $pdfParse = new App_Search_Helper_PdfParser(); normally the class App_Search_Helper_PdfParser must be defined! where can i find it? thnks in advance.

Omar

Tue, 01/08/2013 - 09:49

Check out page 2, the class is listed there:
http://www.hashbangcode.com/blog/zend-lucene-and-pdf-documents-part-2-p…

The full source code is also available on github:
https://github.com/philipnorton42/PDFSearch

philipnorton42

Tue, 01/08/2013 - 12:12
