Zend Lucene And PDF Documents Part 1: PDF Meta Data

Thursday, October 22, 2009 - 13:08

Zend Lucene is a powerful search engine, but it takes a bit of setting up to get it working properly. One thing I have had trouble with in the past is indexing and searching PDF documents. The difficulty is that it isn't immediately apparent how to index the contents of a PDF document. I came across a couple of functions you can try, but even if they don't work it is still possible to create and edit PDF meta data using the Zend_Pdf library. Because there is a lot to cover on this subject I have split it into a multi-part blog post. In this post I will look at how to add and edit this meta data, which can be used to classify your PDF documents so that you can index them and provide a decent search solution using Zend Lucene.

Listing The Files

Before we can do anything with the files we need to list them so we can access them. You can use any method you like to do this, but the following code uses glob() to get a list of PDF files and sends them to a view. The example also uses a config value that points to the directory containing our PDF files. Note that the pattern is built by appending '*.pdf' directly to that value, so it needs to end with a directory separator.

$config = Zend_Registry::get('config');
$globOut = glob($config->filesDirectory . '*.pdf');
if ($globOut !== false && count($globOut) > 0) { // glob() returns false on error, so check before counting
    $files = array();
    foreach ($globOut as $filename) {
        $files[] = $filename;
    }
    $this->view->files = $files;
} else {
    $this->view->message = 'No files found.<br />';
}
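The listing step can also be wrapped in a small function that does not depend on the config value ending with a separator. This is only a minimal sketch: listPdfFiles() is a hypothetical helper name, not something provided by Zend Framework.

```php
<?php
/**
 * Return the PDF files found directly inside $directory.
 * listPdfFiles() is a hypothetical helper, not part of Zend Framework.
 */
function listPdfFiles($directory)
{
    // Normalise the directory so it always ends with exactly one separator.
    $directory = rtrim($directory, '/\\') . DIRECTORY_SEPARATOR;

    // glob() returns false on error, so fall back to an empty array.
    $files = glob($directory . '*.pdf');
    if ($files === false) {
        return array();
    }

    return $files;
}
```

In the controller this could be called as listPdfFiles($config->filesDirectory), with the result passed to the view only when it is not empty.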

The view might look something like this.

<?php
if ($this->message) {
    echo $this->message;
}
 
if (is_array($this->files)) {
    echo '<ul>';
    foreach ($this->files as $file) {
        echo '<li>'.$this->escape($file). // escape the file name before printing it
            ' <a href="'.$this->url(array('controller'=>'pdf',
                'action'=>'viewmeta',
                'file' => urlencode($file)),'default', true).'" title="View PDF Meta">View Meta</a>' .
            ' <a href="'.$this->url(array('controller'=>'pdf',
                'action'=>'editmeta',
                'file' => urlencode($file)),'default', true).'" title="Edit PDF Meta">Edit Meta</a>'
            .'</li>';
    }
    echo '</ul>';
}

Viewing PDF Meta Data

Viewing PDF meta data is quite easy using the Zend_Pdf object. The first step is to load the PDF into a data structure that the Zend_Pdf class can work with. The second step is to extract the properties you are interested in from the properties array of the Zend_Pdf object. The following code example is an action taken from a controller that extracts the needed meta data from the PDF document (the keys listed in the $metaValues array) and passes it to the view for printing. The location of the PDF document is encoded (using urlencode) into the URL as the file parameter. The properties array is a public associative array that allows you to store and retrieve the meta data information. Not every meta data value will always be set, so we first make sure that the key exists in the properties array before trying to access it.

public function viewmetaAction()
{
    $pdfPath = urldecode($this->_request->getParam('file'));
    $pdf = Zend_Pdf::load($pdfPath);
 
    $metaValues = array('Title'        => '',
                        'Author'       => '',
                        'Subject'      => '',
                        'Keywords'     => '',
                        'Creator'      => '',
                        'Producer'     => '',
                        'CreationDate' => '',
                        'ModDate'      => '',
    );
 
    foreach ($metaValues as $meta => $metaValue) {
        if (isset($pdf->properties[$meta])) {
            $metaValues[$meta] = $pdf->properties[$meta];
        } else {
            $metaValues[$meta] = '';
        }
    }
 
    $this->view->file = $pdfPath;
    $this->view->metaValues = $metaValues;
}
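The isset() loop in the action above could be factored into a small function so that it can be reused wherever meta data is read. The following is a sketch only: extractMeta() is a hypothetical name, and a plain array stands in for $pdf->properties so that the example runs without Zend_Pdf.

```php
<?php
/**
 * Copy the requested keys out of a properties array, using '' for any
 * key that is missing. extractMeta() is a hypothetical helper name.
 */
function extractMeta(array $properties, array $keys)
{
    $meta = array();
    foreach ($keys as $key) {
        $meta[$key] = isset($properties[$key]) ? $properties[$key] : '';
    }
    return $meta;
}

// Stand-in for $pdf->properties so the example runs without Zend_Pdf.
$properties = array('Title' => 'Annual Report', 'Producer' => 'Example Writer');
$metaValues = extractMeta($properties, array('Title', 'Author', 'Producer'));
```

Here 'Author' is not present in the stand-in array, so it comes back as an empty string while the other two values are copied across.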

The $metaValues array contains the keys that Zend_Pdf uses as its standard set of meta data. For more information on the default values see the Zend_Pdf meta data information on the Zend Framework website. It is perfectly possible to create as many additional values as you want and store them alongside the standard ones. I will come to this in the next section, where we edit the meta data. I will also leave the creation of the associated view for this action up to the reader, as it shouldn't be too hard.
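The CreationDate and ModDate values are normally stored as PDF date strings in the form D:YYYYMMDDHHmmSSOHH'mm' rather than as timestamps. If you want to display or index them as real dates, a converter might look something like the sketch below. parsePdfDate() is a hypothetical helper, it only handles the fully specified form of the string, and treating a missing offset as UTC is an assumption made for the example.

```php
<?php
/**
 * Convert a PDF date string such as "D:20091022130800+01'00'" into a
 * DateTime object, or return null if the string does not match.
 * parsePdfDate() is a hypothetical helper, not part of Zend_Pdf.
 */
function parsePdfDate($value)
{
    // Year, month, day, hour, minute, second, then "Z" or a +/- offset.
    $pattern = '/^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})'
             . '(?:Z|([+\-])(\d{2})\'?(\d{2})\'?)?$/';
    if (!preg_match($pattern, $value, $m)) {
        return null;
    }

    // Assumption: no offset (or "Z") is treated as UTC.
    $offset = '+00:00';
    if (isset($m[7]) && $m[7] !== '') {
        $offset = $m[7] . $m[8] . ':' . $m[9];
    }

    $iso = sprintf('%s-%s-%sT%s:%s:%s%s', $m[1], $m[2], $m[3], $m[4], $m[5], $m[6], $offset);

    try {
        return new DateTime($iso);
    } catch (Exception $e) {
        // The digits matched the pattern but did not form a valid date.
        return null;
    }
}
```

A value read from $pdf->properties['CreationDate'] could be passed through this function before being sent to the view, with a null result treated as "no date available".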

Editing PDF Meta Data

First off we will create a form that allows us to edit the meta data of our PDF document. This form only contains one item (the title), but adding the others isn't difficult: just copy and paste the title element and change the parameters to suit your needs. Using a form here is useful because we can use the built-in validation features of Zend_Form to prevent users from entering silly data into our PDFs. The only thing to make sure of is that the name of each field (the second parameter of addElement) matches the name of the meta data key.

class Form_PdfMeta extends Zend_Form
{
    public function init()
    {
        // set the method for the display form to POST
        $this->setMethod('post');
 
        // Add text area element for title.
        $this->addElement('textarea', 'Title', array(
                          'label'      => 'Title',
                          'required'   => true,
                          'rows'       => '10',
                          'cols'       => '50',
                          'filters'    => array('StringTrim'),
                          'validators' => array(
                                          array('validator' => 'StringLength', 'options' => array(0,500)))
                          ));
                          
        // add the submit button
        $this->addElement('submit', 'submit', array('label' => 'Save'));                          
    }
}

With our form created we now need a controller action that allows us to edit the meta data. Rather than go through every line and explain what is going on, I have commented the code to make it more understandable. One thing to watch out for is what happens if the PDF document doesn't exist. You might want to send a warning to the view, redirect the user to an error page or simply throw an exception. However, this is an exercise for the reader.

The SLASH constant contains a cross-platform directory separator, which I have written about previously. It is used to extract the name of the file from the path passed in the URL.
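To illustrate what the file name extraction does, here is a short standalone sketch. It assumes that SLASH is simply an alias for PHP's built-in DIRECTORY_SEPARATOR constant, and the path used is made up for the example.

```php
<?php
// Assumption: SLASH is a simple alias for PHP's built-in constant.
if (!defined('SLASH')) {
    define('SLASH', DIRECTORY_SEPARATOR);
}

// Hypothetical path, built with SLASH so it is valid on any platform.
$pdfPath = SLASH . 'var' . SLASH . 'files' . SLASH . 'report.pdf';

// The approach used in the action: cut everything up to the last separator.
$viaSubstr = substr($pdfPath, strrpos($pdfPath, SLASH) + 1);

// basename() gives the same result for this path.
$viaBasename = basename($pdfPath);
```

Both variables end up holding 'report.pdf'. The two approaches differ when the path contains no separator at all: strrpos() then returns false, so the substr() version drops the first character of the name, whereas basename() returns the name unchanged.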

The $metaValues array contains the keys of the values you want to edit. You can enter any number of elements here. The only catch is that you also need to create a form element for each entry. The more adventurous of you could create a form class that builds the form around the elements in this array.
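The idea of building the form from the array could be approached by generating one element definition per key. The sketch below is an assumption about how that might be structured rather than a finished form class: buildMetaElementSpecs() is a hypothetical helper, and it produces plain arrays so that it runs without Zend_Form being loaded.

```php
<?php
/**
 * Build one element definition per meta data key.
 * buildMetaElementSpecs() is a hypothetical helper; each entry holds the
 * three arguments that Zend_Form::addElement() takes (type, name, options).
 */
function buildMetaElementSpecs(array $keys)
{
    $specs = array();
    foreach ($keys as $key) {
        $specs[] = array('text', $key, array(
            'label'   => $key,
            'filters' => array('StringTrim'),
        ));
    }
    return $specs;
}

$specs = buildMetaElementSpecs(array('Title', 'Author', 'Keywords'));

// Inside Form_PdfMeta::init() the definitions could then be applied with:
// foreach ($specs as $spec) {
//     $this->addElement($spec[0], $spec[1], $spec[2]);
// }
```

Because the element name is taken straight from the key, the rule that field names must match the meta data keys is satisfied automatically.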

public function editmetaAction()
{
    // Get the form and send to the view.
    $form = new Form_PdfMeta();
    $this->view->form = $form;
 
    // Get the file and send the location to the view.
    $pdfPath          = urldecode($this->_request->getParam('file'));
    $file             = substr($pdfPath, strrpos($pdfPath, SLASH)+1);
    $this->view->file = $file;    
    
    // Define what meta data we are looking at.
    $metaValues = array('Title' => '');
 
    if ($this->_request->isPost()) {
        // Get the current form values.
        $formData = $this->_request->getPost();
        if ($form->isValid($formData)) {
            // Form values are valid.
       
            // Save the contents of the form to the associated meta data fields in the PDF.
            $pdf = Zend_Pdf::load($pdfPath);
            foreach ($metaValues as $meta => $metaValue) {
                if (isset($formData[$meta])) {
                    $pdf->properties[$meta] = $formData[$meta];
                } else {
                    $pdf->properties[$meta] = '';
                }
            }
            $pdf->save($pdfPath);
            
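            // Redirect the user to the list action of this controller.
            // The code is set before redirecting; 303 is used because a 301 after a POST would be treated as permanent.
            return $this->_helper->redirector->setCode(303)->gotoSimple('list', 'pdf');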
        } else {
            // Form values are not valid send the current values to the form.
            $form->populate($formData);
        }
    } else {
        // Make sure the file exists before we start doing anything with it.
        if (file_exists($pdfPath)) {
            // Extract any current meta data values from the PDF document
            $pdf = Zend_Pdf::load($pdfPath);
            foreach ($metaValues as $meta => $metaValue) {
                if (isset($pdf->properties[$meta])) {
                    $metaValues[$meta] = $pdf->properties[$meta];
                } else {
                    $metaValues[$meta] = '';
                }
            }
            // Populate the form with our metadata values.
            $form->populate($metaValues);            
        } else {
            // File doesn't exist.
        }
    }
}
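The empty "file doesn't exist" branch in the action above could be filled in along the following lines. This is only a sketch of one option (throwing an exception): resolvePdfPath() is a hypothetical helper, and the extra check that the file sits inside the configured files directory is an assumption about what you would want, not something the original action does.

```php
<?php
/**
 * Resolve $requestedPath and make sure it is an existing file inside
 * $baseDirectory. resolvePdfPath() is a hypothetical helper.
 *
 * @throws InvalidArgumentException if the file is missing or outside the directory
 */
function resolvePdfPath($requestedPath, $baseDirectory)
{
    $base = realpath($baseDirectory);
    $path = realpath($requestedPath);

    // realpath() returns false when the path does not exist.
    if ($base === false || $path === false || !is_file($path)) {
        throw new InvalidArgumentException('PDF file not found.');
    }

    // Reject paths that resolve to somewhere outside the files directory.
    if (strpos($path, $base . DIRECTORY_SEPARATOR) !== 0) {
        throw new InvalidArgumentException('PDF file is outside the files directory.');
    }

    return $path;
}
```

The action could call this once, straight after decoding the file parameter, and catch the exception to show a message or redirect. That would also cover the POST branch, which currently loads the PDF without checking that it exists.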

Our view for this action would look like the following code block. It prints the form on the left (including any values assigned during the action) and shows the PDF itself on the right. This means that our users can see the PDF document whose meta data they are editing, which should help them enter accurate values and so provide a better search service.

<div style="float:left;width:39%;">
<?php echo $this->form; ?>
</div>
<iframe src="<?php echo $this->baseUrl();?>/files/<?php echo $this->file; ?>" width="60%"
style="height:50em" align="right"
></iframe>

That's pretty much it. If you want all of the source code for this project, I will be posting it as a downloadable zip file in the last post of the series, so stay tuned to grab it.

In the next Zend Lucene and PDF Documents post I will be looking at how to extract the contents of the PDF document so we can also index this along with the metadata.

Philip Norton

Phil is the founder and administrator of #! code and is an IT professional working in the North West of the UK.

Comments

What about a licence remark? Is this reuseable for other open source purposes?
