Zend Lucene And PDF Documents Part 2: PDF Data Extraction

Monday, October 26, 2009 - 23:33

Last time we looked at viewing and saving meta data to PDF documents using Zend Framework. The next step, before we try to index them with Zend Lucene, is to extract the text from the documents themselves. I should note here that we can't extract the data perfectly from every PDF document, and we certainly can't turn any images or tables in the PDF into recognisable text. Extracting the text is awkward because we are essentially looking at compressed data: the text isn't stored in the document as plain text, it is drawn onto each page using a font, and those drawing instructions are usually compressed. So what we need to do is extract this data into a format that Lucene can tokenize. Because we are only getting the text out of the document for our search index, we can take a few short-cuts in order to get as much textual data out as possible. Not all of this data will be fully readable, and we will definitely lose any formatting and images, but for our purposes we don't really need them. The idea is to retrieve as much relevant, indexable content as possible for Zend Lucene to tokenize. Also, it is not possible to extract the data out of encrypted PDF documents.

What we need to do first is set up a few things so that we can use a PDF extraction helper class to do the hard work for us. This requires a slightly greater understanding of Zend Framework than the last post did. We are going to register a namespace with Zend_Loader_Autoloader, which will allow us to create classes that are kept in a tidy folder structure and are automatically included when we need them. If you don't have one already, create a method called _initAutoload() or similar in your Bootstrap.php file and enter the following code (the whole class is included here for clarity). If you have already done this in your Zend Framework project you can skip this step.

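class Bootstrap extends Zend_Application_Bootstrap_Bootstrap
{
    protected function _initAutoload()
    {
        // Register the App_ class prefix so that classes such as
        // App_Search_Helper_PdfParser are autoloaded from library/App/.
        $autoloader = Zend_Loader_Autoloader::getInstance();
        $autoloader->registerNamespace(array('App_'));
    }
}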

This registers the App_ class prefix with the Zend Framework autoloader, so any class whose name starts with App_ is loaded from the App folder inside our library folder. Create a class called App_Search_Helper_PdfParser and put it in the folder \library\App\Search\Helper\ like this:

--application
--library
----App 
------Search
--------Helper
----------PdfParser.php
----Zend

Now we can instantiate the object without having to worry about whether the file has been included. The Zend Framework autoloader will work out where the file is from the class name and include it for us. We will use this folder structure for the rest of the application and build upon it as we add classes.
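As a rough illustration of the naming convention the autoloader relies on, the sketch below turns a class name into a relative file path by swapping underscores for directory separators. The classNameToPath() function is a hypothetical helper written for this example; it is not part of Zend Framework and only mimics the convention, not the real lookup along the include path.

```php
<?php
/**
 * Hypothetical helper (not a Zend Framework function): shows how an
 * underscore-separated class name corresponds to a file path.
 */
function classNameToPath($className)
{
    // Each underscore becomes a directory separator, then ".php" is added.
    return str_replace('_', '/', $className) . '.php';
}

echo classNameToPath('App_Search_Helper_PdfParser'), PHP_EOL;
// App/Search/Helper/PdfParser.php
```

The resulting path is relative, which is why the App folder has to sit in a directory that is on the include path, such as the library folder.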

What we need to do now is create the code that will run over our PDF document and pick out the text. I have to admit that I didn't write this entirely myself; it is the result of a couple of hours of picking bits and pieces of code from examples and applications until it did what I needed. I have tested this code with about 50 PDF documents from different sources, so it should be able to extract data from most PDF types. Essentially, the code splits the document into its separate objects and then tries to uncompress each one that has a FlateDecode filter. If the uncompression works (i.e. we have some data) then the text is added to a string, which is returned once the end of the document is reached. I have also added some string manipulation that strips out any odd characters or white space that we don't need. Here is the class in full. There is rather a lot of code, so I have commented it to make it clearer.

Also, because of the use of gzuncompress() you will need the zlib extension enabled in PHP on your server for this to work properly.
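Before relying on the class, it may be worth confirming that the zlib functions are available and seeing how gzuncompress() behaves on good and bad input. The following is a minimal sketch: the sample string is made up, and gzcompress() is used here only to simulate a FlateDecode stream, on the assumption that the stream holds zlib-format data.

```php
<?php
// gzcompress() and gzuncompress() are provided by PHP's zlib extension.
if (!extension_loaded('zlib')) {
    throw new RuntimeException('The zlib extension is not loaded.');
}

// Made-up content standing in for the body of a FlateDecode stream.
$original   = 'BT /F1 12 Tf (Hello World) Tj ET';
$compressed = gzcompress($original);

// The @ operator hides the warning raised for corrupt data;
// gzuncompress() then signals failure by returning false.
$restored = @gzuncompress($compressed);
$broken   = @gzuncompress('this is not zlib data');

var_dump($restored === $original); // bool(true)
var_dump($broken);                 // bool(false)
```

This is why the class only keeps a chunk when the uncompressed result actually contains something.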

class App_Search_Helper_PdfParser
{
    /**
     * Convert a PDF into text.
     *
     * @param string $data The rendered PDF data to extract the text from.
     * @return string The extracted text from the PDF
     */
    public function pdf2txt($data)
    {
        /**
         * Split apart the PDF document into sections. We will address each
         * section separately.
         */
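        $a_obj    = $this->getDataArray($data, "obj", "endobj");
        // Start with an empty list so the decode loop below is safe even
        // when no sections are found.
        $a_chunks = array();
        $j        = 0;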
 
        /**
         * Attempt to extract each part of the PDF document into a "filter"
         * element and a "data" element. This can then be used to decode the
         * data.
         */
        foreach ($a_obj as $obj) {
            $a_filter = $this->getDataArray($obj, "<<", ">>");
            if (is_array($a_filter) && isset($a_filter[0])) {
                $a_chunks[$j]["filter"] = $a_filter[0];
                $a_data = $this->getDataArray($obj, "stream", "endstream");
                if (is_array($a_data) && isset($a_data[0])) {
                    $a_chunks[$j]["data"] = trim(substr($a_data[0], strlen("stream"), strlen($a_data[0]) - strlen("stream") - strlen("endstream")));
                }
                $j++;
            }
        }
 
        $result_data = NULL;
 
        // decode the chunks
        foreach ($a_chunks as $chunk) {
            // Look at each chunk and decide if we can decode it by looking at the contents of the filter
            if (isset($chunk["data"])) {
                // look at the filter to find out which encoding has been used
                if (strpos($chunk["filter"], "FlateDecode") !== false) {
                    // Use gzuncompress but suppress error messages; it returns false on failure.
                    $data = @gzuncompress($chunk["data"]);
                    if ($data !== false && trim($data) != "") {
                        // If we got data then attempt to extract it.
                        $result_data .= ' ' . $this->ps2txt($data);
                    }
                }
            }
        }
        /**
         * Make sure we don't have large blocks of white space before and after
         * our string. Also extract alphanumerical information to reduce
         * redundant data.
         */
        $result_data = trim(preg_replace('/([^a-z0-9 ])/i', ' ', $result_data));
 
        // Return the data extracted from the document.
        if ($result_data == "") {
            return NULL;
        } else {
            return $result_data;
        }
    }
 
    /**
     * Strip out the text from a small chunk of data.
     *
     * @param  string $ps_data The chunk of data to convert.
     * @return string          The string extracted from the data.
     */
    public function ps2txt($ps_data)
    {
        // Stop this function returning bogus information from non-data string.
        if (ord($ps_data[0]) < 10) {
            return $ps_data;
        }
        if (substr($ps_data, 0, 8 ) == '/CIDInit') {
            return '';
        }
 
        $result = "";
 
        $a_data = $this->getDataArray($ps_data, "[", "]");
 
        // Extract the data.
        if (is_array($a_data)) {
            foreach ($a_data as $ps_text) {
                $a_text = $this->getDataArray($ps_text, "(", ")");
                if (is_array($a_text)) {
                    foreach ($a_text as $text) {
                        $result .= substr($text, 1, strlen($text) - 2);
                    }
                }
            }
        }
 
        // Didn't catch anything, try a different way of extracting the data
        if (trim($result) == "") {
            // the data may just be in raw format (outside of [] tags)
            $a_text = $this->getDataArray($ps_data, "(", ")");
            if (is_array($a_text)) {
                foreach ($a_text as $text) {
                    $result .= substr($text, 1, strlen($text) - 2);
                }
            }
        }
 
        // Remove any stray single characters left over, keeping the
        // one-letter words "a" and "I".
        $result = preg_replace('/\b([^ai])\b/i', ' ', $result);
        return trim($result);
    }
 
    /**
     * Convert a section of data into an array, separated by the start and end words.
     *
     * @param  string $data       The data.
     * @param  string $start_word The start of each section of data.
     * @param  string $end_word   The end of each section of data.
     * @return array              The array of data.
     */
    public function getDataArray($data, $start_word, $end_word)
    {
        $offset   = 0;
        $a_result = array();

        while (($start = strpos($data, $start_word, $offset)) !== false) {
            $end = strpos($data, $end_word, $start);
            if ($end === false) {
                break;
            }
            // Keep the section, including its start and end words.
            $end       += strlen($end_word);
            $a_result[] = substr($data, $start, $end - $start);
            // Resume after the end word so that the "obj" inside "endobj"
            // is not mistaken for the start of the next section.
            $offset = $end;
        }

        return $a_result;
    }
}
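To give a feel for what the class is working with once a stream has been uncompressed, here is a small standalone sketch. The content stream is made up, and extractLiteralStrings() is a hypothetical, simplified stand-in that uses a regular expression rather than the ps2txt() method above. It assumes the strings contain no escaped or nested parentheses.

```php
<?php
/**
 * Hypothetical helper: collect the literal strings (the text inside
 * parentheses) from an uncompressed content stream and join them.
 */
function extractLiteralStrings($stream)
{
    // Assumption: no escaped or nested parentheses inside the strings.
    preg_match_all('/\(([^()]*)\)/', $stream, $matches);
    return implode('', $matches[1]);
}

// Made-up stream: the TJ operator draws an array of strings, with
// numbers in between that adjust the spacing of the glyphs.
$stream = "BT\n/F1 12 Tf\n[(Hello ) -20 (Wor) 15 (ld)] TJ\nET";

echo extractLiteralStrings($stream), PHP_EOL; // Hello World
```

Notice that a single word can be split across several strings, which is one reason the extracted text is not always perfectly readable.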

To use this within your application just instantiate the object and call the pdf2txt() method, passing in the rendered PDF string as the parameter. Rather than get this object to open the file a second time (after it has already been opened to inspect the PDF meta data), I decided to use the Zend_Pdf object to transfer the data into the class. The following code shows how to load a PDF using Zend_Pdf and pass the rendered string to the pdf2txt() method.

$pdf = Zend_Pdf::load($pdfPath);
$pdfParse = new App_Search_Helper_PdfParser();
$contents = $pdfParse->pdf2txt($pdf->render());

What we should be left with after this process is a block of text that we can use in our search index.

In the next post I will tie together the meta data and the contents retrieval and use them to index our PDF documents using Zend Lucene. Again, I will make all of the source code available for this project in the final instalment, so stay tuned if you would like it.

Philip Norton

Phil is the founder and administrator of #! code and is an IT professional working in the North West of the UK.

Comments

Hi, Thanks for this wonderful tutorial.. Very informative and well written. It really helped me out with my project! I do however have a question on which I would appreciate your response - After converting a pdf to text, there is a LOT of copyright info, font info etc at the bottom of the content.. is there any reliable way to get rid of it? The problem is that there is a lot of "computer related" verbiage in there that gets indexed, and searching for something like "verisign" or "computer" or "microsoft" produces a hit on EVERY indexed pdf. My current method of elimination is locating the first 'Copyright ' and getting rid of everything after that. I am however concerned someone might actually have a 'Copyright ' in their pdf and cause content to be missed. Thoughts?

Submitted by philipnorton42 on Thu, 02/04/2010 - 21:36

Glad you found it useful! As to your question, it's the basic issue of relevance that is a major problem that every search engine must overcome. Major search engines use some sort of duplicate content filter to strip out some very common items. The main difference is that they probably have lots more computer power (and time) than you or me. One idea you might want to think about is to compile a list of phrases that you would tell the indexer to ignore. These would be phrases in order to get past the issue described above where you would pick out single words. You could even use some regular expression to reduce the workload. For example, to match something like "Copyright Company name 2010." use something like this: Copyright .* 20\d\d\. You can pass an array of patterns to a single call to preg_replace() in order to replace them all with nothing. Let me know what you come up with. :)
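The ignore-list idea from this exchange could be sketched roughly as follows. The patterns and sample text are illustrative assumptions only, and the copyright pattern is made non-greedy here so that it stops at the first matching year. Note also that pdf2txt() replaces punctuation with spaces, so a pattern ending in a literal full stop would have to be applied before that clean-up step, or have the full stop dropped.

```php
<?php
// Hypothetical ignore list; these patterns are examples, not a tested set.
$ignorePatterns = array(
    '/Copyright .*? 20\d\d\./',  // e.g. "Copyright Company name 2010."
    '/All rights reserved\.?/i',
);

$text = 'Annual report. Copyright Example Ltd 2010. All rights reserved. Sales grew.';

// A single call handles every pattern; the one replacement string
// (empty here) is used for all of them.
$clean = preg_replace($ignorePatterns, '', $text);

// Collapse the gaps left behind by the removed phrases.
$clean = trim(preg_replace('/\s+/', ' ', $clean));

echo $clean, PHP_EOL; // Annual report. Sales grew.
```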

substr($chunk["filter"], "FlateDecode") causes an error. Any ideas? Is this a valid substr-command?

expects parameter 2 to be long, string given in

 

Submitted by philipnorton42 on Mon, 04/11/2011 - 14:45

No, that was incorrect, it should have been a call to strpos() instead of substr(). I have corrected that now.
