Zend Lucene And PDF Documents Part 5: Conclusion

If you have been following the last four posts you should now have an application that lets you view and edit PDF metadata, extracts the document contents for search indexing, and lets users search that index.

The one final thing to do is to sort out what happens when any PDF metadata is changed. At the moment the application will allow us to change the metadata as much as we like, but these changes will not be replicated in our search index. The only way to bring the index up to date is to fully re-index everything. This is obviously the wrong way to go about things, and the solution is quite simple. All we need to do is open up the file controllers/PdfController.php and change the editmetaAction() method so that when the PDF metadata is saved, the search index is updated. Add the following code to the editmetaAction() method, just before the redirect.

// Add to/update index.
$config = Zend_Registry::get('config');
$appLucene = App_Search_Lucene::open($config->luceneIndex);
$index = App_Search_Lucene_Index_Pdfs::index($pdfPath, $appLucene);

This will mean that when the metadata of any PDF is changed, so is the search index for that file.
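
Zend_Search_Lucene has no in-place update, so refreshing a document normally means deleting the old entry and adding a new one. Whether App_Search_Lucene_Index_Pdfs::index() already does this depends on how it was written in the earlier posts, so treat the following as a minimal sketch of the pattern rather than the project's actual code. The function name replaceDocument() and its parameters are hypothetical; find(), delete(), addDocument() and commit() are the standard Zend_Search_Lucene index methods.

```php
<?php
/**
 * Replace every indexed document matching $query with $doc.
 *
 * Hypothetical helper: $index is expected to behave like a
 * Zend_Search_Lucene index, $query is a query string or query object
 * that matches only the old copy of the document, and $doc is the
 * freshly built document.
 *
 * Returns the number of old entries that were deleted.
 */
function replaceDocument($index, $query, $doc)
{
    $removed = 0;

    // Remove the stale entries first so the file is not indexed twice.
    foreach ($index->find($query) as $hit) {
        $index->delete($hit->id);
        $removed++;
    }

    // Add the new version and write the changes to the index.
    $index->addDocument($doc);
    $index->commit();

    return $removed;
}
```

With the real library, the query should match only the old copy of the document. An exact-match term query on a unique keyword field is probably the safest choice, since the default analyzer may split values such as file paths into several tokens.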

Adding documents to the Lucene index will cause the index to fragment into smaller pieces, which slows down future searching and indexing. This essentially happens because when a new document is added to the index, a new segment is created. To fix this, Lucene comes with a method called optimize(), which will merge all of these index segments into a single segment. Running this method is quite simple: just open the Lucene index and call the optimize() method.

$config = Zend_Registry::get('config');
$index = App_Search_Lucene::open($config->luceneIndex);
 
// Optimize index.
$index->optimize();

This method can be run at any time, but because it's very resource intensive it is best not to run it every time you update the index. When to run it depends on how often you update your index. If you are only updating your search index once a week or so then run it once a week. If you are updating the index every few minutes then you should probably run optimize() either at set periods of the day, or when the site has a low level of traffic.
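
One way to express that rule in code is a small check that a scheduled script can call before optimizing. This is only a sketch: the function name shouldOptimize(), the update counter, the threshold and the quiet hours are all assumptions made for illustration, and you would need to pick values that suit your own site.

```php
<?php
/**
 * Decide whether now is a sensible time to call optimize().
 *
 * Hypothetical policy: optimize only when enough updates have been made
 * since the last optimize AND the current hour falls inside a quiet
 * window with little traffic. All default values are arbitrary.
 *
 * $updatesSinceOptimize  number of index updates since the last optimize
 * $hour                  current hour of the day, 0 to 23
 * $minUpdates            updates needed before optimizing is worthwhile
 * $quietStart            first hour of the quiet window (inclusive)
 * $quietEnd              last hour of the quiet window (exclusive)
 */
function shouldOptimize($updatesSinceOptimize, $hour, $minUpdates = 1, $quietStart = 2, $quietEnd = 5)
{
    // Nothing (or too little) has changed, so optimizing would be wasted work.
    if ($updatesSinceOptimize < $minUpdates) {
        return false;
    }

    // Only run the expensive operation during the quiet window.
    return $hour >= $quietStart && $hour < $quietEnd;
}
```

A script run from cron could then call shouldOptimize($updates, (int) date('G')) and only open the index and call optimize() when it returns true. How the number of updates is tracked is left open here; a counter stored alongside the index would be one option.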

To obtain the source code for this project please visit the PDFSearch repository on GitHub and download it from there. To keep down the size of the download I have removed the Zend Framework from the library directory. To get the project up and running just download the framework from framework.zend.com and put the Zend directory into the library directory. You will also need to open up the file application/configs/application.ini and change the settings for your Lucene index and file directory locations.

I have used GitHub to make it easier for me to update the project, but also to allow other users to fork it into their own projects and continue working on it. Remember that this application just covers the basic things needed to get PDF metadata editing, searching and indexing up and running. It is by no means a complete application and should not be used as such.

If you would like to know more about Zend_Search_Lucene then I would suggest buying Zend Framework In Action. This is an excellent introduction to the subject as well as an excellent introduction to Zend Framework. Some of the code I have used in this project was based upon the examples in this book.

Of course you can also look at the excellent documentation regarding Zend_Search_Lucene on the Zend Framework documentation website.

Comments

Many thanks for these explanations and the code - much appreciated.

Thanks for such an excellent tutorial and code. But whenever I m running the project I m getting 0 results found eventhough pdfs are indexed. Pls help what is missing from my side ?