Zend Lucene And PDF Documents Part 5: Conclusion

Note: This post is over two years old and so the information contained here might be out of date. If you do spot something please leave a comment and we will endeavour to correct.

17th November 2009 - 6 minutes read time

If you have been following the last four posts you should now have an application that will allow you to view and edit PDF metadata, extract the document contents for search indexing, and allow users to search that index.

The one final thing to do is to sort out what happens when any PDF metadata is changed. At the moment the application will allow us to change the metadata as much as we like, but these changes will not be replicated in our search index unless we fully re-index everything. This is obviously the wrong way to go about things, and the solution is quite simple. All we need to do is open up the file controllers/PdfController.php and change the editmetaAction() method so that when the PDF metadata is saved, the search index is updated. Add the following code to the editmetaAction() method, just before the redirect.

// Add to/update index.
$config = Zend_Registry::get('config');
$appLucene = App_Search_Lucene::open($config->luceneIndex);
$index = App_Search_Lucene_Index_Pdfs::index($pdfPath, $appLucene);

This will mean that when the metadata of any PDF is changed, so is the search index for that file.
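Zend_Search_Lucene has no way of modifying an indexed document in place, so "updating" a file's entry really means removing the stale document and adding a fresh one. The sketch below illustrates that pattern with a toy in-memory index. ToyIndex and reindexPdf() are hypothetical names invented for this example. They are not part of Zend Framework or of this project, and whether App_Search_Lucene_Index_Pdfs::index() works exactly this way is an assumption.

```php
<?php
// Toy stand-in for a search index. This is NOT Zend_Search_Lucene; it only
// illustrates the "delete the stale entry, then add the fresh one" pattern.
final class ToyIndex
{
    /** @var array<int, array{path: string, title: string}> */
    private array $documents = [];
    private int $nextId = 0;

    /** Adds a document and returns its id. */
    public function add(string $path, string $title): int
    {
        $id = $this->nextId++;
        $this->documents[$id] = ['path' => $path, 'title' => $title];
        return $id;
    }

    /** Returns the ids of every document indexed for the given path. */
    public function findByPath(string $path): array
    {
        $ids = [];
        foreach ($this->documents as $id => $document) {
            if ($document['path'] === $path) {
                $ids[] = $id;
            }
        }
        return $ids;
    }

    public function delete(int $id): void
    {
        unset($this->documents[$id]);
    }

    public function count(): int
    {
        return count($this->documents);
    }

    /** Returns the titles of all indexed documents. */
    public function titles(): array
    {
        return array_column($this->documents, 'title');
    }
}

// Hypothetical helper: replace whatever is indexed for $path with new metadata.
function reindexPdf(ToyIndex $index, string $path, string $title): int
{
    foreach ($index->findByPath($path) as $staleId) {
        $index->delete($staleId);
    }
    return $index->add($path, $title);
}

$index = new ToyIndex();
reindexPdf($index, 'docs/report.pdf', 'Draft report');
reindexPdf($index, 'docs/report.pdf', 'Final report'); // metadata was edited

echo $index->count(), PHP_EOL; // 1: the file has one entry, not two
```

Without the delete step, every metadata edit would leave another copy of the same PDF in the index and searches would return duplicate hits.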

Adding documents to the Lucene index will cause the index to fragment into many small segments, which slows down future searching and indexing. This essentially happens because a new segment is created when a new document is added to the index. To fix this, Zend_Search_Lucene comes with a method called optimize(), which merges all of these segments into a single segment. Running this method is quite simple: just open the Lucene index and call the optimize() method.

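// Open the existing Lucene index using the location stored in the application config.
$config = Zend_Registry::get('config');
$index = App_Search_Lucene::open($config->luceneIndex);

// Optimize index: merge all of the segments into a single segment.
$index->optimize();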

This method can be run at any time, but because it's very resource intensive it is best not to run it every time you update the index. When to run it depends on how often you update your index. If you are only updating your search index once a week or so then run it once a week. If you are updating the index every few minutes then you should probably run optimize() either at set periods of the day, or when the site has a low level of traffic.
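One way to turn that advice into code is a small check that a scheduled script could run before deciding to optimize. This is only a sketch: shouldOptimize() is a hypothetical helper, and the thresholds (50 updates, a quiet window between 02:00 and 05:00) are illustrative assumptions, not values taken from the project.

```php
<?php
/**
 * Decide whether an expensive optimize() run is worthwhile right now.
 *
 * Hypothetical helper: every default below is an illustrative assumption.
 *
 * @param int $updatesSinceLastOptimize Index updates since the last optimize.
 * @param int $hourOfDay                Current hour, 0 to 23.
 * @param int $minUpdates               Skip optimizing below this many updates.
 * @param int $quietFrom                First hour of the low-traffic window (inclusive).
 * @param int $quietUntil               End of the low-traffic window (exclusive).
 */
function shouldOptimize(
    int $updatesSinceLastOptimize,
    int $hourOfDay,
    int $minUpdates = 50,
    int $quietFrom = 2,
    int $quietUntil = 5
): bool {
    $enoughChanges = $updatesSinceLastOptimize >= $minUpdates;
    $inQuietWindow = $hourOfDay >= $quietFrom && $hourOfDay < $quietUntil;

    return $enoughChanges && $inQuietWindow;
}

var_dump(shouldOptimize(120, 3));  // bool(true): many updates, quiet hour
var_dump(shouldOptimize(120, 14)); // bool(false): many updates, busy hour
var_dump(shouldOptimize(3, 3));    // bool(false): quiet hour, too few updates
```

A cron script could open the index as shown above and only call $index->optimize() when this check returns true. For a site that updates its index once a week, simply optimizing straight after the weekly update is probably enough.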

To obtain the source code for this project please visit the PDFSearch repository on GitHub and download it from there. To keep down the size of the download I have removed the Zend Framework from the library directory. To get the project up and running just download the framework from framework.zend.com and put the Zend directory into the library directory. You will also need to open up the file application/configs/application.ini and change the references to your Lucene and file directory locations.
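If the application misbehaves after changing those settings, it can help to confirm that the configured directories exist and are writable. The snippet below is a rough sketch of such a check. findDirectoryProblems() is a hypothetical helper, the luceneIndex key is taken from the $config->luceneIndex call used earlier, and pdfDirectory is a made-up name standing in for whatever key the project uses for its file directory.

```php
<?php
/**
 * Report configured directories that are missing or not writable.
 *
 * Hypothetical helper for illustration; it is not part of the project.
 *
 * @param array<string, string> $directories Setting name => directory path.
 * @return string[] One human-readable message per problem found.
 */
function findDirectoryProblems(array $directories): array
{
    $problems = [];
    foreach ($directories as $name => $path) {
        if (!is_dir($path)) {
            $problems[] = "$name: '$path' is not a directory";
        } elseif (!is_writable($path)) {
            $problems[] = "$name: '$path' is not writable";
        }
    }
    return $problems;
}

$directories = [
    // Key name taken from the post's $config->luceneIndex.
    'luceneIndex' => sys_get_temp_dir(),
    // Hypothetical key name and placeholder path; use your own setting.
    'pdfDirectory' => '/path/that/does/not/exist',
];

foreach (findDirectoryProblems($directories) as $problem) {
    echo $problem, PHP_EOL;
}
```

In a real setup the paths would come from application.ini rather than being hard-coded as they are here.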

I have used GitHub to make it easier for me to update the project, but also to allow other users to fork it into their own projects and continue working on it. Remember that this application just covers the basic things needed to get PDF metadata editing, searching and indexing up and running. It is by no means a fully complete application and should not be used as such.

If you would like to know more about Zend_Search_Lucene then I would suggest buying Zend Framework In Action. This is an excellent introduction to the subject as well as to Zend Framework itself. Some of the code I have used in this project was based upon the examples in this book.

Of course you can also look at the excellent documentation regarding Zend_Search_Lucene on the Zend Framework documentation website.


Comments

Many thanks for these explanations and the code - much appreciated.

Submitted by Rico on Tue, 04/19/2011 - 21:38


Thanks for such an excellent tutorial and code. But whenever I'm running the project I'm getting 0 results found even though PDFs are indexed. Please help, what is missing on my side?

Submitted by Akash on Tue, 06/25/2013 - 12:13
