Excel Document Scanning With Zend_Search_Lucene

11th May 2009

Zend_Search_Lucene offers some powerful document scanning capabilities, and there are a few different formats that are useful for the search engine to index.

To allow the indexing and searching of Excel documents using Zend_Search_Lucene you need to use the Zend_Search_Lucene_Document_Xlsx class. However, to use this class you must have the Zip module installed with PHP. For Windows users this means editing your php.ini file and uncommenting the following line:

extension=php_zip.dll

For Linux users you will need to recompile PHP with the --enable-zip configure option.

Create and/or open the index in the normal way and you can index Excel documents using the following code.

  1. $filename = 'C:\Book1.xlsx';
  2. $doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
  3. $index->addDocument($doc);

You can now set up a query and search for the document in the following way, although you would normally expect the input string to be some kind of user input.

  1. $queryStr = 'wibble';
  2. $userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
  3.  
  4. $query = new Zend_Search_Lucene_Search_Query_Boolean();
  5. $query->addSubquery($userQuery, true);
  6.  
  7. $hits = $index->find($query);
  8.  
  9. foreach ( $hits as $hit ) {
  10. echo $hit->score.'<br />';
  11. echo $hit->filename.'<br />';
  12. }

The score is always returned with a hit object. Other parameters available to display are filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, created. However, some of these depend on the contents of the document. It is possible to add keywords and subjects to an Excel document, so if they are not present then you will need to check for the existence of that parameter before displaying it. The following code looks for the existence of the keyword parameter before trying to print it out.

  1. if ( isset($hit->keywords) ) {
  2. echo $hit->keywords.'<br />';
  3. }

By default, this function indexes the document meta data and will tokenise and store the tokens as an index. The loadXlsxFile() function has a second optional parameter which is by default set to false. If this is set to true the contents of the Excel document will be included in the index. You can then use the following code to print out the contents of the document.

echo $hit->body.'<br />';

Bear in mind that this output will not contain any row or column information and will therefore look like a dump of the data.

Comments

Permalink
I have been looking for a simple way to code the zend_framework for my excel documents and although other sites go into more detail (and more technical jargon), this article is by far the best and most simple to follow. Mnay thanks Carl

Carl Smith (Tue, 02/19/2013 - 12:14)

Add new comment

The content of this field is kept private and will not be shown publicly.