Excel Document Scanning With Zend_Search_Lucene

×

Error message

Warning: apc_store(): Potential cache slam averted for key 'www.hashbangcode.com::cache_bootstrap::variables' in DrupalAPCCache->set() (line 206 of /var/www/www.hashbangcode.com/2014-03-22-2105/docroot/sites/all/modules/contrib/apc/drupal_apc_cache.inc).

Monday, May 11, 2009 - 10:25

Zend_Search_Lucene offers some powerful document scanning capabilities, and there are a few different formats that are useful for the search engine to index.

To allow the indexing and searching of Excel documents using Zend_Search_Lucene you need to use the Zend_Search_Lucene_Document_Xlsx class. However, to use this class you must have the Zip module installed with PHP. For Windows users this means editing your php.ini file and uncommenting the following line:

extension=php_zip.dll

For Linux users you will need to recompile PHP with the --enable-zip configure option.

Create and/or open the index in the normal way and you can index Excel documents using the following code.

1
2
3
$filename = 'C:\Book1.xlsx';
$doc = Zend_Search_Lucene_Document_Xlsx::loadXlsxFile($filename);
$index->addDocument($doc);

You can now set up a query and search for the document in the following way, although you would normally expect the input string to be some kind of user input.

1
2
3
4
5
6
7
8
9
10
11
12
$queryStr = 'wibble';
$userQuery = Zend_Search_Lucene_Search_QueryParser::parse($queryStr);
 
$query = new Zend_Search_Lucene_Search_Query_Boolean();
$query->addSubquery($userQuery, true);
 
$hits = $index->find($query);
 
foreach ( $hits as $hit ) {
    echo $hit->score.'<br />';
    echo $hit->filename.'<br />';
}

The score is always returned with a hit object. Other parameters available to display are filename, title, subject, creator, keywords, description, lastModifiedBy, revision, modified, created. However, some of these depend on the contents of the document. It is possible to add keywords and subjects to an Excel document, so if they are not present then you will need to check for the existence of that parameter before displaying it. The following code looks for the existence of the keyword parameter before trying to print it out.

1
2
3
if ( isset($hit->keywords) ) {
    echo $hit->keywords.'<br />';
}

By default, this function indexes the document meta data and will tokenise and store the tokens as an index. The loadXlsxFile() function has a second optional parameter which is by default set to false. If this is set to true the contents of the Excel document will be included in the index. You can then use the following code to print out the contents of the document.

echo $hit->body.'<br />';

Bear in mind that this output will not contain any row or column information and will therefore look like a dump of the data.

Category: 
philipnorton42's picture

Philip Norton

Phil is the founder and administrator of #! code and is an IT professional working in the North West of the UK.
Google+ | Twitter

Comments

I have been looking for a simple way to code the zend_framework for my excel documents and although other sites go into more detail (and more technical jargon), this article is by far the best and most simple to follow.

Mnay thanks

Carl

Add new comment