Articles

Zend Lucene And PDF Documents Part 2: PDF Data Extraction

Last time we looked at viewing and saving meta data to PDF documents using Zend Framework. The next step, before we try to index them with Zend Lucene, is to extract the data out of the documents themselves. I should note here that we can't extract the data perfectly from every PDF document, and we certainly can't turn any images or tables in the PDF into recognisable text. Extracting the text is awkward because we are essentially looking at compressed data. The text isn't saved into the document as plain text; it is rendered into the document using a font. So what we need to do is extract this data into a format that Lucene can tokenize. Because we are only getting the text out of the document for our search index, we can take a few short-cuts in order to get as much textual data out as possible. Not all of this data will be fully readable, and we will definitely lose any formatting and images, but for our purposes we don't really need them. The idea is to retrieve as much relevant and indexable content as possible for Zend Lucene to tokenize. Also, it is not possible to extract the data out of encrypted PDF documents.
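As a rough illustration of the "compressed data" point, here is a minimal sketch that inflates the content streams in a PDF. It assumes the streams use the common FlateDecode (zlib) filter, that the stream markers are followed by plain "\n" line endings, and that the zlib extension is loaded. The function name extractPdfStreams is hypothetical and this is not necessarily the approach the full article takes.

```php
<?php
// Hypothetical helper: returns every stream in the PDF data that
// can be inflated with zlib. Streams using other filters, and
// encrypted documents, are skipped.
function extractPdfStreams(string $pdfData): array
{
    $streams = [];
    // Assumes "stream\n ... \nendstream"; real files may use "\r\n".
    if (preg_match_all('/stream\n(.*?)\nendstream/s', $pdfData, $matches)) {
        foreach ($matches[1] as $raw) {
            $inflated = @gzuncompress($raw);
            if ($inflated !== false) {
                $streams[] = $inflated;
            }
        }
    }
    return $streams;
}

// Usage with a hand-built fragment rather than a real PDF file.
$fragment = "stream\n" . gzcompress('BT (Hello PDF) Tj ET') . "\nendstream\n";
print_r(extractPdfStreams($fragment));
```

The inflated stream still contains PDF drawing operators around the text, so further clean-up is needed before anything is handed to the indexer.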

Zend Lucene And PDF Documents Part 1: PDF Meta Data

Zend Lucene is a powerful search engine, but it does take a bit of setting up to get it working properly. One thing that I have had trouble getting up and running in the past is indexing and searching PDF documents. The difficulty here is that it isn't immediately apparent how you can index the contents of a PDF document with ease. I came across a couple of functions you can try out, but even if they don't work it is possible to create and edit PDF meta data using the Zend_Pdf library. Because there is a lot to cover on this subject I thought I would create a blog post in multiple parts. For this post I will be looking at how to add and edit this meta data. This meta data can be used to classify your PDF documents, allowing you to index them and provide a decent search solution using Zend Lucene.

Cross Platform Directory Slashes In PHP

I'm not sure where I found this, but I have been using it on a few projects recently and it's helped a lot. It basically detects what system you are on and will give you a constant that keeps hold of the slash for that system.

if (strtoupper(substr(PHP_OS,0,3)) == 'WIN') {
 // Windows
 define('SLASH', '\\');
} else {
 // Linux/Unix 
 define('SLASH', '/');
}

For example, on a Windows system a file might be in C:\folders\data\, whereas on Linux the file would be in /folders/data/. So if you are given the full path as a string it can be difficult to separate the filename from the directory without knowing what system you are on.
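As a minimal sketch of that split, the function below separates a full path into its directory and filename using whichever slash is passed in. The name splitPath is hypothetical. In real code you would pass the SLASH constant defined above, or PHP's built-in DIRECTORY_SEPARATOR constant, which holds the same value.

```php
<?php
// Hypothetical helper: split a full path into [directory, filename]
// using the given slash character.
function splitPath(string $path, string $slash): array
{
    $pos = strrpos($path, $slash);
    if ($pos === false) {
        // No slash at all, so the whole string is the filename.
        return ['', $path];
    }
    return [substr($path, 0, $pos + 1), substr($path, $pos + 1)];
}

print_r(splitPath('C:\\folders\\data\\file.txt', '\\'));
print_r(splitPath('/folders/data/file.txt', '/'));
```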

Wordpress DoS Attack Script Solution

There is a script knocking about on the internet at the moment that allows an attacker to run some code that will bring your Wordpress blog to its knees. This will more than likely cause your host to get annoyed as well.

What it does is perform a trackback request to the file wp-trackback.php, but it sends a massive (over 200,000 characters) string that Wordpress will take at face value and accept as a legitimate trackback. The first time this is run Wordpress will write it to the database, but every time after that it will run a select query to see if the trackback exists. Even though this isn't a legitimate trackback, Wordpress will still process it on every request, causing a massive overhead as each large string is processed.

One solution is simply to block all access to the offending file with an Apache rule in your .htaccess file.
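A minimal sketch of such a rule, assuming Apache 2.4 syntax. On Apache 2.2 the equivalent directives inside the Files block would be "Order allow,deny" followed by "Deny from all". This may differ from the rule given in the full article.

```apache
# Refuse every request for the trackback script.
<Files "wp-trackback.php">
    Require all denied
</Files>
```

Bear in mind that this blocks legitimate trackbacks as well as the malicious ones.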

PHPNW09 A Review

Last weekend saw the second annual PHPNW conference, and it was an excellent conference. There were some 200 people attending the event and we got to see some interesting and informative talks. When I arrived I received a bag with some brochures in it as well as a KitKat (which I ate for breakfast) and a year's subscription to PHP|Architect. Everyone at the conference was also fed very well for lunch and dinner, and Sun sponsored a free bar at the end of the first day, which was nice.

What I thought I'd do is go through each of the talks that I attended and copy in my responses from the joind.in reviews that I have been posting during the week, but also embellish them with further thoughts and comments. Also, joind.in seem to have deleted one or two of my reviews so I will have to write them from scratch anyway.

PHP instanceof Operator

The instanceof operator is used in PHP to find out if an object is an instance of a class. It's quite easy to use and works in the same sort of way as other operators. This can be useful for controlling objects in large applications, as you can make sure that a parameter is an instance of a particular class before using it. Let's create a couple of classes as examples.

class Shape
{
}

class Circle extends Shape
{
}

Now, to find out if a variable is an instance of the Shape class, create the object and then use the instanceof operator on the object. The following code returns true.
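A minimal sketch of that check, with the two classes repeated so the snippet runs on its own. Because Circle extends Shape, a Circle object also counts as an instance of Shape, but not the other way round.

```php
<?php
class Shape
{
}

class Circle extends Shape
{
}

$circle = new Circle();
$shape  = new Shape();

var_dump($circle instanceof Shape);  // bool(true), Circle inherits from Shape
var_dump($circle instanceof Circle); // bool(true)
var_dump($shape instanceof Circle);  // bool(false), a plain Shape is not a Circle
```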

PHP: The Second Bracket Is Optional

When writing PHP class or function files (basically any file containing only PHP code) you might have learnt to write them something like this:

<?php 
class Users
{
}
?>

However, did you know that the second bracket (the closing ?> tag) is optional? The following class file is perfectly legal:

<?php 
class Users
{
}

This practice is actually a good thing to do for a very good reason: it stops any white space after the closing tag at the bottom of your files from being sent as output, which can cause header errors. In fact, leaving out the closing tag is part of the Zend Framework coding standard for this very reason, so it is a good habit to get into.

MySQL Order Table By Character Length

As part of debugging a bit of code I needed to know the longest values that a field contains across all records. You might need to know this if you are performing a database migration. The following query returns a field, along with the length of the string, and orders the results by the number of characters in that string. Replace field and table_name with your own column and table names.

SELECT field, CHARACTER_LENGTH(field) AS fieldCharacterCount
FROM table_name
ORDER BY fieldCharacterCount DESC;

 

Advanced Use Of PHP Function strtotime()

Finding the next day of the week from a given date can involve some complicated loops and if statements. In PHP it is made quite easy through the use of the strtotime() function. This function, which has been part of the PHP core since version 4, can take just about any string representation of a time and convert it into a Unix timestamp.
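As a minimal sketch of the "next day of the week" case, strtotime() accepts a relative string such as "next monday" along with an optional base timestamp as its second argument. The date and the UTC timezone below are arbitrary choices for illustration.

```php
<?php
// Fix the timezone so the result does not depend on server settings.
date_default_timezone_set('UTC');

// 14 October 2009 was a Wednesday.
$base = strtotime('2009-10-14');

// Ask for the next Monday relative to the base date, no loops required.
$nextMonday = strtotime('next monday', $base);

echo date('l jS F Y', $nextMonday), "\n"; // Monday 19th October 2009
```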

The most common use of strtotime() is to convert a string into a time. Here are some random examples. The first two convert a date and the second two print more or less the same value, but (obviously) will change with time. I say more or less as the "now" parameter will return the current timestamp, whereas the "today" parameter will return the timestamp from the beginning of the current day.
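A minimal sketch along those lines follows. The dates are arbitrary and the timezone is fixed to UTC as an assumption, so these are not necessarily the examples from the full article.

```php
<?php
date_default_timezone_set('UTC');

// Two fixed dates converted to Unix timestamps.
$first  = strtotime('14 October 2009');
$second = strtotime('2009-10-14 15:30:00');

// "now" is the current moment, "today" is midnight at the start of today.
$now   = strtotime('now');
$today = strtotime('today');

echo date('Y-m-d H:i:s', $first), "\n";  // 2009-10-14 00:00:00
echo date('Y-m-d H:i:s', $second), "\n"; // 2009-10-14 15:30:00
echo date('Y-m-d H:i:s', $now), "\n";    // current date and time
echo date('Y-m-d H:i:s', $today), "\n";  // current date at 00:00:00
```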

Password Validation Class In PHP

When validating a password it is easy enough to make sure that the password is of a certain length, but what happens if you want to make sure that the password has at least one number, or contains a mixture of upper and lowercase letters? I recently had to validate a password like this, so I created a password validation class that allows easy validation of a password string against a set of given parameters, and also allows these parameters to be changed when needed.
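To give a feel for the kind of checks involved, here is a minimal sketch written as a plain function. It is not the downloadable Password class: the function name validatePassword and the option names are hypothetical and chosen only for illustration.

```php
<?php
// Hypothetical helper: returns a list of rule failures, which is
// empty when the password passes every enabled rule.
function validatePassword(string $password, array $options = []): array
{
    // Defaults apply to any option the caller did not set.
    $options += [
        'minLength'        => 8,
        'requireNumber'    => true,
        'requireMixedCase' => true,
    ];

    $errors = [];
    if (strlen($password) < $options['minLength']) {
        $errors[] = 'too short';
    }
    if ($options['requireNumber'] && !preg_match('/[0-9]/', $password)) {
        $errors[] = 'needs a number';
    }
    if ($options['requireMixedCase']
        && !(preg_match('/[a-z]/', $password) && preg_match('/[A-Z]/', $password))) {
        $errors[] = 'needs mixed case';
    }
    return $errors;
}

print_r(validatePassword('password'));  // fails the number and mixed case rules
print_r(validatePassword('Passw0rd1')); // passes, prints an empty array
```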

Download the Password class here.

The Password class takes a set of parameters, which can be altered at runtime, and uses these to validate a password. To run the password validator with default parameters do the following: