PHP

Posts about the server-side scripting language PHP

Zend Lucene And PDF Documents Part 2: PDF Data Extraction

Last time we looked at viewing and saving meta data to PDF documents using Zend Framework. The next step, before we try to index them with Zend Lucene, is to extract the data out of the documents themselves. I should note here that we can't extract the data perfectly from every PDF document, and we certainly can't turn any images or tables in the PDF into recognisable text. Extracting the text is tricky because we are essentially looking at compressed data. The text isn't saved into the document as plain text; it is rendered into the document using a font. So what we need to do is extract this data into some format that Lucene can tokenize. Because we are just getting the text out of the document for our search index we can take a few short-cuts in order to get as much textual data out of the document as possible. Not all of this data will be fully readable and we will definitely lose any formatting and images, but for our purposes we don't really need them. The idea is to retrieve as much relevant and indexable content as possible for Zend Lucene to tokenize. Also, it is not possible to extract the data out of encrypted PDF documents.

Zend Lucene And PDF Documents Part 1: PDF Meta Data

Zend Lucene is a powerful search engine, but it does take a bit of setting up to get it working properly. One thing that I have had trouble getting up and running in the past is indexing and searching PDF documents. The difficulty here is that it isn't immediately apparent how you can index the contents of a PDF document with ease. I came across a couple of functions you can try out, but even if they don't work it is possible to create and edit PDF meta data using the Zend_Pdf library. Because there is a lot to cover on this subject I thought I would create a blog post in multiple parts. For this post I will be looking at how to add and edit this meta data. This meta data can be used to classify your PDF documents, allowing you to index them and provide a decent search solution using Zend Lucene.

Cross Platform Directory Slashes In PHP

I'm not sure where I found this, but I have been using it on a few projects recently and it's helped a lot. It basically detects what system you are on and will give you a constant that keeps hold of the slash for that system.

if (strtoupper(substr(PHP_OS,0,3)) == 'WIN') {
 // Windows
 define('SLASH', '\\');
} else {
 // Linux/Unix 
 define('SLASH', '/');
}

For example, on a Windows system a file might be in C:\folders\data\, whereas on Linux the file would be in /folders/data/. So if you are given the full path as a string it can be difficult to separate the filename from the directory without knowing what system you are on.
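As a rough illustration of why the separator matters, the sketch below splits a full path into its directory and filename using whichever slash it is given. The splitPath() function is a hypothetical helper and not part of the original snippet. In a real project you would pass in the SLASH constant defined above, or PHP's built-in DIRECTORY_SEPARATOR constant, which holds the same value.

```php
<?php
// Hypothetical helper (not from the original post): split a full path
// into its directory and filename using the given separator.
function splitPath($path, $slash)
{
    // Find the last separator in the path.
    $position = strrpos($path, $slash);
    if ($position === false) {
        // No separator found, so the whole string is the filename.
        return array('', $path);
    }
    return array(
        substr($path, 0, $position + 1), // directory, including the trailing slash
        substr($path, $position + 1)     // filename
    );
}

// The separator is passed explicitly here so the example runs on any system.
list($directory, $filename) = splitPath('C:\\folders\\data\\report.txt', '\\');
echo $directory . PHP_EOL; // C:\folders\data\
echo $filename . PHP_EOL;  // report.txt
```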

PHP instanceof Operator

The instanceof operator is used in PHP to find out if an object is an instance of a class. It's quite easy to use and works in the same sort of way as other operators. This can be useful for controlling objects in large applications, as you can make sure that a parameter is an instance of a particular class before using it. Let's create a couple of classes as examples.

class Shape
{
}

class Circle extends Shape
{
}

Now, to find out if a variable is an instance of the Shape class create the object and then use the instanceof operator on the object. The following code returns true.
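A minimal sketch of such a check might look like the following. The two classes are repeated so that the example runs on its own, and the variable names are assumptions made for illustration.

```php
<?php
class Shape
{
}

class Circle extends Shape
{
}

$circle = new Circle();

// A Circle is a Circle, and also a Shape because it extends Shape.
var_dump($circle instanceof Circle); // bool(true)
var_dump($circle instanceof Shape);  // bool(true)

// A plain Shape is not a Circle.
$shape = new Shape();
var_dump($shape instanceof Circle);  // bool(false)
```

Note that the check follows inheritance in one direction only: a child object counts as an instance of its parent class, but not the other way around.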

PHP: The Second Bracket Is Optional

When writing files that contain only PHP code, such as class or function files, you might have learnt to write them something like this:

<?php 
class Users
{
}
?>

However, did you know that the second bracket (the closing ?> tag) is optional? The following class file is perfectly legal:

<?php 
class Users
{
}

This practice is actually a good thing to do for a very good reason: it stops any stray white space after the closing tag from being sent as output, which can cause "headers already sent" errors. In fact, leaving out the closing tag is part of the Zend Framework coding standard for this very reason, so it is a good habit to get into.

Advanced Use Of PHP Function strtotime()

Finding the next day of the week from a given date can involve some complicated loops and if statements. In PHP it is made quite easy through the use of the strtotime() function. This function, which has been part of the PHP core since version 4, can take just about any string representation of the time and convert it into a Unix timestamp.

The most common use of strtotime() is to convert a string into a time. Here are some random examples. The first two convert a date and the last two print more or less the same thing, but (obviously) will change with time. I say more or less as the "now" parameter will return the current timestamp whereas the "today" parameter will return the timestamp from the beginning of the current day.
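The sketch below shows the kind of calls described above. The dates are arbitrary choices made for illustration, and a fixed base timestamp is passed as the second argument to strtotime() so that the output does not change from day to day. Without that second argument, strtotime() works relative to the current time.

```php
<?php
// Fix the timezone so the results do not depend on server configuration.
date_default_timezone_set('UTC');

// Convert two date strings into Unix timestamps.
$first = strtotime('10 June 2009');
$second = strtotime('2009-06-10 15:30:00');
echo date('Y-m-d H:i:s', $first) . PHP_EOL;  // 2009-06-10 00:00:00
echo date('Y-m-d H:i:s', $second) . PHP_EOL; // 2009-06-10 15:30:00

// "now" keeps the time of day, whereas "today" resets it to midnight.
echo date('Y-m-d H:i:s', strtotime('now', $second)) . PHP_EOL;   // 2009-06-10 15:30:00
echo date('Y-m-d H:i:s', strtotime('today', $second)) . PHP_EOL; // 2009-06-10 00:00:00

// Find the next Friday relative to Wednesday 10 June 2009.
$nextFriday = strtotime('next Friday', $second);
echo date('l j F Y', $nextFriday) . PHP_EOL; // Friday 12 June 2009
```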

Password Validation Class In PHP

When validating a password it is easy enough to make sure that the password is of a certain length, but what happens if you want to make sure that the password has at least one number, or contains a mixture of upper and lowercase letters? I recently had to validate a password like this and so I created a password validation class that allows easy validation of a password string against a set of given parameters, but also allows these parameters to be changed when needed.
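To give an idea of the sort of checks involved, here is a minimal sketch written as a single function. This is not the downloadable Password class and makes no claim about how that class works. The function name validatePassword(), the rule names and the default values are all assumptions made for illustration.

```php
<?php
// Hypothetical helper, not the downloadable Password class: check a
// password against a small set of adjustable rules.
function validatePassword($password, array $rules = array())
{
    // Default rules (assumed values); any of them can be overridden by the caller.
    $rules = array_merge(array(
        'min_length'    => 8,
        'require_digit' => true,
        'require_mixed' => true,
    ), $rules);

    $errors = array();
    if (strlen($password) < $rules['min_length']) {
        $errors[] = 'Password is too short.';
    }
    if ($rules['require_digit'] && !preg_match('/[0-9]/', $password)) {
        $errors[] = 'Password must contain at least one number.';
    }
    if ($rules['require_mixed']
        && !(preg_match('/[a-z]/', $password) && preg_match('/[A-Z]/', $password))) {
        $errors[] = 'Password must mix upper and lowercase letters.';
    }

    // An empty array means the password passed every check.
    return $errors;
}

var_dump(validatePassword('Secret123') === array()); // bool(true)
print_r(validatePassword('secret'));                 // prints three error messages
```

Returning a list of messages rather than a simple true or false lets the calling code tell the user exactly which rule failed.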

Download the Password class here.

The Password class takes a set of parameters, which can be altered at runtime, and uses these to validate a password. To run the password validator with default parameters do the following:

Using list() With explode() In PHP

A simple way to convert a string into a set of variables is through the use of the explode() and list() functions. list() is a language construct (not really a function) that will convert an array into a list of variables. For example, to convert a simple array into a set of variables do the following:

list($variable1, $variable2) = array(1, 2);

In this example $variable1 now contains the value 1 and $variable2 contains the value 2. This can be adapted to use the explode() function to take a string and convert it into a set of variables. An example of this in use might be when dealing with addresses: simply explode the string on the comma, trim the spaces left behind, and you have a set of variables.

$address = '123 Fake Street, Town, City, PO3T C0D3';
// trim() removes the space that follows each comma
list($street, $town, $city, $postcode) = array_map('trim', explode(',', $address));

You can now print out parts of the address like this:
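A short sketch of what this might look like is given below. The address is repeated so that the example runs on its own, and each part is trimmed so that the space after every comma is not kept.

```php
<?php
$address = '123 Fake Street, Town, City, PO3T C0D3';
// Split on the comma, then trim the leading space from each part.
list($street, $town, $city, $postcode) = array_map('trim', explode(',', $address));

echo $street . PHP_EOL;   // 123 Fake Street
echo $town . PHP_EOL;     // Town
echo $postcode . PHP_EOL; // PO3T C0D3
```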

Google Last Cached Date Finder In PHP

When Google looks at a page it takes a snapshot of that page and uses this to match against the query a user entered. To view these cached pages run a Google search and look at the Cached link next to the green URL text of the result. When you view the cached page Google will also give you the date that the page was last cached on. This can be used as a metric of your site's importance, as the more often the site is cached, the more favourably Google views your page.

Taking a reading of this metric can therefore be useful, which is why I set about creating a class to retrieve this result.

Download the Google Cache class here.

Setting Locale In Zend Framework

Every application has a locale, even if that is just the locale of the author. Through the use of locales you can make your application aware of what sort of language, currency and even timezone the user would like to see. In Zend Framework this is accomplished via Zend_Locale.

There are many things you can do with a locale, but first you need to determine where the user is based. To find this out you simply create a new instance of the Zend_Locale object. The following code will create the Zend_Locale object and print out the language and region of the user.

$locale = new Zend_Locale();
$language = $locale->getLanguage();
$region = $locale->getRegion();
echo $language . ' ' . $region;

What you see here depends on where you are and what language you are running on your machine. For example, a person living in Germany who speaks German would see the following output.