pdf

Generating A PDF From A Web Page Using PHP And Chrome

8th May 2022 - 16 minutes read time

Generating a PDF document from a web page through PHP can be problematic. It often seems quite simple, but actually generating the document can be difficult and time-consuming.

There are a number of libraries available that allow you to generate PDF documents, but in reality they solve one problem and introduce several more. Packages either convert HTML to PDF or let you build a PDF by placing elements into the document.

Generating a PDF manually is not normally a good approach because it is a very time-consuming process and is difficult to test. You will probably end up building a rendering engine just for this purpose, and that engine then has to be maintained.

Zend Lucene And PDF Documents Part 2: PDF Data Extraction

26th October 2009 - 17 minutes read time

Last time we looked at viewing and saving meta data to PDF documents using Zend Framework. The next step before we try to index them with Zend Lucene is to extract the data out of the documents themselves. I should note here that we can't extract the data perfectly from every PDF document, and we certainly can't turn any images or tables in the PDF into recognisable text. There is a little issue with extracting the text because we are essentially looking at compressed data. The text isn't saved into the document as plain text; it is rendered into the document using a font. So what we need to do is extract this data into some format that Lucene can tokenize. Because we are just getting the text out of the document for our search index, we can take a few short-cuts in order to get as much textual data out of the document as possible. Not all of this data will be fully readable and we will definitely lose any formatting and images, but for our purposes we don't really need them. The idea is to retrieve as much relevant and indexable content as possible for Zend Lucene to tokenize. Also, it is not possible to extract the data out of encrypted PDF documents.
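As a rough illustration of what "compressed data" means here, the following is a minimal sketch in plain PHP, and not necessarily the approach taken in the full post. It assumes the content stream is compressed with FlateDecode (zlib), that /Length is a direct number rather than an indirect reference, and that text is shown with simple `(...) Tj` operators. The function names extractFlateStreams() and extractShownText() are hypothetical, and the tiny "PDF" fragment is built in the script purely so the example runs on its own.

```php
<?php
// Hypothetical helper: finds stream objects in raw PDF bytes and returns
// the ones that inflate cleanly with zlib (i.e. FlateDecode streams).
function extractFlateStreams(string $pdfBytes): array
{
    $streams = [];
    // Assumes a direct "/Length N" entry in the dictionary before "stream".
    $pattern = '/\/Length\s+(\d+)\b[^>]*>>\s*stream\r?\n/';
    preg_match_all($pattern, $pdfBytes, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    foreach ($matches as $match) {
        $length = (int) $match[1][0];
        // The stream body starts straight after the matched "stream" line.
        $start = $match[0][1] + strlen($match[0][0]);
        $body = substr($pdfBytes, $start, $length);

        // gzuncompress() reads zlib-wrapped data; streams that use another
        // filter (or none) simply fail here and are skipped.
        $inflated = @gzuncompress($body);
        if ($inflated !== false) {
            $streams[] = $inflated;
        }
    }

    return $streams;
}

// Hypothetical helper: collects the literal strings passed to the Tj
// (show text) operator. Escaped parentheses, TJ arrays and font-specific
// encodings are ignored, so real documents may yield unreadable output.
function extractShownText(string $content): string
{
    preg_match_all('/\((.*?)\)\s*Tj/s', $content, $matches);
    return implode(' ', $matches[1]);
}

// Build a tiny stand-in fragment so the sketch is self-contained.
$content = 'BT /F1 12 Tf 72 720 Td (Hello PDF) Tj ET';
$compressed = gzcompress($content);
$pdf = "%PDF-1.4\n1 0 obj\n<< /Length " . strlen($compressed)
    . " /Filter /FlateDecode >>\nstream\n" . $compressed
    . "\nendstream\nendobj\n";

foreach (extractFlateStreams($pdf) as $stream) {
    echo extractShownText($stream), PHP_EOL;
}
```

Run against the fragment above, this prints the text from the single content stream. A real document would need more care (other filters, indirect lengths, character maps), which is why the extracted text is only ever approximate.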

Zend Lucene And PDF Documents Part 1: PDF Meta Data

22nd October 2009 - 16 minutes read time

Zend Lucene is a powerful search engine, but it does take a bit of setting up to get it working properly. One thing that I have had trouble getting up and running in the past is indexing and searching PDF documents. The difficulty here is that it isn't immediately apparent how you can index the contents of a PDF document with ease. I came across a couple of functions you can try out, but even if they don't work it is possible to create and edit PDF meta data using the Zend_Pdf library. Because there is a lot to cover on this subject I thought I would create a blog post in multiple parts. For this post I will be looking at how to add and edit this meta data. This meta data can be used to classify your PDF documents, allowing you to index them and provide a decent search solution using Zend Lucene.