Parsing XML with PHP

Extracting data from XML is a common task, but to work with the data directly you need to understand how PHP parses XML. Several functions are involved, and they work together to pull data out of an XML document. I will go through each of them and tie them together at the end.

xml_parser_create()

This function is used to create the parser object that will be used by the rest of the process. This object is used to store data and configuration options and is passed to each of the functions involved.

$xml_parser = xml_parser_create();

xml_set_element_handler()

Next we need to set up the functions that will be called while the document is parsed. The xml_set_element_handler() function takes the following parameters:

  • XML parser reference: This is a reference to the parser that was created using the xml_parser_create() function.
  • Start element: This is a callback reference to a function that will be called when a start element is found as the parser runs.
  • End element: This is a callback reference to a function that will be called when an end element is found as the parser runs.

The last two parameters need to be functions with specific signatures. This means that they need to accept the correct number of parameters, but you can call them whatever you want. Here is an example of the call to the function xml_set_element_handler().

xml_set_element_handler($xml_parser, "startElement", "endElement");

The startElement() and endElement() functions will be called automatically by the xml parser object when things are set in motion.
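Since the handlers are ordinary callbacks, their names really are up to you. As a minimal sketch, the handlers below are anonymous functions that record each event in an array instead of printing it. The $events array and the sample XML string are made up for illustration.

```php
<?php
// Illustrative only: $events and the sample document are not part of the article.
$events = [];

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    // Start handler: receives the parser, the element name and its attributes.
    function ($parser, $name, $attribs) use (&$events) {
        $events[] = "start:" . $name;
    },
    // End handler: receives the parser and the element name.
    function ($parser, $name) use (&$events) {
        $events[] = "end:" . $name;
    }
);

// Parse a small document in one go (true marks this as the final chunk).
xml_parse($parser, "<note><to>Tove</to></note>", true);

print_r($events);
```

The names come back in uppercase because case folding is on by default.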

startElement() function

You will need to define a function that reads the start element data and pass its name to xml_set_element_handler(). The function must have the following parameters:

  • Parser: This is the XML parser object that was created in the call to xml_parser_create().
  • Name: The name of the start element.
  • Attribs: This is an associative array of attributes that the start element contains.

So your function might look something like this:

function startElement($xmlParser, $name, $attribs) {
    echo "Start: " . $name . "<br />";
}

All this will do is print out the name of the element, but you can do a lot more. For example, let's say that one of your elements is called <title>. You can use an if or switch statement to record that the parser has reached it and store the value in a variable for later use. Like this:

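function startElement($xmlParser, $name, $attribs) {
    global $variable;
    // With case folding on (the default) the name arrives as 'TITLE',
    // so this case only matches when case folding has been turned off.
    switch ($name) {
        case 'title':
            // Remember which element the parser is currently inside.
            $variable = $name;
            break;
    }
}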

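Remember that the function has to exist by the time the parser runs. An ordinary top-level function declaration can sit anywhere in the file, but a function that is declared conditionally or in an included file must be loaded before the handler is used. Keeping the handler declarations above the call to xml_set_element_handler() is a tidy way to make sure of this.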
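Neither of the startElement() examples uses the $attribs array, so here is a rough sketch of reading an attribute. The <link> element, its href attribute and the $links array are assumptions made up for the example, and the handlers are written as anonymous functions only to keep the snippet self-contained.

```php
<?php
// Hypothetical document: <link> elements carrying an href attribute.
$xml = '<feed><link href="http://example.com/a"/>'
     . '<link href="http://example.com/b"/><title>x</title></feed>';

$links = [];

$parser = xml_parser_create();
// Keep names exactly as they are written in the document.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attribs) use (&$links) {
        // $attribs maps attribute names to their values.
        if ($name === 'link' && isset($attribs['href'])) {
            $links[] = $attribs['href'];
        }
    },
    function ($parser, $name) {
        // Nothing to do when an element closes.
    }
);
xml_parse($parser, $xml, true);

print_r($links);
```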

endElement() function

This function is called when the parser encounters an XML closing element. In the opposite operation to before, you might need to clear the variable you stored in the start element function. As with startElement(), the function has to be defined by the time the parser runs. Note that a self-closing tag such as <br /> still triggers the end element function, immediately after the start element function. The function must have the following parameters:

  • xml_parser: The parser created in the call to xml_parser_create().
  • name: The name of the element.

The following code will just print out the name of the end element, but you can use this function to undo anything that happened in the startElement() function. For example, you may have set a value in startElement() to keep track of how deep the parser is in the XML document, and you can use this function to reduce it. This might be important if there is more than one element with the same name, but in a different context.

function endElement($parser, $name)
{
	echo "End: " . $name . "<br />";
}
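The depth tracking idea mentioned above could look something like the following sketch. The $depth and $lines variables and the sample document are invented for the example. The start handler records the name indented by the current depth and then goes one level deeper, and the end handler comes back up a level.

```php
<?php
// Illustrative document with two <title> elements in different contexts.
$xml = '<book><title>PHP</title><chapter><title>Intro</title></chapter></book>';

$depth = 0;
$lines = [];

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attribs) use (&$depth, &$lines) {
        // Indent the name by two spaces per level, then go one level deeper.
        $lines[] = str_repeat('  ', $depth) . $name;
        $depth++;
    },
    function ($parser, $name) use (&$depth) {
        // Leaving the element, so come back up one level.
        $depth--;
    }
);
xml_parse($parser, $xml, true);

echo implode("\n", $lines), "\n";
```

The two <title> elements end up at different depths, which is enough to tell the book title apart from the chapter title.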

xml_set_character_data_handler()

The next function to call is xml_set_character_data_handler(). This takes two parameters:

  • xml_parser: This is a reference to the XML parser that was created in the call to xml_parser_create().
  • characterData: This is a callback reference to the function that will be called when character data is found.

This function works in the same way as the xml_set_element_handler() function in that it simply sets a reference to the function that will be called when character data is encountered. The function is called like this.

xml_set_character_data_handler($xml_parser, "characterData");

characterData() function

The characterData() function has to be defined by the time the parser runs, like the element handlers, and must have the following parameters:

  • xml_parser: The reference to the XML parser created in the call to xml_parser_create().
  • data: The data held within the XML element. If any CDATA sections have been used then the parser will return everything between the CDATA markers, so there is no need to worry about cutting them out.

So when the parser finds character data this function is called. The following function will just print out the data.

function characterData($parser, $data) {
	echo "Data: " . $data . "<br />";
}

One thing that it is essential to look out for is that the parser does not always pass all of an element's text in a single call. Under certain conditions it will stop, hand over what it has read so far and then call the function again with the next piece. This repeats until all of the data has been passed. I've listed (I think) all of the conditions below.

  • The parser runs into an entity reference, such as &amp; (&) or &#039; (')
  • The parser finishes parsing an entity.
  • The parser runs into the new-line character (\n)
  • The parser runs into a series of tab characters (\t)
  • The content of the $data parameter is more than 1024 bytes.

The best way to explain this is to use an example. Let's say that you have the following string as part of the data.

some text&amp;
some more text&#039;
last bit of text

If you used the previous example method of just printing out the information then the parser will print out the following:

Data: some text
Data: &
Data: some more text
Data: '
Data: last bit of text

So make sure that your handler copes with the character data arriving in several pieces. One thing you could do is to have the characterData() function append the data to a string. The string is initialised when the startElement() function is called and printed out when the endElement() function is called.
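A minimal sketch of that buffering approach might look like this. The $buffer and $values variables are made up for the example, and the finished text is stored in an array rather than printed so that it can be inspected afterwards. Resetting the buffer on every start element is a simplification that only suits elements containing nothing but text.

```php
<?php
// The sample text from above, wrapped in a <title> element for illustration.
$xml = "<title>some text&amp;\nsome more text&#039;\nlast bit of text</title>";

$buffer = '';
$values = [];

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attribs) use (&$buffer) {
        // A new element starts, so begin with an empty buffer.
        $buffer = '';
    },
    function ($parser, $name) use (&$buffer, &$values) {
        // The element is complete, so the buffer now holds all of its text.
        $values[$name] = $buffer;
    }
);
xml_set_character_data_handler(
    $parser,
    function ($parser, $data) use (&$buffer) {
        // May be called several times per element, so append each piece.
        $buffer .= $data;
    }
);
xml_parse($parser, $xml, true);

echo $values['title'], "\n";
```

However many times the character data function is called, the pieces are joined back into the full text by the time the end element function runs.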

xml_parser_set_option()

This function is optional and can be used if you want the parser to behave in a certain way. For example, to turn off case folding on the parser use the following code.

xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);

Case folding is basically the turning of element names into their uppercase equivalent. However, XML is case sensitive, so <Title> and <title> are different elements, and for some reason the default of the parser is for folding to be on. So if the case of your element names matters, make sure that you use this function to turn off case folding. Here is a list of the available options for this function.

  • XML_OPTION_CASE_FOLDING: (integer) Controls whether case-folding is enabled for this XML parser. Enabled by default.
  • XML_OPTION_SKIP_TAGSTART: (integer) Specify how many characters should be skipped in the beginning of a tag name.
  • XML_OPTION_SKIP_WHITE: (integer) Whether to skip values consisting of whitespace characters.
  • XML_OPTION_TARGET_ENCODING: (string) Sets which target encoding to use in this XML parser. By default, it is set to the same as the source encoding used by xml_parser_create(). Supported target encodings are ISO-8859-1, US-ASCII and UTF-8.
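To see what case folding actually does to the names passed to your handlers, here is a small sketch. The elementNames() helper and the sample document are hypothetical and exist only for this comparison.

```php
<?php
// Hypothetical helper: returns the element names as the start handler sees them.
function elementNames(string $xml, bool $caseFolding): array
{
    $names = [];
    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $caseFolding);
    xml_set_element_handler(
        $parser,
        function ($parser, $name, $attribs) use (&$names) {
            $names[] = $name;
        },
        function ($parser, $name) {
            // End elements are not needed for this comparison.
        }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$xml = '<Feed><Title>x</Title></Feed>';

print_r(elementNames($xml, true));  // names folded to uppercase
print_r(elementNames($xml, false)); // names left as written
```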

xml_parse()

This function is used to run the parser over some input. It takes the following parameters:

  • xml_parser: This is an XML parser object created by the xml_parser_create() function.
  • data: A chunk of data to parse. This can be read from a file or a stream.
  • end: (optional) If this is set to true then this is the last bit of data from the source and so this is the last time the function will be run.

As you can see the xml_parse() function can be run over and over again until all of the data has been read from the file.

if (!($fp = fopen("an_xml_file.xml", "r"))) {
    die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d", xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser)));
    }
}
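The same pattern can be tried without a file. In the sketch below the document is split by hand into three strings (the $chunks array is invented for the example, and the breaks deliberately fall in the middle of tags) to show that the parser carries its state from one xml_parse() call to the next. Only the last call passes true as the third parameter.

```php
<?php
// Illustrative chunks: one document cut at awkward places.
$chunks = ['<items><it', 'em>one</item><item>tw', 'o</item></items>'];

$names = [];

$parser = xml_parser_create();
xml_set_element_handler(
    $parser,
    function ($parser, $name, $attribs) use (&$names) {
        $names[] = $name;
    },
    function ($parser, $name) {
        // Nothing to do on end elements here.
    }
);

$last = count($chunks) - 1;
foreach ($chunks as $i => $chunk) {
    // The third parameter is true only for the final chunk.
    if (!xml_parse($parser, $chunk, $i === $last)) {
        throw new RuntimeException(sprintf(
            "XML error: %s at line %d",
            xml_error_string(xml_get_error_code($parser)),
            xml_get_current_line_number($parser)
        ));
    }
}

print_r($names);
```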

xml_parser_free()

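As the name suggests this function is called at the end of the XML parsing run. It basically just clears up the memory and throws away the XML parser object created at the start. From PHP 8.0 onwards the parser is an ordinary object that is cleaned up automatically, so the call no longer has any effect.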

Putting them all together

Just as an example I have put the code together into something that will spit out an XML document as HTML, albeit a little ugly. It is designed as a starting point that you can expand upon to create your own XML parsing script.

// the start element function
function startElement($xmlParser, $name, $attribs) {
	echo "Start: " . $name . "<br />";
}
// the end element function
function endElement($parser, $name) {
	echo "End: " . $name . "<br />";
}
// the character data function
function characterData($parser, $data) {
	echo "Data: " . $data . "<br />";
}
$xml_parser = xml_parser_create();
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen("an_xml_file.xml","r"))) {
    die("could not open XML input");
}
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d", xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser)));
    }
}

 
