Convert A sitemap.xml File To A HTML Sitemap With PHP

Note: This post is over two years old and so the information contained here might be out of date. If you do spot something, please leave a comment and we will endeavour to correct it.

13th August 2008 - 6 minutes read time

I have already talked about converting a sitemap.xml file into a urllist.txt file, but what if you want to create a HTML sitemap? If you have a sitemap.xml file then you can use this to spider your site, scrape the contents of each page and populate the HTML file with this information.

The following code does this. For every page it looks for the title tag, the description meta tag and the first h2 tag on the page. These items are then used to construct a segment of HTML for that page.

<?php
$header = '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>HTML Sitemap</title>
</head>
<body>';
 
set_time_limit(400);
 
$currentElement = '';
$currentLoc = '';
 
$map = "<h1>HTML Sitemap</h1>"."\n";
 
function parsePage($data)
{
 global $map;
 /*
 if you want to trap a certain file extension then use the syntax below...
 stripos($data, ".php")>0
 stripos($data, ".htm")>0
 stripos($data, ".asp")>0
 */
 if ( stripos($data,".pdf") > 0 ) {
  // if the url is a pdf document.
  $map .= '<p><a href="'.$data.'">PDF document.</a></p>'."\n";
  $map .= '<p>A pdf document.</p>'."\n";
 } elseif ( stripos($data, ".txt")>0 ) {
  // if the url is a text document
  $map .= '<p><a href="'.$data.'">Text document.</a></p>'."\n";
  $map .= '<p>A text document.</p>'."\n";
 } else {
  // try to open it anyway...
  // make sure that you can read the file
  if ( $urlh = @fopen($data, 'rb') ) {
   $contents = '';
   // stream_get_contents() is only available from PHP 5 onwards
   if ( version_compare(PHP_VERSION, '5.0.0', '>=') ) {
    $contents = stream_get_contents($urlh);
   } else {
    while ( !feof($urlh) ) {
     $contents .= fread($urlh, 8192);
    };
   };
 
   // find the title (case-insensitive, may span several lines)
   preg_match('/<title[^>]*>\s*(.*?)\s*<\/title>/is', $contents, $title);
   $title = isset($title[1]) ? $title[1] : '';
 
   // find the first h2 tag
   $header = array();
   preg_match('/<h2\b[^>]*>(.*?)<\/h2>/is', $contents, $header);
   $header = isset($header[1]) ? strip_tags($header[1]) : '';
 
   if ( strlen($title) > 0 && strlen($header) > 0 ) {
    // print the title and h2 tag in combo
    $map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($header).'">'.trim($title)." - ".trim($header).'</a></p>'."\n";
   } elseif ( strlen($title) > 0 ) {
    $map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($title).'">'.trim($title).'</a></p>'."\n";
   } elseif ( strlen($header) > 0 ) {
    $map .= '<p class="link"><a href="'.str_replace('&','&amp;',$data).'" title="'.trim($header).'">'.trim($header).'</a></p>'."\n";
   };
 
   // find description
   preg_match('/<meta\s+name="description"\s+content="(.*?)"\s*\/?>/is', $contents, $description);
   $description = isset($description[1]) ? $description[1] : '';
 
   // print description
   if ( strlen($description)>0 ) {
    $map .= '<p class="desc">'.trim($description).'</p>'."\n";
   };
   // close the file
   fclose($urlh);
  };
 };
};
 
/////////// XML PARSE FUNCTIONS HERE /////////////
// the start element function
function startElement($xmlParser, $name, $attribs)
{
 global $currentElement;
 $currentElement = $name;
};
 
// the end element function
function endElement($parser, $name)
{
 global $currentElement,$currentLoc;
 if ( $currentElement == 'loc') {
  parsePage(trim($currentLoc));
  $currentLoc = '';
 };
 $currentElement = '';
};
 
// the character data function
function characterData($parser, $data) 
{
 global $currentElement,$currentLoc;
 // if the current element is loc then it will be a url
 if ( $currentElement == 'loc' ) {
  $currentLoc .= $data;
 };
};
 
// create parse object
$xml_parser = xml_parser_create();
// turn off case folding!
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, false);
// set start and end element functions
xml_set_element_handler($xml_parser,"startElement", "endElement");
// set character data function
xml_set_character_data_handler($xml_parser, "characterData");
 
// open xml file
if ( !($fp = fopen('sitemap.xml', "r")) ) {
 die("could not open XML input");
};
 
// read the file - print error if something went wrong.
while ( $data = fread($fp,4096) ) {
 if ( !xml_parse($xml_parser, $data,feof($fp)) ) {
  die(sprintf("XML error: %s at line %d",xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser)));
 };
};
 
// close file
fclose($fp);
 
$footer = '</body>
</html>';
 
// write output to a file
$fp = fopen('sitemap.html', "w+");
fwrite($fp,$header.$map.$footer);
fclose($fp);
 
// print output
echo $header.$map.$footer;

This script prints out the sitemap and also saves it to sitemap.html for later use. Saving it is essential because the script has to request every page listed in sitemap.xml, so it can take a long time to run.
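
One way to make use of the saved file is to serve it while it is still fresh and only rebuild it once it has gone stale. The following is a minimal sketch of that idea rather than part of the original script: the sitemapIsFresh() helper and the one day lifetime are assumptions made for illustration.

```php
<?php
// Hypothetical helper, not part of the original script: returns true when
// the cached file exists and was modified less than $maxAge seconds ago.
function sitemapIsFresh($file, $maxAge, $now = null)
{
    if ($now === null) {
        $now = time();
    }
    if (!is_file($file)) {
        return false;
    }
    return ($now - filemtime($file)) < $maxAge;
}

$cacheFile = 'sitemap.html'; // the file written by the script above
$maxAge    = 86400;          // assumed lifetime: one day, in seconds

if (sitemapIsFresh($cacheFile, $maxAge)) {
    // serve the saved copy instead of spidering the whole site again
    readfile($cacheFile);
    exit;
}
// ...otherwise carry on and run the generation code above...
```

Placed at the top of the script, a check like this means that only the first visitor after the cache expires has to wait for the full crawl.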

This script is fairly complicated and has gone through several versions since I first created it so if you find any improvements or bugs then let me know and I will incorporate them.
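
One possible improvement would be to pull out the title, description and first h2 with PHP's DOM extension rather than regular expressions, as it tends to cope better with attributes and untidy markup. This is only a sketch, assuming the DOM extension is installed; extractPageInfo() is a hypothetical function name and is not part of the script above.

```php
<?php
// Hypothetical helper: extract the title, meta description and first h2
// from an HTML string using DOMDocument instead of regular expressions.
function extractPageInfo($html)
{
    $info = array('title' => '', 'header' => '', 'description' => '');
    if (trim($html) === '') {
        return $info;
    }

    $doc = new DOMDocument();
    // real-world HTML is rarely valid, so collect parse warnings quietly
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // the page title
    $titles = $doc->getElementsByTagName('title');
    if ($titles->length > 0) {
        $info['title'] = trim($titles->item(0)->textContent);
    }

    // the first h2, with any nested markup reduced to plain text
    $headers = $doc->getElementsByTagName('h2');
    if ($headers->length > 0) {
        $info['header'] = trim($headers->item(0)->textContent);
    }

    // the description meta tag, whatever order its attributes are in
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $info['description'] = trim($meta->getAttribute('content'));
            break;
        }
    }

    return $info;
}

// made-up sample page, used only to demonstrate the helper
$sample = '<html><head><title> Example </title>'
    . '<meta name="description" content="A sample page." /></head>'
    . '<body><h2 class="intro">First <em>heading</em></h2><h2>Second</h2></body></html>';

print_r(extractPageInfo($sample));
```

The returned array could then be used inside parsePage() in place of the three preg_match() calls.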

Comments

Amazing script! But what is the file it creates so it doesn't have to keep recreating the same sitemap? I can't seem to find it after I load the HTML sitemap. Thanks! John

Submitted by john on Wed, 07/28/2010 - 13:06

The file is called sitemap.html. The write function is on line 137. The script is quite old now, though, so I would probably write it in a different way if I did the same thing today.

Philip Norton

Submitted by philipnorton42 on Wed, 07/28/2010 - 20:07

This script is incredible, I was wondering if you could help me modify it so that it does a pagination, like http://example.com/sitemap.php?page=1 (After 1000 Links on the Page) = http://example.com/sitemap.php?page=2 etc. is it possible? I checked the whole internet and could not find any, the only things people offer are site scrapers that make a sitemap in xml, i need the reverse xD

Submitted by mikula on Tue, 08/21/2012 - 06:45
