Convert HTML To ASCII With PHP

26th April 2008 - 3 minutes read time

Converting HTML into ASCII is the reverse of turning ASCII text into HTML. To that end, here is a little function that does it.

function html2ascii($s) {
  // convert links to "text (url)"
  $s = preg_replace('/<a\s+(?:[^>]*?\s+)?href="?([^\">]*)"?[^>]*>(.*?)<\/a>/i', '$2 ($1)', $s);

  // convert br, hr, p and div tags to line breaks
  $s = preg_replace('@<(b|h)r\b[^>]*>@i', "\n", $s);
  $s = preg_replace('@<p\b[^>]*>@i', "\n\n", $s);
  $s = preg_replace('@<div\b[^>]*>(.*?)</div>@i', "\n" . '$1' . "\n", $s);

  // convert bold and italic tags; \b stops <b matching <body> or <i matching <img>
  $s = preg_replace('@<b\b[^>]*>(.*?)</b>@i', '*$1*', $s);
  $s = preg_replace('@<strong\b[^>]*>(.*?)</strong>@i', '*$1*', $s);
  $s = preg_replace('@<i\b[^>]*>(.*?)</i>@i', '_$1_', $s);
  $s = preg_replace('@<em\b[^>]*>(.*?)</em>@i', '_$1_', $s);

  // decode any named entities
  $s = strtr($s, array_flip(get_html_translation_table(HTML_ENTITIES)));

  // decode numbered entities (the /e modifier only works before PHP 7.0)
  $s = preg_replace('/&#(\d+);/e', 'chr(str_replace(";", "", str_replace("&#","","$0")))', $s);

  // strip any remaining HTML tags
  $s = strip_tags($s);

  // return the string
  return $s;
}

To use this function just pass it a string. Here is an example of it at work.

$htmlString = '<p>This is some <strong>XHTML</strong> markup that <em>will</em> be<br />
turned <a href="http://www.hashbangcode.com/" title="#! code">into</a> an ascii string.</p>';

echo html2ascii($htmlString);

This produces the following output (shown without the blank lines that the p and br conversions add).

This is some *XHTML* markup that _will_ be
turned into (http://www.hashbangcode.com/) an ascii string.

Update:

It turns out that the 'e' modifier in preg_replace() is no longer valid: it was deprecated in PHP 5.5 and removed in PHP 7.0, so you need to use preg_replace_callback() instead. That function must be passed a callback rather than a string of code to evaluate.
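As a minimal sketch of that change in isolation, the numbered-entity step could look something like the following. The name decodeNumberedEntities() is made up for this illustration, and the check that leaves code points above 255 untouched is my own assumption, since chr() only produces single bytes.

```php
<?php
// Hypothetical helper for illustration; not part of the original function.
function decodeNumberedEntities($s) {
  // The second argument is a callable that receives the match array:
  // $matches[0] is the whole entity and $matches[1] is just the digits.
  return preg_replace_callback('/&#(\d+);/', function ($matches) {
    $code = (int) $matches[1];
    // Assumption: entities that do not fit in one byte are left as they are.
    return $code < 256 ? chr($code) : $matches[0];
  }, $s);
}

echo decodeNumberedEntities('&#35;&#33; code &#8364;');
```

This prints "#! code &#8364;", because 35 and 33 are the codes for "#" and "!" while 8364 is too large for chr() and is passed through unchanged.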

I've updated the function a little here:

function html2ascii($s) {
  // convert links to "text (url)"
  $s = preg_replace('/<a\s+(?:[^>]*?\s+)?href="?([^\">]*)"?[^>]*>(.*?)<\/a>/i', '$2 ($1)', $s);

  // convert br, hr, p and div tags to line breaks
  $s = preg_replace('@<(b|h)r\b[^>]*>@i', "\n", $s);
  $s = preg_replace('@<p\b[^>]*>@i', "\n\n", $s);
  $s = preg_replace('@<div\b[^>]*>(.*?)</div>@i', "\n" . '$1' . "\n", $s);

  // convert bold and italic tags; \b stops <b matching <body> or <i matching <img>
  $s = preg_replace('@<b\b[^>]*>(.*?)</b>@i', '*$1*', $s);
  $s = preg_replace('@<strong\b[^>]*>(.*?)</strong>@i', '*$1*', $s);
  $s = preg_replace('@<i\b[^>]*>(.*?)</i>@i', '_$1_', $s);
  $s = preg_replace('@<em\b[^>]*>(.*?)</em>@i', '_$1_', $s);

  // strip any remaining HTML tags before decoding entities,
  // so that escaped text such as &lt;b&gt; is not mistaken for markup
  $s = strip_tags($s);

  // decode any named entities
  $s = strtr($s, array_flip(get_html_translation_table(HTML_ENTITIES)));

  // decode numbered entities; chr() only covers single bytes,
  // so larger code points are left as they are
  $s = preg_replace_callback('/&#(\d+);/', function ($matches) {
    $code = (int) $matches[1];
    return $code < 256 ? chr($code) : $matches[0];
  }, $s);

  // return the string
  return $s;
}

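However, I really wouldn't use regular expressions to parse HTML. Nested or malformed markup quickly defeats patterns like these, and a proper DOM parser copes with it far more reliably.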
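For comparison, here is a rough sketch of the same idea built on PHP's DOM extension rather than regular expressions. The names html2asciiDom() and domNodeToAscii() are made up for this example, the set of tags handled simply mirrors the function above, and the sketch assumes the DOM extension is available and that the input is ASCII (with entities for anything else). Treat it as a starting point rather than a finished converter.

```php
<?php
// Hypothetical helper: renders one DOM node (and its children) as plain text.
function domNodeToAscii(DOMNode $node) {
  if ($node instanceof DOMText) {
    // Text nodes arrive with entities already decoded by the parser.
    return (string) $node->nodeValue;
  }
  if (!($node instanceof DOMElement)) {
    // Skip comments, processing instructions and the like.
    return '';
  }

  $name = strtolower($node->nodeName);
  if ($name === 'script' || $name === 'style') {
    return '';
  }

  // Render the children first so nested tags are handled naturally.
  $inner = '';
  foreach ($node->childNodes as $child) {
    $inner .= domNodeToAscii($child);
  }

  switch ($name) {
    case 'a':
      $href = $node->getAttribute('href');
      return $href === '' ? $inner : $inner . ' (' . $href . ')';
    case 'b':
    case 'strong':
      return '*' . $inner . '*';
    case 'i':
    case 'em':
      return '_' . $inner . '_';
    case 'br':
    case 'hr':
      return "\n";
    case 'p':
    case 'div':
      return "\n" . $inner . "\n";
    default:
      return $inner;
  }
}

// Hypothetical entry point: parses an HTML fragment and converts it.
function html2asciiDom($html) {
  if (trim($html) === '') {
    return '';
  }
  $doc = new DOMDocument();
  // Collect parser warnings about sloppy markup instead of printing them.
  $previous = libxml_use_internal_errors(true);
  $doc->loadHTML($html);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);

  // The parser wraps fragments in html and body elements.
  $body = $doc->getElementsByTagName('body')->item(0);
  return $body === null ? '' : trim(domNodeToAscii($body));
}

echo html2asciiDom('<p>Some <strong>bold</strong> text and <a href="http://www.hashbangcode.com/">a link</a>.</p>');
```

This prints "Some *bold* text and a link (http://www.hashbangcode.com/)." Because the parser builds a tree, nested tags and entities are handled without any extra patterns, which is the main reason to prefer this route.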

Comments

I got an error on line 19 --> $s = preg_replace('//e','chr(\\1)',$s); Warning: Wrong parameter count for chr() in C:\PHP-test\xxxxx.php (??) : regexp code on line 1

Marsel (Fri, 11/21/2008 - 03:29)

You are quite right, that would never work! I have updated the script with the fix.

preg_replace_callback(): Requires argument 2, 'chr(str_replace(";", "", str_replace("&#","","$0")))', to be a valid callback

Tradesouthwest (Sat, 08/07/2021 - 02:15)

Thanks for the info Tradesouthwest. I have updated the post with some changes.
