The reverse of turning ASCII text into HTML is converting HTML into ASCII. Here is a little function that does just that.
function html2ascii($s) {
    // convert links into "text (url)"
    $s = preg_replace('/<a\s+[^>]*?href="?([^\">]*)"?[^>]*>(.*?)<\/a>/i','$2 ($1)',$s);
    // convert p, br and hr tags
    $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
    $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
    $s = preg_replace('@<div[^>]*>(.*)</div>@i',"\n".'$1'."\n",$s);
    // convert bold and italic tags
    $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
    $s = preg_replace('@<strong[^>]*>(.*?)</strong>@i','*$1*',$s);
    $s = preg_replace('@<i[^>]*>(.*?)</i>@i','_$1_',$s);
    $s = preg_replace('@<em[^>]*>(.*?)</em>@i','_$1_',$s);
    // decode any entities
    $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));
    // decode numbered entities (the 'e' modifier only works before PHP 7)
    $s = preg_replace('/&#(\d+);/e','chr(str_replace(";", "", str_replace("&#","","$0")))', $s);
    // strip any remaining HTML tags
    $s = strip_tags($s);
    // return the string
    return $s;
}
To use this function just pass it a string. Here is an example of it at work.
$htmlString = '<p>This is some <strong>XHTML</strong> markup that <em>will</em> be<br />
turned <a href="http://www.hashbangcode.com/" title="#! code">into</a> an ascii string.</p>';
echo html2ascii($htmlString);
This produces the following output (the blank lines added by the p and br conversions are omitted here).
This is some *XHTML* markup that _will_ be
turned into (http://www.hashbangcode.com/) an ascii string.
Update:
It turns out that the 'e' modifier in preg_replace() is no longer valid (it was deprecated in PHP 5.5 and removed in PHP 7.0), so you need to use preg_replace_callback() instead. That function takes a callback as its second argument rather than a string of code to evaluate.
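As a small illustration of the callback form, here is the numbered-entity step on its own. The name decodeDecimalEntities() is invented for this sketch, and it assumes the entity codes fit in a single byte (0 to 255), since chr() only ever returns one byte.

```php
<?php
// Hypothetical helper (not part of the original function): decodes decimal
// entities such as &#65; by passing a callback to preg_replace_callback().
function decodeDecimalEntities(string $s): string
{
    $result = preg_replace_callback(
        '/&#(\d+);/',
        function (array $matches): string {
            // $matches[0] is the whole entity, $matches[1] is just the digits.
            // Assumption: the code is in the 0-255 range that chr() supports.
            return chr((int) $matches[1]);
        },
        $s
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $s;
}

echo decodeDecimalEntities('&#72;&#105;&#33;'); // prints "Hi!"
```

Using the captured group directly avoids the nested str_replace() calls that the old 'e' version needed to strip the "&#" and ";" characters.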
I've updated the function a little here:
function html2ascii($s) {
    // convert links into "text (url)"
    $s = preg_replace('/<a\s+[^>]*?href="?([^\">]*)"?[^>]*>(.*?)<\/a>/i', '$2 ($1)', $s);
    // convert p, br and hr tags (these don't need to be followed by another tag)
    $s = preg_replace('@<(b|h)r[^>]*>@i', "\n", $s);
    $s = preg_replace('@<p\b[^>]*>@i', "\n\n", $s);
    $s = preg_replace('@<div\b[^>]*>(.*)(?=\<)@i', "\n" . '$1' . "\n", $s);
    // convert bold and italic tags; the lookahead stops at the next tag, and
    // \b keeps <b and <i from matching longer names such as <body> or <img>
    $s = preg_replace('@<b\b[^>]*>(.*?)(?=\<)@i', '*$1*', $s);
    $s = preg_replace('@<strong\b[^>]*>(.*?)(?=\<)@i', '*$1*', $s);
    $s = preg_replace('@<i\b[^>]*>(.*?)(?=\<)@i', '_$1_', $s);
    $s = preg_replace('@<em\b[^>]*>(.*?)(?=\<)@i', '_$1_', $s);
    // decode any entities
    $s = strtr($s, array_flip(get_html_translation_table(HTML_ENTITIES)));
    // decode numbered entities; $matches[1] holds the digits captured by (\d+)
    $s = preg_replace_callback('/&#(\d+);/', function($matches) {
        return chr((int) $matches[1]);
    }, $s);
    // strip any remaining HTML tags (including the leftover closing tags)
    $s = strip_tags($s);
    // return the string
    return $s;
}
However, I really wouldn't use regular expressions to parse HTML.
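For comparison, here is a minimal sketch of the same conversion using PHP's bundled DOM extension instead of regular expressions. The function names html2asciiDom() and renderNode() are made up for this example, the meta tag is only there as an assumption that the input is UTF-8, and the output conventions simply mirror the regex version above. Treat it as a starting point rather than a drop-in replacement.

```php
<?php
// Hypothetical DOM-based variant of html2ascii(); requires the DOM extension.
function html2asciiDom(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: the input is UTF-8, so hint the charset to the parser.
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    $doc->loadHTML($meta . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : renderNode($body);
}

// Recursively turns a DOM node into plain text.
function renderNode(DOMNode $node): string
{
    if (!($node instanceof DOMElement)) {
        // Text nodes already have their entities decoded; skip comments etc.
        return $node instanceof DOMText ? (string) $node->nodeValue : '';
    }

    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= renderNode($child);
    }

    switch (strtolower($node->nodeName)) {
        case 'br':
        case 'hr':
            return "\n";
        case 'p':
            return "\n\n" . $inner;
        case 'div':
            return "\n" . $inner . "\n";
        case 'b':
        case 'strong':
            return '*' . $inner . '*';
        case 'i':
        case 'em':
            return '_' . $inner . '_';
        case 'a':
            $href = $node->getAttribute('href');
            return $href === '' ? $inner : $inner . ' (' . $href . ')';
        default:
            // Unknown tags are dropped but their text is kept.
            return $inner;
    }
}

$htmlString = '<p>This is some <strong>XHTML</strong> markup that <em>will</em> be<br />
turned <a href="http://www.hashbangcode.com/" title="#! code">into</a> an ascii string.</p>';
echo html2asciiDom($htmlString);
```

Because the parser builds a tree, nested tags such as an em inside a strong are handled without any extra patterns, which is where the regex version tends to go wrong.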
Comments
preg_replace_callback(): Requires argument 2, 'chr(str_replace(";", "", str_replace("&#","","$0")))', to be a valid callback
Thanks for the info Tradesouthwest. I have updated the post with some changes.
Really helpful piece of code -- thanks! I had to add some replacements for my specific case, but the rest worked well (after all these years).
// LER: begin
// undo my prior replace space and tag chars
$s = str_replace('◃', '<', $s);
$s = str_replace('▹', '>', $s);
$s = str_replace(' ', ' ', $s);
// convert my line numbers, after above, * or **\d{1,3} to \n
$s = preg_replace('@\*{1,3}\d{1,4}:\[email protected]',"\n", $s);
// strtr above is changing some spaces or tabs to hex a0 and c2 chars, undo it
$s = preg_replace('@\[email protected]', ' ', $s);
$s = preg_replace('@\[email protected]', '', $s);
// my tabs, but above replacements resulted in 3 not 4 spaces per tab
$s = str_replace(' ', "\t", $s);
$s = str_replace(" \n", "\n", $s); // ok; preg_replace('@\[email protected]', "\n", $s) FAILED!
// LER: end
Thanks for the info Lawrence, glad you found it useful!