Extract Links From An HTML File With PHP

6th March 2008 - 1 minute read time

Use the following function to extract all of the links from an HTML string.

// Return an array of array(href, link text) pairs for every anchor
// tag found in the supplied HTML string.
function linkExtractor($html)
{
 $linkArray = array();
 // Group 1 captures the href value, group 2 the link text.
 if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i', $html, $matches, PREG_SET_ORDER)) {
  foreach ($matches as $match) {
   $linkArray[] = array($match[1], $match[2]);
  }
 }
 return $linkArray;
}
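To illustrate what the regular expression captures, here is a minimal standalone sketch using the same pattern on a small sample string (the sample HTML is invented for the example):

```php
<?php
// A small invented sample: one plain anchor and one with an extra attribute.
$html = '<p><a href="https://example.com">Example</a> and '
      . '<a class="x" href="/about">About</a></p>';

// The same pattern the function above uses.
preg_match_all('/<a\s+.*?href=["\']?([^"\' >]*)["\']?[^>]*>(.*?)<\/a>/i',
               $html, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
 // $m[1] is the href value, $m[2] is the link text.
 echo $m[1] . ' => ' . $m[2] . "\n";
}
```

This prints one `href => text` line per anchor, which matches the pairs the function collects.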

To use it, just read a web page or file into a string and pass that string to the function. The following example reads a web page using the PHP cURL functions and then passes the result into the function to retrieve the links.

$url = 'http://www.hashbangcode.com';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
$html = curl_exec($ch);
curl_close($ch);
echo '<pre>' . print_r(linkExtractor($html), true) . '</pre>';

The function will return an array, with each element being an array containing the link location and the text that the link contains.
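Regular expressions can trip over unusual markup, so as a more tolerant alternative here is a sketch using PHP's built-in DOMDocument class. It returns the same shape of array as linkExtractor; the function name is invented for the example:

```php
<?php
// A DOM-based variant: returns array(array(href, link text), ...) like the
// regex version, but lets the HTML parser cope with messy markup.
function linkExtractorDom($html)
{
 $links = array();
 $doc = new DOMDocument();
 // Suppress warnings about imperfect real-world HTML.
 @$doc->loadHTML($html);
 foreach ($doc->getElementsByTagName('a') as $anchor) {
  $links[] = array($anchor->getAttribute('href'), $anchor->textContent);
 }
 return $links;
}

print_r(linkExtractorDom('<a href="/home">Home</a>'));
```

This can be dropped in wherever linkExtractor is used, since both return the same array structure.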

Comments

Cool Scripts.

Mark James (Thu, 09/04/2008 - 09:27)

Hi Philip, I have a similar script:

while (!feof($page)) {
 $line = fgets($page, 255);
 while (eregi('HREF="[^"]*"', $line, $match)) {
  print($match[0] . "\n");
  $replace = ereg_replace("\?", "\?", $match[0]);
  $line = ereg_replace($replace, "", $line);
 }
}
fclose($page);
How do I get only the links with a .zip extension, and strip the leading href=" and the trailing quote from each line as well?

Bilal (Sun, 05/10/2015 - 21:24)
