Extract Keywords From A Text String With PHP

A problem I have run into more than once is needing a set of keywords for the meta keywords field in a CMS or an old copy of WordPress. To solve this I put together a simple function that runs through a string and returns the most commonly used words in it as an array. The number of words returned is currently set to 10, but you can change that quite easily.

The first thing the function defines is a list of "stop" words. These are words that occur so often in English text that they would otherwise dominate the results. The function also uses a variant of the slug function to remove any odd characters that might be in the text.
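To make the clean-up step easier to follow, here is a minimal sketch that runs it on its own. The cleanText() helper, the sample sentence and the shortened stop word list are made up for illustration, and the first pattern assumes the intent is to collapse runs of whitespace into a single space.

```php
<?php
// Hypothetical helper that isolates the clean-up lines of the function.
function cleanText($string)
{
    $string = preg_replace('/\s\s+/', ' ', $string);          // assumed intent: collapse runs of whitespace
    $string = trim($string);                                  // drop leading and trailing whitespace
    $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);  // keep letters, digits, spaces and dashes
    return strtolower($string);
}

// Shortened stop word list, for illustration only.
$stopWords = array('are', 'is', 'the');

$clean = cleanText("  Vending   Machines are GREAT, aren't they?  ");
echo $clean, "\n"; // vending machines are great arent they

// Drop the stop words from the cleaned text.
$words = array_values(array_diff(explode(' ', $clean), $stopWords));
echo implode(',', $words), "\n"; // vending,machines,great,arent,they
```

Note that apostrophes are stripped along with the rest of the punctuation, so "aren't" ends up as "arent".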

function extractCommonWords($string){
      $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');
   
      $string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase
   
      preg_match_all('/\b.*?\b/i', $string, $matchWords);
      $matchWords = $matchWords[0];
      
      foreach ( $matchWords as $key=>$item ) {
          if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
              unset($matchWords[$key]);
          }
      }   
      $wordCountArr = array();
      if ( is_array($matchWords) ) {
          foreach ( $matchWords as $key => $val ) {
              $val = strtolower($val);
              if ( isset($wordCountArr[$val]) ) {
                  $wordCountArr[$val]++;
              } else {
                  $wordCountArr[$val] = 1;
              }
          }
      }
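      arsort($wordCountArr); // sort by count, highest first, keeping the words as keys
      $wordCountArr = array_slice($wordCountArr, 0, 10, true); // true keeps numeric words such as "2010" as keys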
      return $wordCountArr;
}

The function returns the 10 most commonly occurring words as an array, with the word as the key and the number of times it occurs as the value. To get the words as a string, pass the result of array_keys() to implode(). To change the number of words returned, alter the third parameter of the array_slice() call near the return statement, currently set to 10. Here is an example of the function in action.

$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));

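This produces the following output on PHP 8. On older versions of PHP, where sorting is not stable, words with equal counts may come out in a different order.

some,text,vending,machines,great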
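If you would rather not edit the function every time the number of keywords changes, the limit can be passed in as an argument. The following is only a sketch of that idea: the name extractCommonWordsLimited() and the $limit and $minLength parameters are my own inventions, and the word splitting uses explode() rather than the regular expression in the function above.

```php
<?php
// Hypothetical variant: the limit and minimum word length are parameters
// instead of hard-coded values.
function extractCommonWordsLimited($string, array $stopWords, $limit = 10, $minLength = 4)
{
    $string = strtolower(trim(preg_replace('/\s\s+/', ' ', $string)));
    $string = preg_replace('/[^a-z0-9 -]/', '', $string);

    $counts = array();
    foreach (explode(' ', $string) as $word) {
        // Skip short words (including empty strings) and stop words.
        if (strlen($word) < $minLength || in_array($word, $stopWords)) {
            continue;
        }
        $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
    }

    arsort($counts); // highest count first
    return array_slice($counts, 0, $limit, true); // true preserves numeric keys
}

$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWordsLimited($text, array('this', 'is', 'are'), 2);

// Only the two words that occur twice ("some" and "text") are returned.
$keywords = implode(',', array_keys($words));
echo '<meta name="keywords" content="' . htmlspecialchars($keywords) . '">', "\n";
```

The htmlspecialchars() call is there as a precaution when printing the keywords into an HTML attribute.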

Comments

split a command stored in a variable

Hello, can anyone help me with a problem?
I have a variable containing an HTML tag like this:

$str="<a id=home href=http://d-grund.dk/ title=D-Grund.DK target=_blank><img id=home src=http://d_grund.dk/images/homelogo.gif alt=D-Grund.DK border=1 width=166 height=90 /></a>";

I want to split it up so that I can change the values and then put it back together again to output with echo $str;

Sorry for my bad English.

I have searched the entire net and found nothing that does what I need, so I hope someone can help me.

My own solution is not good because it is big and time consuming.

I have found some code on the internet that looks like what I need, but I can't put the string back together again without destroying something.

sincerely LAT D-Grund.dk Administrator

my email is lat@ofir.dk

philipnorton42

I would love to help but I'm not sure what you are trying to do. Passing this string through the extractCommonWords() function probably won't produce any meaningful output as the string is just HTML containing a link and an image.

The script doesn't want to work :( and it is exactly what I need.

<?php
function extractCommonWords($string){
$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www');

$string = preg_replace('/\s\s+/i', ' ', $string);
$string = trim($string); // trim the string
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
$string = strtolower($string); // make it lowercase

preg_match_all('/\b.*?\b/i', $string, $matchWords);
$matchWords = $matchWords[0];

foreach ( $matchWords as $key=>$item ) {
if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
unset($matchWords[$key]);
}
}
$wordCountArr = array();
if ( is_array($matchWords) ) {
foreach ( $matchWords as $key => $val ) {
$val = strtolower($val);
if ( isset($wordCountArr[$val]) ) {
$wordCountArr[$val]++;
} else {
$wordCountArr[$val] = 1;
}
}
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;
}
}

$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
?>

philipnorton42

Thanks for the input. After running the script it was clear that it wouldn't work, as there was one curly brace too many in the function.
I have removed it from the example, so it will run now.

Do you know what the problem might be? When I get my keywords with your script, I always get the letter "p" added right before the first keyword and right after the last keyword, so the keyword list looks like this:

pkeyword1, keyword2...... keyword10p

I don't know what causes this. Thanks.

philipnorton42

Are you passing it an HTML string with p tags at the start and end? Try using strip_tags() first.
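A quick way to see why this happens is to run the clean-up pattern from the function on a small piece of HTML, with and without strip_tags(). The sample string here is made up for illustration.

```php
<?php
// Hypothetical input wrapped in paragraph tags.
$html = '<p>Vending machines are great.</p>';

// The clean-up pattern removes "<", ">" and "/" but keeps the letter "p".
$withTags = strtolower(preg_replace('/[^a-zA-Z0-9 -]/', '', $html));
echo $withTags, "\n"; // pvending machines are greatp

// Stripping the tags first leaves only the text content.
$withoutTags = strtolower(preg_replace('/[^a-zA-Z0-9 -]/', '', strip_tags($html)));
echo $withoutTags, "\n"; // vending machines are great
```

Passing strip_tags($text) to extractCommonWords() instead of $text should therefore get rid of the stray letters.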
