Wednesday, July 29, 2009 - 09:27
A common problem I have come across is that I have a CMS, or an old copy of WordPress, and I need to generate a set of keywords for the meta keywords field. To solve this I put together a simple function that runs through a string and returns the most commonly used words as an array. It currently returns the top 10 words, but you can change that quite easily.
The first thing the function defines is a list of "stop" words. These are words that occur so often in English text that they would otherwise dominate the results. The function also uses a variant of the slug function to remove any odd characters that might be in the text, and it discards any word of three characters or fewer.
function extractCommonWords($string){
    $stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','www');

    $string = preg_replace('/\s\s+/i', ' ', $string); // collapse runs of whitespace into a single space
    $string = trim($string); // trim the string
    $string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…
    $string = strtolower($string); // make it lowercase

    preg_match_all('/\b.*?\b/i', $string, $matchWords); // split the text at word boundaries
    $matchWords = $matchWords[0];

    foreach ( $matchWords as $key=>$item ) {
        if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
            unset($matchWords[$key]); // drop empty matches, stop words and short words
        }
    }
    $wordCountArr = array();
    if ( is_array($matchWords) ) {
        foreach ( $matchWords as $key => $val ) {
            $val = strtolower($val);
            if ( isset($wordCountArr[$val]) ) {
                $wordCountArr[$val]++;
            } else {
                $wordCountArr[$val] = 1;
            }
        }
    }
    arsort($wordCountArr); // most frequent words first
    $wordCountArr = array_slice($wordCountArr, 0, 10);
    return $wordCountArr;
}
The function returns the 10 most commonly occurring words as an array, with the word as the key and the number of times it occurs as the value. To extract the words just use the implode() function in conjunction with the array_keys() function. To change the number of words returned, alter the third parameter of the array_slice() call near the return statement, currently set to 10. Here is an example of the function in action.
$text = "This is some text. This is some text. Vending Machines are great.";
$words = extractCommonWords($text);
echo implode(',', array_keys($words));
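On PHP 8, where sorting is stable, this produces the following output (older versions may order words with equal counts differently).
some,text,vending,machines,great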
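Since the goal is a meta keywords field, here is a minimal sketch of turning the returned array into a tag. The helper name buildMetaKeywordsTag() is hypothetical and not part of the function above, and the sample array only mimics the shape that extractCommonWords() returns.

```php
<?php
// Hypothetical helper for illustration: builds a meta keywords tag from
// a word => count array such as the one extractCommonWords() returns.
function buildMetaKeywordsTag(array $wordCounts)
{
    // The words are the array keys. Cast them to strings because PHP
    // stores purely numeric keys such as "2009" as integers.
    $keywords = array_map('strval', array_keys($wordCounts));

    // Escape the list before placing it inside an HTML attribute.
    $content = htmlspecialchars(implode(',', $keywords), ENT_QUOTES, 'UTF-8');

    return '<meta name="keywords" content="' . $content . '" />';
}

// Sample data in the same shape as the function's return value.
$sample = array('some' => 2, 'text' => 2, 'vending' => 1);
echo buildMetaKeywordsTag($sample), "\n";
```

The escaping step matters little for the words this function produces, since odd characters are already stripped, but it keeps the helper safe if the cleaning rules are relaxed later.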
Comments
Submitted by Anonymous (not verified) on Sun, 06/20/2010 - 13:29 Permalink
$str="<a id=home href=http://d-grund.dk/ title=D-Grund.DK target=_blank><img id=home src=http://d_grund.dk/images/homelogo.gif alt=D-Grund.DK border=1 width=166 height=90 /></a>";
and I want to split it up so I can change the values and assemble it again to work with echo $str;
Sorry for my bad English. I have searched the entire net and nothing is what I need, so I hope someone can help me. My solution is not good because it is big and time-consuming. I have found some code on the internet that looks like what I need, but I can't assemble it again without destroying anything.
Sincerely, LAT, D-Grund.dk Administrator. My email is lat@ofir.dk
Submitted by Jaco (not verified) on Fri, 07/02/2010 - 13:21 Permalink
Submitted by Dusan (not verified) on Tue, 07/20/2010 - 01:49 Permalink
Submitted by Ronnie (not verified) on Fri, 09/03/2010 - 19:00 Permalink
strtolower() call inside both foreach loops is really not needed either since you have already converted $string to lowercase beforehand.
Submitted by Guillermo (not verified) on Wed, 01/19/2011 - 05:17 Permalink
Submitted by Guillermo (not verified) on Wed, 01/19/2011 - 05:23 Permalink
Submitted by Guillermo (not verified) on Wed, 01/19/2011 - 23:06 Permalink
Submitted by Matt Hawkins (not verified) on Thu, 01/27/2011 - 11:39 Permalink
if ( $item == '' || in_array($item, $stopWords) || strlen($item) <= 3 ) {
function extractCommonWords($string,$count){
    ...
    $wordCountArr = array_slice($wordCountArr, 0, $count);
    return $wordCountArr;
}
Submitted by Brit (not verified) on Wed, 03/23/2011 - 11:52 Permalink
Brilliant. Thanks for this piece of code. I was after doing something very similar so have used this as a base and made a few tweaks here and there so it fits my needs.
Submitted by Mkj (not verified) on Sat, 05/21/2011 - 10:37 Permalink
Hi
Great code and is working a treat on my site but I have a problem with words containing 'ss'. The double ss is being removed so a word such as 'password' is being seen as 'paword' in the keywords.
Regards
Submitted by Tolenca (not verified) on Wed, 06/22/2011 - 20:46 Permalink
Hi, could you please let me know how I could include Spanish characters too?
For example, I need to print 'tamaño' but instead I get 'tamao'.
Thanks in advance
Tolenca
You need to add those special characters to the regular expression on line 6 of the example above, so that it accepts the Spanish letters as well as the alphanumeric characters. I think that should work.
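As a rough sketch of that suggestion, the cleaning steps could be made Unicode-aware instead of listing each accented letter by hand. This assumes UTF-8 input and the mbstring extension; the function name cleanForKeywords() is hypothetical and only covers the cleaning part of the original function.

```php
<?php
// Sketch only: a Unicode-aware version of the cleaning steps.
// Assumes UTF-8 input and that the mbstring extension is available.
function cleanForKeywords($string)
{
    // Collapse runs of whitespace into a single space, then trim.
    $string = trim(preg_replace('/\s+/u', ' ', $string));

    // Keep letters (\p{L}), combining marks (\p{M}) and digits (\p{N})
    // from any script, plus spaces and dashes; strip everything else.
    $string = preg_replace('/[^\p{L}\p{M}\p{N} -]/u', '', $string);

    // strtolower() is only reliable for ASCII, so use the multibyte variant.
    return mb_strtolower($string, 'UTF-8');
}

echo cleanForKeywords('El Tamaño, del archivo!'), "\n"; // el tamaño del archivo
```

If this replaces the cleaning lines in the full function, the length check would probably need mb_strlen() rather than strlen(), because strlen() counts bytes and a word like 'año' is four bytes in UTF-8. The stop word list would also need entries for the language in question.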
Submitted by Joe Bowman (not verified) on Fri, 09/02/2011 - 15:09 Permalink
instead of
$wordCountArr = array();
if ( is_array($matchWords) ) {
    foreach ( $matchWords as $key => $val ) {
        $val = strtolower($val);
        if ( !isset($wordCountArr[$val]) ) {
            $wordCountArr[$val] = array();
        }
        if ( isset($wordCountArr[$val]['count']) ) {
            $wordCountArr[$val]['count']++;
        } else {
            $wordCountArr[$val]['count'] = 1;
        }
    }
}
I did:
$ignoreOccur = array(1,2);
$wordCountArr = array_diff(array_count_values($matchWords), $ignoreOccur);
This gives an associative array $wordCountArr of [word] => [occurrences], ignoring words that have occurred 1 or 2 (or a specified number of) times.
Interesting take on it. I like it! :)
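To make that approach concrete, here is a small self-contained sketch. The word list is made-up sample data standing in for $matchWords after the stop word filter has run, and only counts of 1 are ignored so the effect is easy to see.

```php
<?php
// Sample data standing in for $matchWords after the stop word filter.
$matchWords = array('vending', 'machines', 'vending', 'text', 'vending', 'text', 'snack');

// array_count_values() builds the word => count map in one call.
// array_diff() then drops every entry whose count appears in $ignoreOccur.
$ignoreOccur = array(1);
$wordCountArr = array_diff(array_count_values($matchWords), $ignoreOccur);

arsort($wordCountArr); // most frequent words first
print_r($wordCountArr);
```

Here 'machines' and 'snack' occur once and are dropped, leaving 'vending' with 3 and 'text' with 2. Note that array_diff() removes exact counts rather than everything below a threshold, so a cut-off such as "fewer than 3" might be clearer with array_filter() and a comparison.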
Submitted by Gerhard Racter (not verified) on Thu, 10/27/2011 - 16:49 Permalink
I changed line 13 to not include 'keywords' that are just numbers, using the is_numeric() function:
if ( $item == '' || in_array($item, $stopWords) || strlen($item) <= 3 || is_numeric($item) ) {
Submitted by HSMoore (not verified) on Tue, 11/08/2011 - 10:43 Permalink
Thanks, this is quite useful!
Submitted by Anonymous (not verified) on Thu, 05/03/2012 - 16:23 Permalink
Make sure this line:
reads like this:
Submitted by wookiester (not verified) on Fri, 05/04/2012 - 11:53 Permalink
This script just returns a string of comma separated keywords and supports multibyte characters.
Adjust your stopwords for your locale.
Submitted by wookiester (not verified) on Fri, 05/04/2012 - 12:11 Permalink
Whoops.. Ignore my previous code block.. it was bugged.. try this.
Submitted by Mohamed saad (not verified) on Thu, 05/17/2012 - 16:51 Permalink
Please help me.
How can I use this for Arabic text such as "اب ت ث "?
thank you
Submitted by Bob Davies (not verified) on Tue, 08/14/2012 - 08:14 Permalink
$string = preg_replace('/[^a-zA-Z0-9 -]/', '', $string);
with this:
$string = preg_replace('/[^\w\d -]/', '', $string);
It should support extended characters, not sure off the top of my head how much unicode is covered by \w (if any), but it works for a bunch of quick tests I've put through it.
Submitted by summsel (not verified) on Wed, 11/07/2012 - 17:57 Permalink
preg_replace('/[ ]{2,}/sm', ' ', $text)
Submitted by weter (not verified) on Tue, 01/01/2013 - 07:36 Permalink
I see that if $wordCountArr contains a number (e.g. a year), the key returned for it is 0.
To return the correct value you must change
$wordCountArr = array_slice($wordCountArr, 0, 10);
to
$wordCountArr = array_slice($wordCountArr, 0, 10, true);
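A short sketch of why this happens, using made-up counts: PHP stores a numeric word such as "2009" as an integer key, and array_slice() renumbers integer keys unless its fourth argument (preserve_keys) is true.

```php
<?php
// Sample counts; the numeric word "2009" is stored as the integer key 2009.
$wordCountArr = array('2009' => 3, 'vending' => 2, 'machines' => 1);

// Default behaviour: integer keys are renumbered from 0, string keys are kept.
$renumbered = array_keys(array_slice($wordCountArr, 0, 10));

// With preserve_keys set to true the original integer key survives.
$preserved = array_keys(array_slice($wordCountArr, 0, 10, true));

print_r($renumbered); // keys: 0, vending, machines
print_r($preserved);  // keys: 2009, vending, machines
```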