Spelling Dictionary Dataset with aspell and pspell

Misspelling words is as common as chavs in Romford. An intuitive, almost omniscient website can make a user's experience very .... kind!

Aspell and Pspell are a neat collection of PHP functions which, whilst being in the shadows, are very enlightening.

Installing Process

Download

The standard distribution of PHP does not include the modules. For Windows, download the Aspell library (Aspell-0-50-3-3-Setup) and a dictionary for each language you want (for English, Aspell-en-0.50-2-3.exe). See the references at the bottom of this page for more sources.

Install

Follow these instructions from oblius.com

3. Copy ASpell DLL's to System32

Browse to the /bin/ folder in the ASpell program directory (default is C:\Program Files\ASpell\bin) and copy the aspell-15.dll to your %Win32%/System32 folder (i.e. like you would for files like php4apache2.dll, etc.) NOTE: In PHP 5.x you can probably copy aspell-15.dll to the root PHP folder as well, but for compatibility with PHP 4.x and 5.x, use %Win32%/System32/.

4. Edit php.ini and enable the pspell extension

Open up your php.ini and find the block where you enable extensions. I've noticed that in both PHP 4.3.x and 5.0.x, the pspell extension DLL isn't in the list of commented-out extensions. However, the DLL file is provided with the PHP package and should be found in your extensions folder under "php_pspell.dll". Thus, you must manually add a line to load this extension:

extension = php_pspell.dll

5. Restart Apache

Restart Apache and verify that it came up properly. It should, but if not, double-check your spelling of the extension in php.ini and that you've copied the aspell DLL files to the System32 folder.

from oblius.com
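
Once Apache is back up, a quick check can confirm that the extension really loaded. This is a minimal sketch, not part of the oblius.com instructions: pspell_status() is a made-up helper name, and "en" assumes the English dictionary from the download step is installed.

```php
<?php
// Returns a short message describing whether pspell is usable.
// pspell_status() is a hypothetical helper written for this check.
function pspell_status()
{
    if (!extension_loaded('pspell')) {
        return 'pspell extension is not loaded; check the extension line in php.ini';
    }
    // "en" assumes the English dictionary was installed alongside Aspell.
    $link = @pspell_new('en');
    if ($link === false) {
        return 'pspell is loaded, but the "en" dictionary could not be opened';
    }
    return 'pspell is loaded and the "en" dictionary opened';
}

echo pspell_status(), "\n";
```

If the first message appears, the problem is in php.ini; if the second appears, the extension is fine and the Aspell DLL or dictionary files are the more likely culprit.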

Try it out



// And you should see in your browser

// harry potter and the philosophers stone
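
The listing that produced this output is missing from this copy of the page. What follows is a minimal sketch in the same spirit, not the original code: it splits the input on spaces and swaps each unrecognised word for Pspell's first suggestion. The function name correct_spelling() and the misspelt input string are assumptions, and the exact suggestions depend on the dictionary you installed.

```php
<?php
// Correct each misspelt word in $text using the first Pspell suggestion.
// correct_spelling() is a hypothetical name, not a pspell function.
function correct_spelling($link, $text)
{
    $corrected = array();
    foreach (explode(' ', $text) as $word) {
        if ($word === '' || pspell_check($link, $word)) {
            $corrected[] = $word; // empty piece or known word: keep as typed
            continue;
        }
        $suggestions = pspell_suggest($link, $word);
        // Fall back to the original word when there is nothing to suggest
        $corrected[] = empty($suggestions) ? $word : $suggestions[0];
    }
    return implode(' ', $corrected);
}

// Only run the demo when the extension and the "en" dictionary are available.
if (function_exists('pspell_new') && ($link = @pspell_new('en')) !== false) {
    // Assumed misspelt input, chosen to resemble the output quoted above
    echo correct_spelling($link, 'hary poter and the philosofers stone'), "\n";
}
```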

Spelling Variations "Hitchhikers" vs "Hitch hikers"

Searching on Amazon, I found that their search treats "Hitchhikers" and "Hitch hikers" interchangeably and gives the same result: matches that include both spellings of the title.
The code I've given above splits the string on spaces and finds suggestions for each word that is not spelt correctly. However, "hitchhikers", "hitch" and "hikers" are all recognised words. So how do we account for the variations?

Given the concatenated word "hitchhikers", Pspell returns its sub-word parts "hitch" and "hikers" amongst its other suggestions.
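
A call along these lines would list the suggestions for a single word. It is a sketch only: suggestions_for() is a made-up wrapper, "en" is assumed to be the installed dictionary, and the order and content of the list vary with the dictionary version.

```php
<?php
// Return Pspell's suggestions for one word, or an empty array when
// the extension or the dictionary is unavailable.
// suggestions_for() is a hypothetical wrapper around pspell_suggest().
function suggestions_for($word)
{
    if (!function_exists('pspell_new')) {
        return array();
    }
    $link = @pspell_new('en');
    if ($link === false) {
        return array();
    }
    $suggestions = pspell_suggest($link, $word);
    return is_array($suggestions) ? $suggestions : array();
}

print_r(suggestions_for('hitchhikers'));
```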

Output

Array
(
    [0] => hitchhikers
    [1] => hitchhikes
    [2] => hitch hikers
    [3] => hitch-hikers
    [4] => hitchhike rs
    [5] => hitchhike-rs
    [6] => hitchhiker
    [7] => hitchhiked
    [8] => hitchhike
)

Using the array results above and MySQL's MATCH(field(s)) AGAINST('expression' IN BOOLEAN MODE) syntax, my query could look like...
SELECT ISBN, title_search, title, authors
FROM book
WHERE MATCH(title_search) AGAINST ('+(
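
The query is cut off at this point in this copy of the page, so the rest of the expression is not known. As a separate illustration (not a reconstruction of the original), here is one way to turn a list of variants into a single boolean-mode group. boolean_group() is a made-up name; quoting multi-word variants as phrases and reducing hyphens to spaces are my own choices, made because "-" is an operator in boolean mode.

```php
<?php
// Turn a list of spelling variants into one MySQL boolean-mode group,
// e.g. +(word "two words"). boolean_group() is a hypothetical helper.
function boolean_group($variants)
{
    $terms = array();
    foreach ($variants as $variant) {
        // Reduce hyphens and other punctuation to spaces, since several
        // punctuation characters act as operators in boolean mode
        $clean = trim(preg_replace('/[^a-z0-9]+/i', ' ', $variant));
        if ($clean === '') {
            continue;
        }
        // Multi-word variants become quoted phrases
        $term = (strpos($clean, ' ') !== false) ? '"' . $clean . '"' : $clean;
        $terms[$term] = true; // array keys de-duplicate
    }
    return '+(' . implode(' ', array_keys($terms)) . ')';
}

// prints: +(hitchhikers "hitch hikers")
echo boolean_group(array('hitchhikers', 'hitch hikers', 'hitch-hikers')), "\n";
```

The resulting string still has to be escaped or bound as a parameter before it goes into the AGAINST() clause.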

The above method works fine for splitting words into their core parts, and you can add a substring match to stop wacky suggestions cropping up and obscuring the database search. The reverse does not work, however: given the separate words "Hitch Hikers", Pspell will not know to concatenate "Hitch" and "Hikers", so a similar query built from them will not return results that contain the concatenated version "Hitchhikers".
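
The check against wacky suggestions could take several forms. One reading, sketched here under my own assumptions, keeps only suggestions that are the original word with spaces or hyphens inserted. split_variants() and the minimum part length of 3 are not from the original; the length limit is there to drop splits such as "hitchhike rs".

```php
<?php
// Keep only suggestions that are $word broken into parts by spaces or hyphens.
// split_variants() and $minPart are hypothetical, introduced for this sketch.
function split_variants($word, $suggestions, $minPart = 3)
{
    $kept = array();
    foreach ($suggestions as $suggestion) {
        $parts = preg_split('/[ -]+/', $suggestion, -1, PREG_SPLIT_NO_EMPTY);
        if (count($parts) < 2 || strcasecmp(implode('', $parts), $word) !== 0) {
            continue; // not a split of the original word
        }
        if (min(array_map('strlen', $parts)) < $minPart) {
            continue; // a fragment is too short, e.g. "hitchhike rs"
        }
        $kept[] = $suggestion;
    }
    return $kept;
}

$suggestions = array('hitchhikers', 'hitchhikes', 'hitch hikers', 'hitch-hikers',
                     'hitchhike rs', 'hitchhike-rs', 'hitchhiker');
// Keeps "hitch hikers" and "hitch-hikers"
print_r(split_variants('hitchhikers', $suggestions));
```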

The solution is relatively simple, though it puts the burden on the database:
include in your title_search field the sub-words of long concatenated words.

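<?php
// NOTE: the opening of this listing was lost. Everything up to GetAll() is an
// assumed reconstruction: $conn is taken to be an ADOdb connection, and the
// batch size $X, the counter $COUNT and the SELECT are inferred from the loop below.
$X = 1000;
$COUNT = 0;
$bool = true;

while($bool){
	// Fetch the next batch of titles (assumed query; adjust the LIMIT syntax to your database)
	$q = "SELECT isbn, title FROM book_temp LIMIT $COUNT, $X";
	$r = $conn->GetAll($q);

	foreach($r as $row){
		// Bind the values instead of concatenating them, so a quote in a title cannot break the query
		$q = "UPDATE book_temp SET title_phon = ? WHERE isbn = ?";
		$conn->Execute($q, array(new_title_search($row['title']), $row['isbn']));
		$COUNT++;
	}

	// A short batch means we have reached the end
	if(count($r) < $X){
		$bool = false;
	}
}
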
function new_title_search($title){
	// Strip punctuation; the first two entries also drop a trailing 's or s from each word
	$string = str_replace(array('\'s ', 's ','<','>',',','.','?','/',':',';','@','\'','~','#','{','[','}',
		']','|','\\','!','"','£','$','%','^','&','*','(',')','_','+','-','='), ' ', $title.' ');

	// Split into words, discarding empty pieces
	$words = preg_split('/\W+/', $string, -1, PREG_SPLIT_NO_EMPTY);

	$return = array();

	// Connect to the English dictionary
	$int = pspell_new("en", "", "", "", (PSPELL_FAST|PSPELL_RUN_TOGETHER));
	if($int === false){
		// No dictionary available: return the stripped words unchanged
		return implode(' ', $words);
	}

	// Iterate through each word
	foreach($words as $value){
		// Only words longer than six characters are considered for splitting
		if(strlen($value) > 6){
			// Get the list of alternatives
			$c = pspell_suggest($int, $value);
			foreach(($c ? $c : array()) as $suggest){
				// Keep a suggestion only if it is the same word with spaces inserted
				if($value != $suggest && str_replace(' ', '', $suggest) == $value){
					foreach(preg_split('/\W+/', $suggest, -1, PREG_SPLIT_NO_EMPTY) as $d){
						$return[] = $d;
					}
				}
			}
		}
		// The original word always goes in after its sub-words
		$return[] = $value;
	}

	return implode(' ', $return);
}
?>

So records that include "Hitchhikers" will also have the sub-words "Hitch" and "hikers" appended in the title_search field. Your users will be able to find what they're looking for however it's spelt.

More results are better!

Running in CLI

I had a problem when running from the CLI, and I've read that similar problems have occurred when running under Apache. I got an error that said Faulting iso8859-1.dat. I found the problem could be solved on Windows by converting the file to Unix line endings. Go to http://www.iconv.com/dos2unix.htm and use the tool to convert the files

aspell/data/iso8859-1.dat
and
aspell/data/standard.kbd

and you shouldn't have any more complaints in the CLI.
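
If you would rather not download a tool, the same conversion can be sketched in a few lines of PHP. This is an assumption-laden illustration: to_unix_line_endings() is a made-up helper, and the $dataDir path is a guess at a default install location that you should adjust. Back up both files before running it.

```php
<?php
// Rewrite a file in place with Unix (LF) line endings; returns true on success.
// to_unix_line_endings() is a hypothetical helper, not part of Aspell or PHP.
function to_unix_line_endings($path)
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        return false;
    }
    // Replace each Windows CRLF pair with a single LF
    $converted = str_replace("\r\n", "\n", $contents);
    return file_put_contents($path, $converted) !== false;
}

// Assumed install location; change this to wherever Aspell lives on your machine.
$dataDir = 'C:/Program Files/Aspell/data';
foreach (array('iso8859-1.dat', 'standard.kbd') as $file) {
    if (is_file("$dataDir/$file")) {
        to_unix_line_endings("$dataDir/$file");
    }
}
```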

External References

Author

Andrew Dodson
Since:Feb 2007
