Re: Text similarity

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Mon, 28 Sep 2009 11:37:28 +0100

On Mon, 2009-09-28 at 12:27 +0200, Merlin Morgenstern wrote:
> Hi there,
> 
> I am trying to find out similarity between 2 strings. Somehow the 
> similar_text function returns 33% similarity on strings that are not 
> even close and on the other hand it returns 21% on strings that have a 
> matching word.
> 
> E.G:
> 
> 'gemütliche sofas'
> 
> Wohngemeinschaften - similarity: 33.333333333333
> Sofas & Sessel - similarity: 31.25
> 
> I am using this code:
> similar_text($data['txt'], $categories[$i], $similarity);
> 
> Does anybody have an idea why it gives back 33% similarity on the first 
> string?
> 
> Thank you for any help,
> 
> Merlin
> 

If you think about it, it makes sense.

Taking your three strings above, 'Wohngemeinschaften' shares characters
with the source string close to its start (you only have to go 4
characters in before 'gem' matches), whereas 'sofas' doesn't appear in
the source string until the 12th character in. Also, both test strings
seem to have the same number of characters that match in order, although
the ones that match in 'Wohngemeinschaften' are separated by characters
that do not match, so I'm not sure what bearing this will have.
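
To make the matching a bit more concrete, here is a rough sketch of how
I understand similar_text() to count matching characters: find the
longest common substring, then repeat on the pieces to its left and to
its right. The function name sketch_similar_chars() is made up for
illustration and this is not the actual C implementation, so treat the
counts as indicative only.

```php
<?php
// Hypothetical re-implementation for illustration; not PHP's own source.
// Counts matching bytes by recursing around the longest common substring.
function sketch_similar_chars(string $a, string $b): int
{
    $lenA = strlen($a);
    $lenB = strlen($b);
    $best = 0;
    $posA = 0;
    $posB = 0;

    // Find the first longest run of identical bytes shared by both strings.
    for ($i = 0; $i < $lenA; $i++) {
        for ($j = 0; $j < $lenB; $j++) {
            $k = 0;
            while ($i + $k < $lenA && $j + $k < $lenB && $a[$i + $k] === $b[$j + $k]) {
                $k++;
            }
            if ($k > $best) {
                $best = $k;
                $posA = $i;
                $posB = $j;
            }
        }
    }

    if ($best === 0) {
        return 0;
    }

    // Add whatever matches to the left and to the right of that run.
    return $best
        + sketch_similar_chars((string) substr($a, 0, $posA), (string) substr($b, 0, $posB))
        + sketch_similar_chars((string) substr($a, $posA + $best), (string) substr($b, $posB + $best));
}

// Print the counts for the strings from this thread (results depend on
// the encoding the strings are stored in, since bytes are compared).
$source = 'gemütliche sofas';
foreach (['Wohngemeinschaften', 'Sofas & Sessel'] as $category) {
    printf("%s: %d\n", $category, sketch_similar_chars($source, $category));
}
```

Note that the comparison is byte-wise and case-sensitive, so the 'S' in
'Sofas' does not match the 's' in 'sofas'. Lowercasing both strings
before comparing may be worth a try.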

As noted on the manual page for this function, the similar_text()
function compares without regard to string length, and tends to be
accurate enough only for larger excerpts of text.
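
For what it's worth, as far as I understand it the percentage is derived
as twice the matched count divided by the combined byte length of both
strings. With short strings, one extra matched character shifts the
result by several points, which fits the point about larger excerpts.
The helper name sketch_similarity_percent() below is made up for
illustration; only similar_text() and strlen() are real PHP functions.

```php
<?php
// Hypothetical helper: recompute the percentage from the matched count.
// Assumes percent = matched * 2 * 100 / (strlen($a) + strlen($b)).
function sketch_similarity_percent(string $a, string $b): float
{
    $total = strlen($a) + strlen($b);
    if ($total === 0) {
        return 0.0; // nothing to compare
    }

    $matched = similar_text($a, $b); // number of matching bytes
    return $matched * 2 * 100 / $total;
}

// Compare the recomputed value with the one similar_text() reports.
similar_text('World', 'Word', $builtin);
printf(
    "built-in: %.4f, sketch: %.4f\n",
    $builtin,
    sketch_similarity_percent('World', 'Word')
);
```

If the two numbers agree for your own strings, you can work backwards
from a percentage to see how many characters were actually counted as
matching.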

Thanks,
Ash
http://www.ashleysheridan.co.uk

-- 
PHP General Mailing List (http://www.php.net/)