Re: Text similarity

Tom Worster <fsb@xxxxxxxxxx> · Mon, 28 Sep 2009 13:25:52 -0400

On 9/28/09 7:07 AM, "Merlin Morgenstern" <merlin_x@xxxxxxxxxxx> wrote:

> 
> 
> Ashley Sheridan wrote:
>> On Mon, 2009-09-28 at 12:27 +0200, Merlin Morgenstern wrote:
>>> Hi there,
>>> 
>>> I am trying to find out similarity between 2 strings. Somehow the
>>> similar_text function returns 33% similarity on strings that are not
>>> even close and on the other hand it returns 21% on strings that have a
>>> matching word.
>>> 
>>> E.G:
>>> 
>>> 'gemütliche sofas'
>>> 
>>> Wohngemeinschaften - similarity: 33.333333333333
>>> Sofas & Sessel - similarity: 31.25
>>> 
>>> I am using this code:
>>> similar_text($data['txt'], $categories[$i], $similarity);
>>> 
>>> Does anybody have an idea why it gives back 33% similarity on the first
>>> string?
>>> 
>>> Thank you for any help,
>>> 
>>> Merlin
>>> 
>> 
>> If you think about it, it makes sense.
>> 
>> Taking your three strings above, 'Wohngemeinschaften' has more
>> characters similar towards the start of the string (you only have to go
>> 4 characters in to start a match) whereas 'sofas' won't match the source
>> string until the 12th character in. Also, both test strings have the same
>> number of characters that match in order, although the ones that match
>> in 'Wohngemeinschaften' are separated by characters that do not match,
>> so I'm not sure what bearing this will have.
>> 
>> As noted on the manual page for this function, the similar_text()
>> function compares without regard to string length, and tends to only
>> really be accurate enough for larger excerpts of text.
>> 
>> Thanks,
>> Ash
>> http://www.ashleysheridan.co.uk
>> 
>> 
>> 
> 
> Sounds logical. Is there another function you suggest? I guess this is a
> standard problem I am having here. I tried it with levenshtein(), but
> got similar results.
> 
> e.g. levenshtein (smaller = better):
> Search for : Stellplatz für Wohnwagen gesucht
> Stereoanlagen : 23
> Wohnwagen, -mobile : 24
> Sonstiges für Baby & Kind - : 25
> Steuer & Finanzen - : 25
> 
> How come Stereoanlagen and the others show up here?
> 
> Any idea how I could make this more accurate?
> 
> Thank you for any help, Merlin

as ashley pointed out, it's not a trivial problem.

if you are performing the tests against strings in a db table then a full
text index might help. see, e.g.:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

you could also check out the php sphinx client
http://us3.php.net/manual/en/book.sphinx.php

if you are writing your own solutions and using utf8, take care with
similar_text() or levenshtein(). i don't think they are designed for
multibyte strings: they compare bytes, not characters, so if you are using
utf8 they will probably report bigger differences than you might expect.
i wrote my own limited damerau-levenshtein function for utf8.
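
to illustrate the byte issue, here is a minimal sketch. it is not the
function i mentioned above: it is plain levenshtein over utf8 characters,
without the damerau transposition step. utf8_levenshtein() is a hypothetical
name, and the code assumes php 7 or later and valid utf8 input.

```php
<?php
// hypothetical helper: levenshtein distance counted in utf8 characters
// rather than bytes (no transpositions, so not damerau-levenshtein).
function utf8_levenshtein(string $a, string $b): int
{
    // the /u flag makes pcre split on utf8 characters instead of bytes
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    if ($s === false || $t === false) {
        throw new InvalidArgumentException('input is not valid utf8');
    }
    $m = count($s);
    $n = count($t);

    // $prev[$j] = distance from the empty string to the first $j chars of $b
    $prev = range(0, $n);
    for ($i = 1; $i <= $m; $i++) {
        $cur = [$i];
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $cur[$j] = min(
                $prev[$j] + 1,         // deletion
                $cur[$j - 1] + 1,      // insertion
                $prev[$j - 1] + $cost  // substitution
            );
        }
        $prev = $cur;
    }
    return $prev[$n];
}

// "\u{00FC}" is u-umlaut, encoded as two bytes in utf8
echo levenshtein("f\u{00FC}r", 'fur'), "\n";      // 2 (byte-based)
echo utf8_levenshtein("f\u{00FC}r", 'fur'), "\n"; // 1 (character-based)
```

the built-in sees the umlaut as two bytes, so one changed character costs
two edits. on long strings with many umlauts that adds up.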

even if you're using a single byte encoding, i would guess they ignore a
locale's collation. so say you set a german locale, ü will be regarded as
different from both u and ue. again, if you are searching against
strings in a db table, the dbms may understand collations properly.
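
if the comparison has to stay in php, one crude workaround is to fold the
strings yourself before comparing them. a minimal sketch, assuming utf8
input and the mbstring extension; fold_german() is a hypothetical name, and
mapping ü to ue (rather than u) is just one of the possible conventions, not
a real collation:

```php
<?php
// hypothetical helper: lowercase the string and rewrite german umlauts and
// eszett to ascii, so that u-umlaut and 'ue' compare as equal.
function fold_german(string $s): string
{
    $s = mb_strtolower($s, 'UTF-8');
    return strtr($s, [
        "\u{00E4}" => 'ae', // a-umlaut
        "\u{00F6}" => 'oe', // o-umlaut
        "\u{00FC}" => 'ue', // u-umlaut
        "\u{00DF}" => 'ss', // eszett
    ]);
}

$a = fold_german("gem\u{00FC}tliche Sofas");
$b = fold_german('GEMUETLICHE SOFAS');
var_dump($a === $b);            // bool(true)
echo levenshtein($a, $b), "\n"; // 0
```

after folding, both strings are plain ascii, so the byte-based functions are
safe to use on them.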

-- 
PHP General Mailing List (http://www.php.net/)