On 9/28/09 7:07 AM, "Merlin Morgenstern" <merlin_x@xxxxxxxxxxx> wrote:

> Ashley Sheridan wrote:
>> On Mon, 2009-09-28 at 12:27 +0200, Merlin Morgenstern wrote:
>>> Hi there,
>>>
>>> I am trying to find out similarity between 2 strings. Somehow the
>>> similar_text function returns 33% similarity on strings that are not
>>> even close and on the other hand it returns 21% on strings that have a
>>> matching word.
>>>
>>> E.G:
>>>
>>> 'gemütliche sofas'
>>>
>>> Wohngemeinschaften - similarity: 33.333333333333
>>> Sofas & Sessel - similarity: 31.25
>>>
>>> I am using this code:
>>> similar_text($data[txt], $categories[$i], $similarity);
>>>
>>> Does anybody have an idea why it gives back 33% similarity on the first
>>> string?
>>>
>>> Thank you for any help,
>>>
>>> Merlin
>>>
>>
>> If you think about it, it makes sense.
>>
>> Taking your three sentences above, 'Wohngemeinschaften' has more
>> characters similar towards the start of the string (you only have to go
>> 4 characters in to start a match) whereas 'sofas' won't match the source
>> string until the 12th character in. Also, both test strings have the same
>> number of characters that match in order, although the ones that match
>> in 'Wohngemeinschaften' are separated by characters that do not match,
>> so I'm not sure what bearing this will have.
>>
>> As noted on the manual page for this function, the similar_text()
>> function compares without regard to string length, and tends to only
>> really be accurate enough for larger excerpts of text.
>>
>> Thanks,
>> Ash
>> http://www.ashleysheridan.co.uk
>>
>
> Sounds logical. Is there another function you suggest? I guess this is a
> standard problem I am having here. I tried it with levenshtein, but
> similar results.
>
> e.g. levenshtein (smaller = better):
> Search for : Stellplatz für Wohnwagen gesucht
> Stereoanlagen : 23
> Wohnwagen, -mobile : 24
> Sonstiges für Baby & Kind - : 25
> Steuer & Finanzen - : 25
>
> How come Stereoanlagen and the others show up here?
>
> Any idea how I could make this more accurate?
>
> Thank you for any help, Merlin

as ashley pointed out, it's not a trivial problem.

if you are performing the tests against strings in a db table then a full
text index might help. see, e.g.:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

you could also check out the php sphinx client:
http://us3.php.net/manual/en/book.sphinx.php

if you are writing your own solution and using utf8, take care with
similar_text() or levenshtein(). i don't think they are designed for
multibyte strings: they compare byte by byte, so with utf8 they will
probably report bigger differences than you might expect. i wrote my own
limited damerau-levenshtein function for utf8.

even if you're using a single byte encoding, i would guess they ignore a
locale's collation. so say you set a german locale, ü will be regarded as
different from both u and ue.

again, if you are searching against strings in a db table, the dbms may
understand collations properly.
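to illustrate the multibyte point, here is a minimal sketch of a plain levenshtein distance that counts utf8 characters instead of bytes. it is not the damerau-levenshtein function mentioned above, just an illustration: the name utf8_levenshtein() is made up for this example, it assumes both inputs are valid utf8, and it does nothing about collation (ü and ue still count as different).

```php
<?php
// Hypothetical helper, not part of PHP: Levenshtein distance measured in
// UTF-8 characters rather than bytes. Assumes both inputs are valid UTF-8.
function utf8_levenshtein($a, $b)
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so an empty
    // pattern splits the string into whole characters.
    $s = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $t = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);
    $n = count($t);

    // $prev[$j] holds the distance between the characters of $s processed
    // so far and the first $j characters of $t.
    $prev = range(0, $n);
    foreach ($s as $i => $ch) {
        $curr = array($i + 1);
        for ($j = 0; $j < $n; $j++) {
            $cost = ($ch === $t[$j]) ? 0 : 1;
            $curr[] = min(
                $prev[$j + 1] + 1,   // deletion
                $curr[$j] + 1,       // insertion
                $prev[$j] + $cost    // substitution (or match)
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// "\xC3\xBC" is the UTF-8 byte sequence for ü, written out so the result
// does not depend on the encoding of this source file.
echo levenshtein("f\xC3\xBCr", "fur"), "\n";      // byte-wise: 2
echo utf8_levenshtein("f\xC3\xBCr", "fur"), "\n"; // character-wise: 1
```

the built-in counts the two bytes of ü as two edits, while the character-wise version counts one. that narrows the gap a little, but it will not fix the ranking in the example above on its own, since edit distance over whole strings still rewards similar overall length more than a shared word.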