Re: Text similarity

Ashley Sheridan <ash@xxxxxxxxxxxxxxxxxxxx> · Mon, 28 Sep 2009 12:18:33 +0100

On Mon, 2009-09-28 at 13:07 +0200, Merlin Morgenstern wrote:
> 
> Ashley Sheridan wrote:
> > On Mon, 2009-09-28 at 12:27 +0200, Merlin Morgenstern wrote:
> >> Hi there,
> >>
> >> I am trying to find out similarity between 2 strings. Somehow the 
> >> similar_text function returns 33% similarity on strings that are not 
> >> even close and on the other hand it returns 21% on strings that have a 
> >> matching word.
> >>
> >> E.G:
> >>
> >> 'gemütliche sofas'
> >>
> >> Wohngemeinschaften - similarity: 33.333333333333
> >> Sofas & Sessel - similarity: 31.25
> >>
> >> I am using this code:
> >> similar_text($data['txt'], $categories[$i], $similarity);
> >>
> >> Does anybody have an idea why it gives back 33% similarity on the first 
> >> string?
> >>
> >> Thank you for any help,
> >>
> >> Merlin
> >>
> > 
> > If you think about it, it makes sense.
> > 
> > Taking your three sentences above, 'Wohngemeinschaften' has more
> > characters similar towards the start of the string (you only have to go
> > 4 characters in to start a match) whereas 'sofas' won't match the source
> > string until the 12th character in. Also, both test strings have the same
> > number of characters that match in order, although the ones that match
> > in 'Wohngemeinschaften' are separated by characters that do not match,
> > so I'm not sure what bearing this will have.
> > 
> > As noted on the manual page for this function, the similar_text()
> > function compares without regard to string length, and tends to only
> > really be accurate enough for larger excerpts of text.
> > 
> > Thanks,
> > Ash
> > http://www.ashleysheridan.co.uk
> > 
> > 
> > 
> 
> Sounds logical. Is there another function you suggest? I guess this is a 
> standard problem I am having here. I tried it with levenshtein(), but got
> similar results.
> 
> e.g. levenshtein (smaller = better):
> Search for : Stellplatz für Wohnwagen gesucht
> Stereoanlagen : 23
> Wohnwagen, -mobile : 24
> Sonstiges für Baby & Kind - : 25
> Steuer & Finanzen - : 25
> 
> How come Stereoanlagen and the others show up here?
> 
> Any idea how I could make this more accurate?
> 
> Thank you for any help, Merlin
> 

I'm guessing it's to do with the position of characters within the
string. You could roll your own function that does exactly what you
need.

Break down the lines into individual words.

Loop through the match string and see if the words exist within the
phrases you're searching in, and keep a count of all 'hits'. As you
loop, create a metaphone key (soundex might also help, but I think
metaphone will work fairly well for German) and check this against a
metaphone version of the phrases you're searching in. Keep a separate
count of metaphone matches.

At the end, any of the search phrases that have either type of count is
a match. Collate them, and order by solid matches (whole words) and then
by metaphone matches.
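
The steps above could be sketched roughly like this. It is only a
starting point, written for PHP 7 or later. The function names
tokenize() and rank_phrases() and the layout of the result array are
made up for illustration. I have also assumed that a word which already
matches exactly should not be counted again as a metaphone match, and
note that PHP's metaphone() is built around English rules and skips
non-ASCII bytes such as umlauts, so treat its keys for German as
approximate.

```php
<?php
// Split a phrase into lowercase words, keeping only letters and digits.
function tokenize(string $text): array
{
    // mb_strtolower() handles umlauts; fall back to strtolower() without mbstring.
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($text, 'UTF-8')
        : strtolower($text);

    // preg_split() returns false on invalid UTF-8, so fall back to an empty list.
    return preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

// Count whole-word hits and metaphone hits for each candidate phrase,
// drop candidates with neither, and order by word hits, then metaphone hits.
function rank_phrases(string $search, array $candidates): array
{
    $searchWords = array_unique(tokenize($search));
    $results = [];

    foreach ($candidates as $candidate) {
        $words = tokenize($candidate);
        // Metaphone keys of the candidate's words; 'strlen' drops empty keys
        // but keeps "0", which metaphone() uses for the "th" sound.
        $keys = array_filter(array_map('metaphone', $words), 'strlen');

        $wordHits = 0;
        $metaHits = 0;
        foreach ($searchWords as $word) {
            if (in_array($word, $words, true)) {
                $wordHits++;      // solid match on the whole word
                continue;         // assumption: do not count it twice
            }
            $key = metaphone($word);
            if ($key !== '' && in_array($key, $keys, true)) {
                $metaHits++;      // sounds alike, but spelled differently
            }
        }

        if ($wordHits > 0 || $metaHits > 0) {
            $results[] = [
                'phrase'    => $candidate,
                'words'     => $wordHits,
                'metaphone' => $metaHits,
            ];
        }
    }

    // Highest word count first; metaphone count breaks ties.
    usort($results, function (array $a, array $b): int {
        return [$b['words'], $b['metaphone']] <=> [$a['words'], $a['metaphone']];
    });

    return $results;
}

// Usage with the category names from the earlier message.
$ranked = rank_phrases('Stellplatz für Wohnwagen gesucht', [
    'Stereoanlagen',
    'Wohnwagen, -mobile',
    'Sonstiges für Baby & Kind',
    'Steuer & Finanzen',
]);
print_r($ranked);
```

One thing this shows up straight away: short common words such as
'für' count as solid hits too, so you would probably want to skip a
list of stop words before counting.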

This is very simplified, and will need a lot of tweaking to get it
right, but it might be somewhere to start?

Thanks,
Ash
http://www.ashleysheridan.co.uk

-- 
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php