
Consider Spaces in pg_trgm for Better Similarity


Is there a way to make pg_trgm take white space between words into account when generating trigrams? That would allow for better matching of phrases.

For example, "one two three" and "three two one" currently generate the same set of trigrams ({"  o","  t"," on"," th"," tw","ee ","hre","ne ","one","ree","thr","two","wo "}), so the similarity of "one two four" to each of them is the same. The query:

SELECT   phrase
        ,input
        ,similarity(t1.phrase, t2.input)
        ,word_similarity(t1.phrase, t2.input)
FROM      (values('one two three'),('three two one')) t1(phrase)
        ,(values('one two four')) t2(input);

Returns:

phrase        |input        |similarity  |word_similarity |
--------------|-------------|------------|----------------|
one two three |one two four |0.444444448 |0.615384638     |
three two one |one two four |0.444444448 |0.615384638     |

But surely "one two four" is more similar to "one two three" than to "three two one".
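
The trigram sets themselves can be inspected with show_trgm, which ships with pg_trgm. The query below is only a sketch for checking the claim above; it assumes the extension is already installed in the current database, and the column aliases are just illustrative names:

```sql
-- Assumes pg_trgm is available: CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- Lists the trigrams extracted from each phrase so the sets can be compared by eye.
SELECT   phrase
        ,show_trgm(phrase) AS trigrams
FROM      (values('one two three'),('three two one'),('one two four')) t(phrase);

-- Compares the two reordered phrases directly; the result is true
-- when both phrases produce the same trigram array.
SELECT   show_trgm('one two three') = show_trgm('three two one') AS same_trigrams;
```

Each word is padded and split on its own, so no trigram spans the space between two words, and the order of the words leaves no trace in the set.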

Any thoughts?

Igal Sapir
Lucee Core Developer
Lucee.org

