Consider Spaces in pg_trgm for Better Similarity

"Igal @ Lucee.org" <igal@xxxxxxxxx> · Sun, 28 Jan 2018 21:56:26 -0800



    Is there a way to take white space into account in tri-grams?  That
      would allow for better matching of phrases.

    
    For example, currently "one two three" and "three two one"
      generate the same tri-grams ({"  o","  t"," on"," th"," tw","ee ",
      "hre","ne ",one,ree,thr,two,"wo "}), so "one two four" scores
      the same against both of them.  The query:
    SELECT   phrase
            ,input
            ,similarity(t1.phrase, t2.input)
            ,word_similarity(t1.phrase, t2.input)
    FROM     (values('one two three'),('three two one')) t1(phrase)
            ,(values('one two four')) t2(input);

    
    Returns:
    phrase        |input        |similarity  |word_similarity |
    --------------|-------------|------------|----------------|
    one two three |one two four |0.444444448 |0.615384638     |
    three two one |one two four |0.444444448 |0.615384638     |

    
    But surely "one two four" is more similar to "one two three" than
      to "three two one".
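
To show why the two phrases come out identical, here is a minimal Python sketch that approximates how pg_trgm appears to build tri-grams: each word is padded with two leading spaces and one trailing space, and the result is an unordered set. The function names `trigrams` and `similarity` below are my own helpers for illustration, not the extension's code, and the padding rule is inferred from the tri-gram set listed above.

```python
import re

def trigrams(text):
    """Approximate pg_trgm tri-gram extraction (an assumption, not the real code).

    Each alphanumeric word is lowercased and padded with two leading spaces
    and one trailing space; every 3-character window goes into a set.
    """
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams

def similarity(a, b):
    """Shared tri-grams divided by the size of the union of both sets."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Word order is lost because tri-grams never span a word boundary.
print(trigrams("one two three") == trigrams("three two one"))
print(similarity("one two three", "one two four"))
print(similarity("three two one", "one two four"))
```

Because every word is padded on its own, no tri-gram ever contains the end of one word and the start of the next, so the sets carry no information about word order. With 8 shared tri-grams out of 18 distinct ones, the sketch gives 8/18 for both phrases, which agrees with the 0.444444448 in the similarity column.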

    
    Any thoughts?

      
    Igal Sapir
    Lucee Core Developer
    Lucee.org