Search Postgresql Archives

Re: [pg_trgm] Making similarity(?, ?) < ? use an index

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





2016-06-11 13:47 GMT+03:00 Greg Navis <contact@xxxxxxxxxxxxx>:
I made some progress but I'm stuck. I'm focused on GiST for now. Please ignore sloppy naming for now.

I made the following changes to pg_trgm--1.2.sql:

CREATE TYPE pg_trgm_match AS (match TEXT, threshold REAL);
 
CREATE OR REPLACE FUNCTION trgm_check_match(string TEXT, match pg_trgm_match) RETURNS bool AS $$
BEGIN
    RETURN match.match <-> string <= 1 - match.threshold;
END;
$$ LANGUAGE plpgsql;

CREATE OPERATOR %%(leftarg = text, rightarg = pg_trgm_match, procedure=trgm_check_match);

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
               OPERATOR                9               %% (text, pg_trgm_match);

You can overload existing % operator:

ALTER OPERATOR FAMILY gist_trgm_ops USING gist ADD
OPERATOR        9       % (text, pg_trgm_match);
 

It does indeed make PostgreSQL complain about undefined strategy 9. I added the following define to trgm.h:

#define ThresholdStrategyNumber 9

It seems StrategyNumber is used in gtrgm_consistent and gtrgm_distance.

In gtrgm_consistent, I need change the way `nlimit` is obtained:

nlimit = (strategy == SimilarityStrategyNumber) ?
similarity_threshold : word_similarity_threshold;

I need to add a case for ThresholdStrategyNumber and extract `nlimit` from the argument of `pg_trgm_match`. I'm not sure what to do in `gtrgm_distance`.

My questions:

1a. Is it possible to make `gtrgm_consistent` accept `text` or `pg_trgm_match` as the second argument?

I think you can change definition of the gtrgm_consistent() in .sql file in CREATE FUNCTION and CREATE OPERATOR CLASS commands to:

gtrgm_consistent(internal,anynonarray,smallint,oid,internal)

But I do not sure that anynonarray is good here.
 
1b. What's the equivalent of `match.match` and `match.threshold` (where `match` is a `pg_trgm_match`) in C?

After changing the definition you can extract values from composite type in the gtrgm_consistent(). I think the code in the beginning of function may looks like this:

if (strategy == SimilarityStrategyNumber ||
strategy == WordSimilarityStrategyNumber)
{
query = PG_GETARG_TEXT_P(1);
nlimit = (strategy == SimilarityStrategyNumber) ?
similarity_threshold : word_similarity_threshold;
}
else if (strategy == ThresholdStrategyNumber)
{
HeapTupleHeader query_match = PG_GETARG_HEAPTUPLEHEADER(1);
Oid tupType = HeapTupleHeaderGetTypeId(query_match);
int32 tupTypmod = HeapTupleHeaderGetTypMod(query_match);
TupleDesc tupdesc = lookup_rowtype_tupdesc(tupType, tupTypmod);
HeapTupleData tuple;
bool isnull;

tuple.t_len = HeapTupleHeaderGetDatumLength(query_match);
ItemPointerSetInvalid(&(tuple.t_self));
tuple.t_tableOid = InvalidOid;
tuple.t_data = query_match;

query = DatumGetTextP(fastgetattr(&tuple, 1, tupdesc, &isnull));
nlimit = DatumGetFloat4(fastgetattr(&tuple, 2, tupdesc, &isnull));

ReleaseTupleDesc(tupdesc);
}
else
query = PG_GETARG_TEXT_P(1);

After this code you should execute the query using index:

select t,similarity(t,'qwertyu0988') as sml from test_trgm where t % row('qwertyu0988', 0.6)::pg_trgm_match;

I got the query from the regression test. And of course the code need to be checked for bugs.
 
2. What to do with `gtrgm_distance`?

You do not need to change gtrgm_distance(). It is used only in ORDER BY clause to calculate distances. To calculate distance you do not need threshold.
 

Thanks for help.
--
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.




--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux