Re: How to speed up pg_trgm / gin index scan

Jeff Janes <jeff.janes@xxxxxxxxx> · Mon, 22 Jun 2015 15:10:04 -0700

On Mon, Jun 22, 2015 at 10:39 AM, Christian Ramseyer <rc@xxxxxxxxxxx> wrote:
On 22/06/15 19:00, Jeff Janes wrote:

>

>

>     A typical query on this table looks like this:

>

>     explain analyze

>     select log_date, host, msg

>     from logs_01 as log   where  log.msg like '%192.23.33.177%'

>         and log.log_date >= '2015-1-18 1:45:24'

>         and log.log_date <= '2015-1-19 1:45:24'

>         order by log_date asc offset 200 limit 50;

>

>

> I think that trigram indexes are not well-suited to searching IP addresses.

>

> If the typical query is always an IP address for the LIKE, I think you

> would want to build an index specifically tailored to that.  You could

> make a function to parse the IP address out of the msg, and then make a

> functional index, for example. It would require you to write the query

> differently.  Whether it would be a btree index or a gin index would

> depend on whether you can have more than one IP address in a msg.

>

Thanks Jeff, but the IP address was mostly an example... I should have

written this more clearly. Generally the searched string will be a

random short fragment from msg (ip, hostname, some part of an error

message etc.).

What kind of timings do you get if you search on a hostname or part of an error message?  Is it slow in general, or just when the thing you search on happens to be an IP address?  

Certainly in my hands, trigram index searching with embedded IP addresses are much worse than on embedded natural language fragments.

It must be matched exactly including all punctuation etc, so trigrams

look very suitable.

I believe the default compilation of pg_trgm ignores all punctuation (converts them to whitespace) in the index.  For a LIKE query, it catches them when it rechecks the actual tuple in the heap so you still get the right answer.  But if the query is mostly punctuation and short numbers, it takes much longer to get that correct answer.

Since the time range you query over is narrow and the rows are probably well-clustered on it, maybe just using the logs_01_date_index would be faster and then just filtering the table with the LIKE clause:

explain (analyze, buffers)
select log_date, host, msg
from logs_01 as log   where  (log.msg||'') like '%192.23.33.177%'
    and log.log_date >= '2015-1-18 1:45:24'
    and log.log_date <= '2015-1-19 1:45:24'
    order by log_date asc offset 200 limit 50;
Other options would be making your partitions smaller, or using btree_gist and make an index on (log_date, msg gist_trgm_ops).  Unfortunately those indexes can be awful slow to build.

Cheers,

Jeff