Re: substring index what is better way to query

Tieson Molly <tvmaly@xxxxxxxxx> · Tue, 20 Jun 2017 10:40:43 -0400

Tom, is there a different construct than the Similar To that would work?

I know for certain that the first few characters could be different due to the nature of geohashes.  So I may not be able to optimize the prefix aspect in some cases.

Best regards,

Ty

On Jun 20, 2017 10:19 AM, "Tom Lane" <tgl@xxxxxxxxxxxxx> wrote:
Tieson Molly <tvmaly@xxxxxxxxx> writes:

> I have an index I created on the prefix of a column:

> create index location_geo_idx ON locations( substring(geohash, 0, 5));

> I currently use a similar to query, but I wanted to know if there is a

> faster way to query multiple value using this index than this?

> select l.geohash from locations l where l.geohash similar to '(dr7g|dr7e)%';

Well, you've got a couple of problems there.  The most basic one is that

that index doesn't match that query at all.  You need to arrange things

so that the lefthand side of the SIMILAR TO operator is exactly the

indexed value, not something that's related to it.  (Yes, in principle

that index could be used to answer this query, but it would require a

great deal more intimate knowledge than the planner has about the

semantics of both substring() and SIMILAR TO.)  IOW, you need to write

select l.geohash from locations l

  where substring(l.geohash, 0, 5) similar to '(dr7g|dr7e)%';

The other possible solution would be to just index the geohash strings

verbatim; unless they are quite long, that's what I'd recommend, usually.

Secondly, if you're using a non-C locale, you're likely not getting an

indexscan plan anyway; check it with EXPLAIN.  To get an indexed prefix

search out of a pattern match, the index has to use C sorting rules,

which you can force with a COLLATE or text_pattern_ops option if the

database's prevailing locale isn't C.

Thirdly, if you experiment with EXPLAIN a little bit, you'll soon realize

that the planner is not great at extracting common prefix strings out of

OR'd pattern branches:

regression=# create table loc (f1 text unique);

CREATE TABLE

regression=# explain select * from loc where f1 similar to '(dr7g|dr7e)%';

                               QUERY PLAN

-------------------------------------------------------------------------

 Bitmap Heap Scan on loc  (cost=4.22..14.37 rows=1 width=32)

   Filter: (f1 ~ '^(?:(?:dr7g|dr7e).*)$'::text)

   ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)

         Index Cond: ((f1 >= 'd'::text) AND (f1 < 'e'::text))

(4 rows)

The useful part of this for speed purposes is the "Index Cond", and

you can see that it's only enforcing that the first character be "d".

I don't remember that code very well at the moment, but I'm a bit

surprised that it's even figured out that the "d" is common to both

branches.  You can get a lot more traction if you factor the common

prefix manually:

regression=# explain select * from loc where f1 similar to 'dr7(g|e)%';

                               QUERY PLAN

-------------------------------------------------------------------------

 Bitmap Heap Scan on loc  (cost=4.22..14.37 rows=1 width=32)

   Filter: (f1 ~ '^(?:dr7(?:g|e).*)$'::text)

   ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)

         Index Cond: ((f1 >= 'dr7'::text) AND (f1 < 'dr8'::text))

(4 rows)

or maybe even

regression=# explain select * from loc where f1 similar to 'dr7g%' or f1 similar to 'dr7e%';

                                  QUERY PLAN

-------------------------------------------------------------------------------

 Bitmap Heap Scan on loc  (cost=8.45..19.04 rows=2 width=32)

   Recheck Cond: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~ '^(?:dr7e.*)$'::text))

   Filter: ((f1 ~ '^(?:dr7g.*)$'::text) OR (f1 ~ '^(?:dr7e.*)$'::text))

   ->  BitmapOr  (cost=8.45..8.45 rows=14 width=0)

         ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)

               Index Cond: ((f1 >= 'dr7g'::text) AND (f1 < 'dr7h'::text))

         ->  Bitmap Index Scan on loc_f1_key  (cost=0.00..4.22 rows=7 width=0)

               Index Cond: ((f1 >= 'dr7e'::text) AND (f1 < 'dr7f'::text))

(8 rows)

Whether this is worth the trouble depends a lot on your data distribution,

but any of them are probably better than the seqscan you're no doubt

getting right now.

                        regards, tom lane