pg_trgm for address search

Sumit Raja <sumitraja@xxxxxxxxx> · Thu, 4 Feb 2021 16:24:59 +1100

Hello,

I  am trying to get a functioning postgres address search capability for Australian addresses using tsearch or pg_trgm. pg_trgm is actually better suited in this case as it allows progressive entry e.g. ' 1,  20 Kle' gives a good set of results across Kelm Avenue, Kleins Av etc. 

Good performance with tsearch (<1s response times) for non progressive matches but the pg_trgm performance varies from 3.5 to 15 seconds.

Table is very simple:

CREATE TABLE address_search.tsearch_address_detail (
  address_detail_pid character varying(15) NOT NULL,
  address_state_abbreviation character varying(3) NOT NULL,
  address_concat text not null,
  address_concat_ts tsvector
);

CREATE INDEX idx_places_trgm_gin_addr ON address_search.tsearch_address_detail USING gin(address_concat gin_trgm_ops);

CREATE INDEX idx_places_trgm_gist_addr ON address_search.tsearch_address_detail USING gist(address_concat gist_trgm_ops);

CREATE INDEX idx_places_ts_gin_addr ON address_search.tsearch_address_detail USING GIN (address_concat_ts);

Actual data looks like the below:

 address_detail_pid | address_state_abbreviation |                                 address_concat                                  |                                                       address_concat_ts  
--------------------+----------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------
 GAWA_163274127     | WA                         | Unit 1,  20 Klem Avenue, Salter Point, WA 6152                                  | '1':2 '20':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
 GAWA_163274129     | WA                         | Unit 2,  20 Klem Avenue, Salter Point, WA 6152                                  | '2':2 '20':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
 GAWA_163274130     | WA                         | Unit 2,  3 Klem Avenue, Salter Point, WA 6152                                   | '2':2 '3':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
 GAWA_163274255     | WA                         | 11 Lancefield Street, Laverton, WA 6440                                         | '11':1 '6440':6 'lancefield':2 'laverton':4 'street':3 'wa':5
 GAWA_163274256     | WA                         | 13 Lancefield Street, Laverton, WA 6440                                         | '13':1 '6440':6 'lancefield':2 'laverton':4 'street':3 'wa':5

and query is:

SELECT address_detail_pid, address_concat, word_similarity('1, 20 kle', address_concat) AS sml
    FROM address_search.tsearch_address_detail
    WHERE '1, 20 kle' <% address_concat
    ORDER BY sml DESC limit 10;

The explain (analyze, buffer) is (https://explain.depesz.com/s/tvZ9):
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=52551.77..52551.79 rows=10 width=65) (actual time=3119.791..3119.793 rows=10 loops=1)
   Buffers: shared hit=6432 read=4564
   ->  Sort  (cost=52551.77..52590.50 rows=15492 width=65) (actual time=3119.788..3119.789 rows=10 loops=1)
         Sort Key: (word_similarity('1, 20 kle'::text, address_concat)) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=6432 read=4564
         ->  Bitmap Heap Scan on tsearch_address_detail  (cost=252.06..52216.99 rows=15492 width=65) (actual time=547.976..3119.067 rows=394 loops=1)
               Recheck Cond: ('1, 20 kle'::text <% address_concat)
               Rows Removed by Index Recheck: 3791
               Heap Blocks: exact=3991
               Buffers: shared hit=6429 read=4564
               ->  Bitmap Index Scan on idx_places_trgm_gin_addr  (cost=0.00..248.19 rows=15492 width=0) (actual time=547.380..547.380 rows=4185 loops=1)
                     Index Cond: ('1, 20 kle'::text <% address_concat)
                     Buffers: shared hit=3991 read=3011
 Planning Time: 44.701 ms
 Execution Time: 3120.052 ms
(16 rows)

Table size is 3026MB and GIN index size is 293 MB.

I've increased shared_buffers to 800MB, work_mem=1500MB, effective_cache_size=2GB.

Are there any optimisations I can make or should I be building the data set differently for better searching by pg_trgm? Or using a combination of tsearch and pg_trgm?

Thanks

Sumit