Hello,
I am trying to get a functioning postgres address search capability for Australian addresses using tsearch or pg_trgm. pg_trgm is actually better suited in this case as it allows progressive entry e.g. ' 1, 20 Kle' gives a good set of results across Kelm Avenue, Kleins Av etc.
Good performance with tsearch (<1s response times) for non progressive matches but the pg_trgm performance varies from 3.5 to 15 seconds.
Table is very simple:
CREATE TABLE address_search.tsearch_address_detail (
address_detail_pid character varying(15) NOT NULL,
address_state_abbreviation character varying(3) NOT NULL,
address_concat text not null,
address_concat_ts tsvector
);
CREATE INDEX idx_places_trgm_gin_addr ON address_search.tsearch_address_detail USING gin(address_concat gin_trgm_ops);
CREATE INDEX idx_places_trgm_gist_addr ON address_search.tsearch_address_detail USING gist(address_concat gist_trgm_ops);
CREATE INDEX idx_places_ts_gin_addr ON address_search.tsearch_address_detail USING GIN (address_concat_ts);
Actual data looks like the below:
address_detail_pid | address_state_abbreviation | address_concat | address_concat_ts
--------------------+----------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------
GAWA_163274127 | WA | Unit 1, 20 Klem Avenue, Salter Point, WA 6152 | '1':2 '20':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
GAWA_163274129 | WA | Unit 2, 20 Klem Avenue, Salter Point, WA 6152 | '2':2 '20':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
GAWA_163274130 | WA | Unit 2, 3 Klem Avenue, Salter Point, WA 6152 | '2':2 '3':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
GAWA_163274255 | WA | 11 Lancefield Street, Laverton, WA 6440 | '11':1 '6440':6 'lancefield':2 'laverton':4 'street':3 'wa':5
GAWA_163274256 | WA | 13 Lancefield Street, Laverton, WA 6440 | '13':1 '6440':6 'lancefield':2 'laverton':4 'street':3 'wa':5
and query is:
SELECT address_detail_pid, address_concat, word_similarity('1, 20 kle', address_concat) AS sml
FROM address_search.tsearch_address_detail
WHERE '1, 20 kle' <% address_concat
ORDER BY sml DESC limit 10;
The explain (analyze, buffer) is (https://explain.depesz.com/s/tvZ9):
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=52551.77..52551.79 rows=10 width=65) (actual time=3119.791..3119.793 rows=10 loops=1)
Buffers: shared hit=6432 read=4564
-> Sort (cost=52551.77..52590.50 rows=15492 width=65) (actual time=3119.788..3119.789 rows=10 loops=1)
Sort Key: (word_similarity('1, 20 kle'::text, address_concat)) DESC
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=6432 read=4564
-> Bitmap Heap Scan on tsearch_address_detail (cost=252.06..52216.99 rows=15492 width=65) (actual time=547.976..3119.067 rows=394 loops=1)
Recheck Cond: ('1, 20 kle'::text <% address_concat)
Rows Removed by Index Recheck: 3791
Heap Blocks: exact=3991
Buffers: shared hit=6429 read=4564
-> Bitmap Index Scan on idx_places_trgm_gin_addr (cost=0.00..248.19 rows=15492 width=0) (actual time=547.380..547.380 rows=4185 loops=1)
Index Cond: ('1, 20 kle'::text <% address_concat)
Buffers: shared hit=3991 read=3011
Planning Time: 44.701 ms
Execution Time: 3120.052 ms
(16 rows)
Table size is 3026MB and GIN index size is 293 MB.
I've increased shared_buffers to 800MB, work_mem=1500MB, effective_cache_size=2GB.
Are there any optimisations I can make or should I be building the data set differently for better searching by pg_trgm? Or using a combination of tsearch and pg_trgm?
Thanks
Sumit
I am trying to get a functioning postgres address search capability for Australian addresses using tsearch or pg_trgm. pg_trgm is actually better suited in this case as it allows progressive entry e.g. ' 1, 20 Kle' gives a good set of results across Kelm Avenue, Kleins Av etc.
Good performance with tsearch (<1s response times) for non progressive matches but the pg_trgm performance varies from 3.5 to 15 seconds.
Table is very simple:
CREATE TABLE address_search.tsearch_address_detail (
address_detail_pid character varying(15) NOT NULL,
address_state_abbreviation character varying(3) NOT NULL,
address_concat text not null,
address_concat_ts tsvector
);
CREATE INDEX idx_places_trgm_gin_addr ON address_search.tsearch_address_detail USING gin(address_concat gin_trgm_ops);
CREATE INDEX idx_places_trgm_gist_addr ON address_search.tsearch_address_detail USING gist(address_concat gist_trgm_ops);
CREATE INDEX idx_places_ts_gin_addr ON address_search.tsearch_address_detail USING GIN (address_concat_ts);
Actual data looks like the below:
address_detail_pid | address_state_abbreviation | address_concat | address_concat_ts
--------------------+----------------------------+---------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------
GAWA_163274127 | WA | Unit 1, 20 Klem Avenue, Salter Point, WA 6152 | '1':2 '20':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
GAWA_163274129 | WA | Unit 2, 20 Klem Avenue, Salter Point, WA 6152 | '2':2 '20':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
GAWA_163274130 | WA | Unit 2, 3 Klem Avenue, Salter Point, WA 6152 | '2':2 '3':3 '6152':9 'avenu':5 'klem':4 'point':7 'salter':6 'unit':1 'wa':8
GAWA_163274255 | WA | 11 Lancefield Street, Laverton, WA 6440 | '11':1 '6440':6 'lancefield':2 'laverton':4 'street':3 'wa':5
GAWA_163274256 | WA | 13 Lancefield Street, Laverton, WA 6440 | '13':1 '6440':6 'lancefield':2 'laverton':4 'street':3 'wa':5
and query is:
SELECT address_detail_pid, address_concat, word_similarity('1, 20 kle', address_concat) AS sml
FROM address_search.tsearch_address_detail
WHERE '1, 20 kle' <% address_concat
ORDER BY sml DESC limit 10;
The explain (analyze, buffer) is (https://explain.depesz.com/s/tvZ9):
----------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=52551.77..52551.79 rows=10 width=65) (actual time=3119.791..3119.793 rows=10 loops=1)
Buffers: shared hit=6432 read=4564
-> Sort (cost=52551.77..52590.50 rows=15492 width=65) (actual time=3119.788..3119.789 rows=10 loops=1)
Sort Key: (word_similarity('1, 20 kle'::text, address_concat)) DESC
Sort Method: top-N heapsort Memory: 26kB
Buffers: shared hit=6432 read=4564
-> Bitmap Heap Scan on tsearch_address_detail (cost=252.06..52216.99 rows=15492 width=65) (actual time=547.976..3119.067 rows=394 loops=1)
Recheck Cond: ('1, 20 kle'::text <% address_concat)
Rows Removed by Index Recheck: 3791
Heap Blocks: exact=3991
Buffers: shared hit=6429 read=4564
-> Bitmap Index Scan on idx_places_trgm_gin_addr (cost=0.00..248.19 rows=15492 width=0) (actual time=547.380..547.380 rows=4185 loops=1)
Index Cond: ('1, 20 kle'::text <% address_concat)
Buffers: shared hit=3991 read=3011
Planning Time: 44.701 ms
Execution Time: 3120.052 ms
(16 rows)
Table size is 3026MB and GIN index size is 293 MB.
I've increased shared_buffers to 800MB, work_mem=1500MB, effective_cache_size=2GB.
Are there any optimisations I can make or should I be building the data set differently for better searching by pg_trgm? Or using a combination of tsearch and pg_trgm?
Thanks
Sumit