I'd like to understand why PostgreSQL is choosing to filter on the most inefficient predicate first in the query below.
Version: PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu, compiled by gcc (SUSE Linux) 4.7.1 20120723 [gcc-4_7-branch revision 189773], 64-bit
Hardware: 24 2.93GHz Xeon cores and 32GB RAM
Table in question:
------------------------------------------------------------
CREATE TABLE audit_trail (
id int PRIMARY KEY NOT NULL,
model varchar(128) NOT NULL,
model_id int NOT NULL,
created bool,
updated bool,
deleted bool,
change_set text,
created_by varchar(64) NOT NULL,
created_date timestamp NOT NULL,
updated_by varchar(64) NOT NULL,
updated_date timestamp NOT NULL
);
CREATE UNIQUE INDEX audit_trail_pkey ON audit_trail(id);
Row count: 5,306,596
Min / Avg / Max character length of change_set column values: 11 / 165 / 12717859
Number of unique values in model column: 196
Query and plan:
------------------------------------------------------------
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM audit_trail WHERE model = 'User' AND model_id = 304 AND change_set ILIKE '%test%';
Seq Scan on audit_trail (cost=0.00..243427.98 rows=1 width=189) (actual time=15509.722..17943.896 rows=6 loops=1)
Filter: ((change_set ~~* '%test%'::text) AND ((model)::text = 'User'::text) AND (model_id = 304))
Rows Removed by Filter: 5306590
Buffers: shared hit=10410 read=164384
I/O Timings: read=310.189
Total runtime: 17943.930 ms
Observations:
------------------------------------------------------------
1) Without the change_set predicate, the query runs in 1 second and returns 461 rows.
2) Turning statistics off for the change_set column has no effect on the plan or execution time:
ALTER TABLE audit_trail ALTER COLUMN change_set SET STATISTICS 0;
ANALYZE audit_trail(change_set);
3) Setting statistics to the max value has no effect on the plan or execution time:
ALTER TABLE audit_trail ALTER COLUMN change_set SET STATISTICS 10000;
ANALYZE audit_trail(change_set);
4) Adding an index on (model, model_id) changes the plan so that it starts with an index scan. Query time < 1s.
CREATE INDEX audit_trail_model_idx ON audit_trail (model, model_id);
Aggregate (cost=12.29..12.30 rows=1 width=4) (actual time=1.455..1.456 rows=1 loops=1)
-> Index Scan using audit_trail_model_idx on audit_trail this_ (cost=0.56..12.29 rows=1 width=4) (actual time=1.446..1.446 rows=0 loops=1)
Index Cond: (((model)::text = 'User'::text) AND (model_id = 304))
Filter: (change_set ~~* '%test%'::text)
Rows Removed by Filter: 461
Although adding the index will fix the performance problem, I'd like to understand why, in absence of the index, PostgreSQL would choose to filter on the change_set value first. Since it is not specified first in the predicate and its column ordinality is higher than model and model_id, the plan generator must be choosing it first for some particular reason.
Any insight is appreciated.
Version: PostgreSQL 9.3.2 on x86_64-unknown-linux-gnu, compiled by gcc (SUSE Linux) 4.7.1 20120723 [gcc-4_7-branch revision 189773], 64-bit
Hardware: 24 2.93GHz Xeon cores and 32GB RAM
Table in question:
------------------------------------------------------------
CREATE TABLE audit_trail (
id int PRIMARY KEY NOT NULL,
model varchar(128) NOT NULL,
model_id int NOT NULL,
created bool,
updated bool,
deleted bool,
change_set text,
created_by varchar(64) NOT NULL,
created_date timestamp NOT NULL,
updated_by varchar(64) NOT NULL,
updated_date timestamp NOT NULL
);
CREATE UNIQUE INDEX audit_trail_pkey ON audit_trail(id);
Row count: 5,306,596
Min / Avg / Max character length of change_set column values: 11 / 165 / 12717859
Number of unique values in model column: 196
Query and plan:
------------------------------------------------------------
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM audit_trail WHERE model = 'User' AND model_id = 304 AND change_set ILIKE '%test%';
Seq Scan on audit_trail (cost=0.00..243427.98 rows=1 width=189) (actual time=15509.722..17943.896 rows=6 loops=1)
Filter: ((change_set ~~* '%test%'::text) AND ((model)::text = 'User'::text) AND (model_id = 304))
Rows Removed by Filter: 5306590
Buffers: shared hit=10410 read=164384
I/O Timings: read=310.189
Total runtime: 17943.930 ms
Observations:
------------------------------------------------------------
1) Without the change_set predicate, the query runs in 1 second and returns 461 rows.
2) Turning statistics off for the change_set column has no effect on the plan or execution time:
ALTER TABLE audit_trail ALTER COLUMN change_set SET STATISTICS 0;
ANALYZE audit_trail(change_set);
3) Setting statistics to the max value has no effect on the plan or execution time:
ALTER TABLE audit_trail ALTER COLUMN change_set SET STATISTICS 10000;
ANALYZE audit_trail(change_set);
4) Adding an index on (model, model_id) changes the plan so that it starts with an index scan. Query time < 1s.
CREATE INDEX audit_trail_model_idx ON audit_trail (model, model_id);
Aggregate (cost=12.29..12.30 rows=1 width=4) (actual time=1.455..1.456 rows=1 loops=1)
-> Index Scan using audit_trail_model_idx on audit_trail this_ (cost=0.56..12.29 rows=1 width=4) (actual time=1.446..1.446 rows=0 loops=1)
Index Cond: (((model)::text = 'User'::text) AND (model_id = 304))
Filter: (change_set ~~* '%test%'::text)
Rows Removed by Filter: 461
Although adding the index will fix the performance problem, I'd like to understand why, in absence of the index, PostgreSQL would choose to filter on the change_set value first. Since it is not specified first in the predicate and its column ordinality is higher than model and model_id, the plan generator must be choosing it first for some particular reason.
Any insight is appreciated.