On hot standby I faced with the similar problem.
The following simple query:
SELECT array_to_json(array_agg(t)) from (
select *
from main.message m
join main.message_instance mi on m.message_id = mi.message_id
join main.channel_type ct on mi.channel_type_id =
ct.channel_type_id
where m.user_id = 2152595
and ct.name = 'android'
and m.user_delete_ts is null
order by
case
when read_ts is null then 0
else 1
end,
m.message_id desc
limit 100
) t;
is planned 4.940 ms on master and *254.741* ms on standby. I tried to
investigate the reasons of so long planning on replica and determined
that index accesses on planning stage come to multiple heap scans.
Here is the execution plan statistics of query above:
Aggregate (actual time=0.641..0.642 rows=1 loops=1)
Buffers: shared hit=14
-> Subquery Scan on t (actual time=0.337..0.360 rows=2 loops=1)
Buffers: shared hit=14
-> Limit (actual time=0.292..0.301 rows=2 loops=1)
Buffers: shared hit=14
-> Sort (actual time=0.287..0.291 rows=2 loops=1)
Sort Key: (CASE WHEN (m.read_ts IS NULL) THEN 0
ELSE 1 END), m.message_id DESC
Sort Method: quicksort Memory: 27kB
Buffers: shared hit=14
-> Nested Loop (actual time=0.157..0.219 rows=2
loops=1)
Buffers: shared hit=14
-> Seq Scan on channel_type ct (actual
time=0.043..0.048 rows=1 loops=1)
Filter: (name = 'android'::text)
Rows Removed by Filter: 7
Buffers: shared hit=1
-> Nested Loop (actual time=0.098..0.148
rows=2 loops=1)
Buffers: shared hit=13
-> Index Scan using
message_user_id_idx1 on message m (actual time=0.055..0.063 rows=2 loops=1)
Index Cond: (user_id = 2152595)
Buffers: shared hit=5
-> Index Scan using
message_instance_pkey on message_instance mi (actual time=0.026..0.029
rows=1 loops=2)
Index Cond: ((message_id =
m.message_id) AND (channel_type_id = ct.channel_type_id))
Buffers: shared hit=8
The 14 accesses to buffer cache under query execution. But
pg_statio_user_tables and pg_statio_user_indexes views show the
different picture:
select
schemaname, relname, indexrelname, idx_blks_hit, idx_blks_read,
pg_size_pretty(pg_relation_size(indexrelid::regclass))
from pg_statio_user_indexes
where idx_blks_hit + idx_blks_read > 0;
schemaname | relname | indexrelname | idx_blks_hit |
idx_blks_read | pg_size_pretty
------------+------------------+-----------------------+--------------+---------------+----------------
main | channel_type | channel_type_pkey | 2
| 0 | 16 kB
main | message_instance | message_instance_pkey | 666
| 0 | 345 MB
main | message | message_pkey | 56
| 0 | 53 MB
main | message | message_user_id_idx1 | 3
| 0 | 17 MB
select
schemaname, relname, heap_blks_read, heap_blks_hit
from pg_statio_user_tables
where heap_blks_read + heap_blks_hit + idx_blks_hit + idx_blks_read > 0;
schemaname | relname | heap_blks_read | heap_blks_hit
------------+------------------+----------------+---------------
main | channel_type | 0 | 3
main | message | 0 | 8682
main | message_instance | 0 | 114922
(This experiment is carried out on test hot standby without parallel
activities and with cleared system statistics)
The vacuum on problem tables (main.message and main.message_instance) on
master node resolves the problem somehow but its often execution slows
down all queries and generally increases IO.
Is there any case to overcome the problem or it's fundamental issue and
necessary to rewrite the query to simplify planning?
--
Regards,
Maksim Milyutin