Howdy folks, Recently i've run into a problem where autoanalyze is causing a query plan to flip over to using an index which is about 10x slower, and the problem is fixed by running an alayze manually. some relevant info: UPDATE sleeping_intents SET raptor_after='2020-09-14T19:21:03.581106'::timestamp, status='requires_capture', updated_at='2020-09-14T16:21:03.581104+00:00'::timestamptz WHERE sleeping_intents.id = 'r2d2dcc0-8a44-4d19-c3p0-28522233b836'::uuid AND sleeping_intents.status = 'finna' RETURNING *; The plan generated after autoanalyze is: Update on sleeping_intents (cost=0.70..7.11 rows=1 width=272) (actual time=57.945..57.945 rows=0 loops=1) Buffers: shared hit=43942 -> Index Scan using sleeping_intents_status_created_at_raptor_after_idx on sleeping_intents (cost=0.70..7.11 rows=1 width=272) (actual time=57.943..57.943 rows=0 loops=1) Index Cond: (status = 'init'::text) Filter: (id = 'r2d2dcc0-8a44-4d19-c3p0-28522233b836'::uuid) Rows Removed by Filter: 1262 Buffers: shared hit=43942 Planning time: 0.145 ms Execution time: 57.981 ms after i run analyze manually, the query plan is changed to this: Update on sleeping_intents (cost=0.57..8.59 rows=1 width=272) (actual time=0.023..0.023 rows=0 loops=1) Buffers: shared hit=7 -> Index Scan using sleeping_intents_pkey on sleeping_intents (cost=0.57..8.59 rows=1 width=272) (actual time=0.022..0.022 rows=0 loops=1) Index Cond: (id = 'r2d2dcc0-8a44-4d19-c3p0-28522233b836'::uuid) Filter: (status = 'init'::text) Rows Removed by Filter: 1 Buffers: shared hit=7 Planning time: 0.092 ms Execution time: 0.066 ms Note that in the second query, it switches back to using the primary key index, which does seem like the logically better choice, even though it shows a higher estimated cost than the "bad" case (understanding the cost must change somewhere in the process, but there no way to see it afaict). In trying to determine why it switches, I dug up some likely useful info: Index definitions: (20 GB) "sleeping_intents_pkey" PRIMARY KEY, btree (id) (37 GB) "sleeping_intents_status_created_at_raptor_after_idx" btree (status, created_at DESC, raptor_after DESC) Basic info on the table: > select relid::regclass, n_live_tup,n_mod_since_analyze,analyze_count,autoanalyze_count from pg_stat_user_tables where relname='sleeping_intents'; relid | n_live_tup | n_mod_since_analyze | analyze_count | autoanalyze_count -----------------+------------+---------------------+---------------+------------------- sleeping_intents | 491171179 | 1939347 | 4 | 80 (that num mods is in the last ~5 hours, the table is fairly active, although on a relatively small portion of the data) Statistics after manual analyze: tablename | attname | null_frac | avg_width | n_distinct | correlation | most_common_freqs -----------------+---------------+-----------+-----------+------------+-------------+-------------------------------------------------------- sleeping_intents | id | 0 | 16 | -1 | -0.00133045 | [null] sleeping_intents | status | 0 | 9 | 6 | 0.848468 | {0.918343,0.0543667,0.0267567,0.000513333,1e-05,1e-05} sleeping_intents | created_at | 0 | 8 | -1 | 0.993599 | [null] sleeping_intents | raptor_after | 0.0663433 | 8 | -0.933657 | 0.99392 | [null] In a previous go around with this table, I also increased the statistics target for the id column to 1000, vs 100 which is the database default. Originally I was mostly interested in trying to understand why it would choose something other than the non-pk index, which sort of feels like a bug; what could be faster than seeking an individual entry in a pk index? There are cases where it might make sense, but this doesn't seem like one (even accounting for the infrequency of the status we are looking for, which is 1e-05, the disparity in index size should push it back to the pk imho, unless I am not thinking through correlation enough?). However, it also seems very odd that this problem occurs at all. In the last couple of times this has happened, the manual analyze has been run within ~30-45 minutes of the auto-analyze, and while the data is changing, it isn't changing that rapidly that this should make a significant difference, but I don't see any other reason that autoanalyze would produce a different result than manual analyze. All that said, any insight on the above two items would be great, but the most immediate concern would be around suggestions for preventing this from happening again? Thanks in advance, Robert Treat https://xzilla.net