TL;DR Version:
==============

For a table where all rows have been deleted, under certain conditions the
planner still assumes that it contains the values that were previously
present, and will therefore assume that nothing needs to be added, while in
fact everything needs to be added.

-----

This is PostgreSQL 12.13 on amd64-portbld-freebsd13.1, compiled by FreeBSD
clang version 13.0.0 (git@xxxxxxxxxx:llvm/llvm-project.git
llvmorg-13.0.0-0-gd7b669b3a303), 64-bit. If somebody can tell me this has
been fixed in a more recent version, then I am happy.

-------------------------------------------------------

More elaborate Version:
=======================

When a table is populated, pg_class may look like this:

 relpages | reltuples
----------+-----------
        1 |         3

Now when you delete all rows, this will not change, and the planner will
make the -wrong- assumption that the tuples are still present, so planning
may be bad. That is why ANALYZE is now needed, and after ANALYZE things
look like this and work again:

 relpages | reltuples
----------+-----------
        1 |         0

But if you happen to run a VACUUM ANALYZE on that table, then things will
look like this:

 relpages | reltuples
----------+-----------
        0 |         0

And now, for reasons unknown to me, the planner will again make the
assumption that all the previous values are still present in the table, and
planning will be bad. At that point you can run ANALYZE as often as you
want; it does not help anymore.

-------------------------------------------------------

Full Version:
=============

Let's start with the query:

INSERT INTO opcode (opcode)
  SELECT DISTINCT incoming.opcode AS opcode
    FROM incoming AS incoming
    LEFT JOIN opcode AS opcode USING (opcode)
    WHERE opcode.opcode IS NULL

Table "incoming" will bring millions of rows, but column "opcode" knows
only three distinct values. Other values might well appear over time, so
this INSERT will add any such new values to table "opcode". So far, no
problem.

But what happens if I delete all rows from "opcode" and start the process
afresh? Then the INSERT *should* reinsert all the (three) values freshly
into table "opcode", i.e. there will be *millions* of rows "WHERE
opcode.opcode IS NULL" before the "DISTINCT". This is also not a problem.
Not yet.
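(Side note: throughout this report, the relpages/reltuples pairs are read
straight from pg_class, roughly with a query like the one below - a sketch
only, using the schema-qualified table name that appears later on:)

-- sketch: how the relpages/reltuples figures quoted here are obtained
SELECT relpages, reltuples
  FROM pg_class
  WHERE oid = 'dnstap2.opcode'::regclass;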
Now we decide that not all the rows in table "incoming" are already
processable, so we limit the whole operation to those that are:

INSERT INTO opcode (opcode)
  SELECT DISTINCT incoming.opcode AS opcode
    FROM incoming AS incoming
    LEFT JOIN opcode AS opcode USING (opcode)
    WHERE opcode.opcode IS NULL
      AND EXISTS ( SELECT ready.id
                     FROM tmp_ready AS ready
                     WHERE ready.id = incoming.id )

And now this query will run forever and never return:

"Unique  (cost=434008.43..434008.44 rows=1 width=6)"
"  ->  Sort  (cost=434008.43..434008.44 rows=1 width=6)"
"        Sort Key: incoming.opcode"
"        ->  Nested Loop Semi Join  (cost=1058.15..434008.42 rows=1 width=6)"
"              Join Filter: (incoming.id = pairs.ori)"
"              ->  Gather  (cost=1058.15..151427.56 rows=1 width=14)"
"                    Workers Planned: 3"
"                    ->  Hash Anti Join  (cost=58.15..150427.46 rows=1 width=14)"
"                          Hash Cond: (incoming.opcode = opcode.opcode)"
"                          ->  Parallel Seq Scan on incoming  (cost=0.00..147589.32 rows=741332 width=14)"
"                          ->  Hash  (cost=31.40..31.40 rows=2140 width=6)"
"                                ->  Seq Scan on opcode  (cost=0.00..31.40 rows=2140 width=6)"
"              ->  Seq Scan on tmp_ready ready  (cost=0.00..253869.27 rows=2296927 width=8)"

The LEFT JOIN is estimated to produce only one result row (we know there
can actually be millions), and therefore this goes into a Nested Loop that
will read tmp_ready over and over again.

Putting an index onto tmp_ready helps the matter - but that is not the root
cause here. The root cause is a wrong estimate: deleting all rows from
table "opcode" apparently does not change the planner's behaviour. The
planner still thinks that all values from the incoming.opcode column are
already present in opcode.opcode.

After various experiments, I finally managed to somehow kick ANALYZE into
producing working statistics, and now it looks like this:

"HashAggregate  (cost=554927.43..554927.44 rows=1 width=6)"
"  Group Key: incoming.opcode"
"  ->  Hash Semi Join  (cost=291554.87..549213.83 rows=2285442 width=6)"
"        Hash Cond: (incoming.id = ready.id)"
"        ->  Hash Anti Join  (cost=1.01..192070.20 rows=2286638 width=14)"
"              Hash Cond: (incoming.opcode = opcode.opcode)"
"              ->  Seq Scan on incoming  (cost=0.00..163157.29 rows=2298129 width=14)"
"              ->  Hash  (cost=1.00..1.00 rows=1 width=6)"
"                    ->  Seq Scan on opcode  (cost=0.00..1.00 rows=1 width=6)"
"        ->  Hash  (cost=253869.27..253869.27 rows=2296927 width=8)"
"              ->  Seq Scan on tmp_ready ready  (cost=0.00..253869.27 rows=2296927 width=8)"

The LEFT JOIN has the correct rowcount now, and obviously no nested loop is
attempted with that.

There are a couple of approaches to work around this, like:
 - just put an index on tmp_ready
 - just disallow the loop: SET enable_nestloop = 'f'

But those don't seem professional, as they don't tackle the actual place of
the defect. The question is: how do I get ANALYZE to actually do its work?
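(For completeness, the two workarounds mentioned above would look roughly
like this - a sketch only, the index name is made up:)

-- workaround 1: give the inner side of the nested loop an index to use
CREATE INDEX IF NOT EXISTS tmp_ready_id_idx ON tmp_ready (id);

-- workaround 2: forbid nested loops for the session running the INSERT
SET enable_nestloop = off;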
Let's investigate:

We have the original content of table "opcode":

 id | opcode
----+--------
 29 | NOTIFY
 30 | UPDATE
 31 | QUERY

And for certain the query works (and delivers 0 result rows):

"Unique  (cost=433951.35..433951.36 rows=1 width=6) (actual time=1938.400..1941.805 rows=0 loops=1)"
"  Buffers: shared hit=2694 read=137678"
"  ->  Sort  (cost=433951.35..433951.35 rows=1 width=6) (actual time=1938.399..1941.803 rows=0 loops=1)"
"        Sort Key: incoming.opcode"
"        Sort Method: quicksort  Memory: 25kB"
"        Buffers: shared hit=2694 read=137678"
"        ->  Nested Loop Semi Join  (cost=1001.07..433951.34 rows=1 width=6) (actual time=1938.387..1941.791 rows=0 loops=1)"
"              Join Filter: (incoming.id = pairs.ori)"
"              Buffers: shared hit=2694 read=137678"
"              ->  Gather  (cost=1001.07..151370.48 rows=1 width=14) (actual time=1938.386..1941.789 rows=0 loops=1)"
"                    Workers Planned: 3"
"                    Workers Launched: 3"
"                    Buffers: shared hit=2694 read=137678"
"                    ->  Hash Anti Join  (cost=1.07..150370.38 rows=1 width=14) (actual time=1928.520..1928.524 rows=0 loops=4)"
"                          Hash Cond: (incoming.opcode = opcode.opcode)"
"                          Buffers: shared hit=2694 read=137678"
"                          ->  Parallel Seq Scan on incoming  (cost=0.00..147589.32 rows=741332 width=14) (actual time=0.199..1651.435 rows=574232 loops=4)"
"                                Buffers: shared hit=2498 read=137678"
"                          ->  Hash  (cost=1.03..1.03 rows=3 width=6) (actual time=0.063..0.066 rows=3 loops=4)"
"                                Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"                                Buffers: shared hit=4"
"                                ->  Seq Scan on opcode  (cost=0.00..1.03 rows=3 width=6) (actual time=0.043..0.047 rows=3 loops=4)"
"                                      Buffers: shared hit=4"
"              ->  Seq Scan on tmp_pairs pairs  (cost=0.00..253869.27 rows=2296927 width=8) (never executed)"
"Planning Time: 0.512 ms"
"Execution Time: 1941.862 ms"

We have this data in pg_class and pg_stats:

 relpages | reltuples
----------+-----------
        1 |         3

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds      | correlation
---------+------------+------------------+-------------------+-----------------------+-------------
 id      |         -1 |                  |                   | {29,30,31}            |           1
 opcode  |         -1 |                  |                   | {NOTIFY,QUERY,UPDATE} |         0.5

# delete from dnstap2.opcode;
DELETE 3

The stats have not changed, and the plan is still the same one, which will
now run forever.

# analyze dnstap2.opcode;
ANALYZE

 relpages | reltuples
----------+-----------
        1 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds      | correlation
---------+------------+------------------+-------------------+-----------------------+-------------
 id      |         -1 |                  |                   | {29,30,31}            |           1
 opcode  |         -1 |                  |                   | {NOTIFY,QUERY,UPDATE} |         0.5

The plan has now changed, and the query works as intended.

The next table shows the same behaviour, so it also works. But then the
next table, "proto", is empty, and here the query does run forever:

 relpages | reltuples
----------+-----------
        0 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds    | correlation
---------+------------+------------------+-------------------+---------------------+-------------
 id      |         -1 |                  |                   | {37,38,39,40}       |           1
 proto   |         -1 |                  |                   | {TCP,TCP6,UDP,UDP6} |           1

# analyze dnstap2.proto;
ANALYZE

This does not change anything, and the query still runs forever!
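(For the record, the column statistics shown in this walk-through come from
the pg_stats view, roughly like this - a sketch, and the output formatting
differs slightly from what psql prints:)

-- sketch: column statistics as quoted above
SELECT attname, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
  FROM pg_stats
  WHERE schemaname = 'dnstap2' AND tablename = 'opcode';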
Let's look closer:

 "oid"      "what"       "nspname"  "relname"  "filename"  "relpages"
------------------------------------------------------------------------------------------
 "9625390"  "SELF"       "dnstap2"  "proto"    "9625390"   0
 "9625394"  "TOAST"      "dnstap2"  "proto"    "9625394"   0
 "9625396"  "TOASTIDX"   "dnstap2"  "proto"    "9625396"   1
 "9625399"  "proto_uniq" "dnstap2"  "proto"    "9625399"   2

# dir 962539[0469]*
-rw-------  1 770  770      0 Feb 11 05:37 9625390
-rw-------  1 770  770  16384 Feb 11 05:41 9625390_fsm
-rw-------  1 770  770      0 Feb 11 05:37 9625390_vm
-rw-------  1 770  770      0 Feb  7 11:59 9625394
-rw-------  1 770  770   8192 Feb  8 04:53 9625396
-rw-------  1 770  770  16384 Feb 11 05:41 9625399

At 05:37 this was the nightly VACUUM ANALYZE (there is no autovacuum here).

Next table: "addr"

 relpages | reltuples
----------+-----------
        0 |         0

[stats are lengthy]

And it does not work.

Next table: "status"

 relpages | reltuples
----------+-----------
        0 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds                                      | correlation
---------+------------+------------------+-------------------+-------------------------------------------------------+-------------
 id      |         -1 |                  |                   | {55,56,57,58,59,60}                                   |           1
 status  |         -1 |                  |                   | {BADCOOKIE,FORMERR,NOERROR,NXDOMAIN,REFUSED,SERVFAIL} |           1

And it does not work.

Okay, let's grab the "opcode" table and see further...

# VACUUM ANALYZE dnstap2.opcode;
VACUUM

 relpages | reltuples
----------+-----------
        0 |         0

KAPUTT! The query now runs forever.

# ANALYZE dnstap2.opcode;
ANALYZE

Nada, the query still runs forever.

# insert into dnstap2.opcode (opcode) Values ('crap');
INSERT 0 1

 relpages | reltuples
----------+-----------
        0 |         0

# analyze dnstap2.opcode;
ANALYZE

 relpages | reltuples
----------+-----------
        1 |         1

Too bad, the query still runs forever. :(

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
---------+------------+------------------+-------------------+------------------+-------------
 id      |         -1 |                  |                   |                  |
 opcode  |         -1 |                  |                   |                  |

Oops. When did that happen?

# vacuum analyze dnstap2.opcode;
VACUUM
# analyze dnstap2.opcode;
ANALYZE
# delete from dnstap2.opcode;
DELETE 1
# analyze dnstap2.opcode;
ANALYZE

 relpages | reltuples
----------+-----------
        1 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
---------+------------+------------------+-------------------+------------------+-------------
 id      |         -1 |                  |                   |                  |
 opcode  |         -1 |                  |                   |                  |

Okay, my query works again!

Friends, I am not clear on WHAT exactly goes wrong here, but my gut feeling
is very strong that something really does go wrong here.

What exactly do I need to do after deleting all rows from tables? ANALYZE,
obviously. What else? VACUUM, apparently, absolutely NOT. But what if it
runs in the interim from a cron schedule? Then I would need to manually
insert some values into the concerned tables, run ANALYZE, delete the
values again, run ANALYZE, and only then could operations resume. This
cannot be the idea.
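(For illustration only, that manual dance would look something like the
following - the dummy value is made up, and this is exactly the kind of
thing that should not be necessary:)

-- sketch of the insert/analyze/delete/analyze workaround described above
INSERT INTO dnstap2.opcode (opcode) VALUES ('dummy');
ANALYZE dnstap2.opcode;
DELETE FROM dnstap2.opcode WHERE opcode = 'dummy';
ANALYZE dnstap2.opcode;
-- per the behaviour described above, this ends with relpages = 1,
-- reltuples = 0, and the query plans correctly again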