TL;DR Version:
==============

For a table where all rows have been deleted, under certain conditions the
planner still assumes that it contains the values that were previously
present, and will therefore assume that nothing needs to be added, while in
fact everything needs to be added.

-----

This is PostgreSQL 12.13 on amd64-portbld-freebsd13.1, compiled by FreeBSD
clang version 13.0.0 (git@xxxxxxxxxx:llvm/llvm-project.git
llvmorg-13.0.0-0-gd7b669b3a303), 64-bit. If somebody can tell me this has
been fixed in a more recent version, then I am happy.

-------------------------------------------------------

More elaborate Version:
=======================

When a table is populated, pg_class may look like this:

 relpages | reltuples
----------+-----------
        1 |         3

Now when you delete all rows, this will not change, and the planner will
make the -wrong- assumption that the tuples are still present, so planning
may be bad. That is why ANALYZE is now needed, and after ANALYZE things
look like this and work again:

 relpages | reltuples
----------+-----------
        1 |         0

But if you happen to run a VACUUM ANALYZE on that table, then things will
look like this:

 relpages | reltuples
----------+-----------
        0 |         0

And now, for reasons unknown to me, the planner will again make the
assumption that all the previous values are still present in the table, and
planning will be bad. At that point you can run ANALYZE as often as you
want; it does not help anymore.

-------------------------------------------------------

Full Version:
=============

Let's start with the query:

INSERT INTO opcode (opcode)
  SELECT DISTINCT incoming.opcode AS opcode
    FROM incoming AS incoming
    LEFT JOIN opcode AS opcode USING (opcode)
    WHERE opcode.opcode IS NULL

Table "incoming" will bring millions of rows, but column "opcode" knows
only three distinct values. Other values might well appear over time, so
this INSERT will add any such new values to table "opcode". So far, no
problem.

But what happens if I delete all rows from "opcode" and start the process
afresh? Then the INSERT *should* reinsert all the (three) values freshly
into table "opcode", i.e. there will be *millions* of rows "WHERE
opcode.opcode IS NULL" before the "DISTINCT". This is also not a problem.
Not yet.
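(Side note: throughout this report, the relpages/reltuples pairs are read
straight from pg_class, roughly with a query like the one below - a sketch
only, using the schema-qualified table name that appears later on:)

-- sketch: how the relpages/reltuples figures quoted here are obtained
SELECT relpages, reltuples
  FROM pg_class
  WHERE oid = 'dnstap2.opcode'::regclass;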
Now we decide that not all the rows in table "incoming" are already
processable, so we limit the whole operation to those that are:

INSERT INTO opcode (opcode)
  SELECT DISTINCT incoming.opcode AS opcode
    FROM incoming AS incoming
    LEFT JOIN opcode AS opcode USING (opcode)
    WHERE opcode.opcode IS NULL
      AND EXISTS ( SELECT ready.id
                     FROM tmp_ready AS ready
                     WHERE ready.id = incoming.id )

And now this query will run forever and never return:

"Unique  (cost=434008.43..434008.44 rows=1 width=6)"
"  ->  Sort  (cost=434008.43..434008.44 rows=1 width=6)"
"        Sort Key: incoming.opcode"
"        ->  Nested Loop Semi Join  (cost=1058.15..434008.42 rows=1 width=6)"
"              Join Filter: (incoming.id = pairs.ori)"
"              ->  Gather  (cost=1058.15..151427.56 rows=1 width=14)"
"                    Workers Planned: 3"
"                    ->  Hash Anti Join  (cost=58.15..150427.46 rows=1 width=14)"
"                          Hash Cond: (incoming.opcode = opcode.opcode)"
"                          ->  Parallel Seq Scan on incoming  (cost=0.00..147589.32 rows=741332 width=14)"
"                          ->  Hash  (cost=31.40..31.40 rows=2140 width=6)"
"                                ->  Seq Scan on opcode  (cost=0.00..31.40 rows=2140 width=6)"
"              ->  Seq Scan on tmp_ready ready  (cost=0.00..253869.27 rows=2296927 width=8)"

The LEFT JOIN is estimated to produce only one result row (we know there
can actually be millions), and therefore this goes into a Nested Loop that
will read tmp_ready over and over again.

Putting an index onto tmp_ready helps the matter - but that is not the root
cause here. The root cause is a wrong estimate: deleting all rows from
table "opcode" apparently does not change the planner's behaviour. The
planner still thinks that all values from the incoming.opcode column are
already present in opcode.opcode.

After various experiments, I finally managed to somehow kick ANALYZE into
producing working statistics, and now it looks like this:

"HashAggregate  (cost=554927.43..554927.44 rows=1 width=6)"
"  Group Key: incoming.opcode"
"  ->  Hash Semi Join  (cost=291554.87..549213.83 rows=2285442 width=6)"
"        Hash Cond: (incoming.id = ready.id)"
"        ->  Hash Anti Join  (cost=1.01..192070.20 rows=2286638 width=14)"
"              Hash Cond: (incoming.opcode = opcode.opcode)"
"              ->  Seq Scan on incoming  (cost=0.00..163157.29 rows=2298129 width=14)"
"              ->  Hash  (cost=1.00..1.00 rows=1 width=6)"
"                    ->  Seq Scan on opcode  (cost=0.00..1.00 rows=1 width=6)"
"        ->  Hash  (cost=253869.27..253869.27 rows=2296927 width=8)"
"              ->  Seq Scan on tmp_ready ready  (cost=0.00..253869.27 rows=2296927 width=8)"

The LEFT JOIN has the correct rowcount now, and obviously no nested loop is
attempted with that.

There are a couple of approaches to work around this, like:
 - just put an index on tmp_ready
 - just disallow the loop: SET enable_nestloop = 'f'

But those don't seem professional, as they don't tackle the actual place of
the defect. The question is: how do I get ANALYZE to actually do its work?
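(For completeness, the two workarounds mentioned above would look roughly
like this - a sketch only, the index name is made up:)

-- workaround 1: give the inner side of the nested loop an index to use
CREATE INDEX IF NOT EXISTS tmp_ready_id_idx ON tmp_ready (id);

-- workaround 2: forbid nested loops for the session running the INSERT
SET enable_nestloop = off;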
Let's investigate:

We have the original content of table "opcode":

 id | opcode
----+--------
 29 | NOTIFY
 30 | UPDATE
 31 | QUERY

And for certain the query works (and delivers 0 result rows):

"Unique  (cost=433951.35..433951.36 rows=1 width=6) (actual time=1938.400..1941.805 rows=0 loops=1)"
"  Buffers: shared hit=2694 read=137678"
"  ->  Sort  (cost=433951.35..433951.35 rows=1 width=6) (actual time=1938.399..1941.803 rows=0 loops=1)"
"        Sort Key: incoming.opcode"
"        Sort Method: quicksort  Memory: 25kB"
"        Buffers: shared hit=2694 read=137678"
"        ->  Nested Loop Semi Join  (cost=1001.07..433951.34 rows=1 width=6) (actual time=1938.387..1941.791 rows=0 loops=1)"
"              Join Filter: (incoming.id = pairs.ori)"
"              Buffers: shared hit=2694 read=137678"
"              ->  Gather  (cost=1001.07..151370.48 rows=1 width=14) (actual time=1938.386..1941.789 rows=0 loops=1)"
"                    Workers Planned: 3"
"                    Workers Launched: 3"
"                    Buffers: shared hit=2694 read=137678"
"                    ->  Hash Anti Join  (cost=1.07..150370.38 rows=1 width=14) (actual time=1928.520..1928.524 rows=0 loops=4)"
"                          Hash Cond: (incoming.opcode = opcode.opcode)"
"                          Buffers: shared hit=2694 read=137678"
"                          ->  Parallel Seq Scan on incoming  (cost=0.00..147589.32 rows=741332 width=14) (actual time=0.199..1651.435 rows=574232 loops=4)"
"                                Buffers: shared hit=2498 read=137678"
"                          ->  Hash  (cost=1.03..1.03 rows=3 width=6) (actual time=0.063..0.066 rows=3 loops=4)"
"                                Buckets: 1024  Batches: 1  Memory Usage: 9kB"
"                                Buffers: shared hit=4"
"                                ->  Seq Scan on opcode  (cost=0.00..1.03 rows=3 width=6) (actual time=0.043..0.047 rows=3 loops=4)"
"                                      Buffers: shared hit=4"
"              ->  Seq Scan on tmp_pairs pairs  (cost=0.00..253869.27 rows=2296927 width=8) (never executed)"
"Planning Time: 0.512 ms"
"Execution Time: 1941.862 ms"

We have this data in pg_class and pg_stats:

 relpages | reltuples
----------+-----------
        1 |         3

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds      | correlation
---------+------------+------------------+-------------------+-----------------------+-------------
 id      |         -1 |                  |                   | {29,30,31}            |           1
 opcode  |         -1 |                  |                   | {NOTIFY,QUERY,UPDATE} |         0.5

# delete from dnstap2.opcode;
DELETE 3

The stats have not changed, and the plan is still the same one, which will
now run forever.

# analyze dnstap2.opcode;
ANALYZE

 relpages | reltuples
----------+-----------
        1 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds      | correlation
---------+------------+------------------+-------------------+-----------------------+-------------
 id      |         -1 |                  |                   | {29,30,31}            |           1
 opcode  |         -1 |                  |                   | {NOTIFY,QUERY,UPDATE} |         0.5

The plan has now changed, and the query works as intended.

The next table shows the same behaviour, so it also works. But then the
next table, "proto", is empty, and here the query does run forever:

 relpages | reltuples
----------+-----------
        0 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds    | correlation
---------+------------+------------------+-------------------+---------------------+-------------
 id      |         -1 |                  |                   | {37,38,39,40}       |           1
 proto   |         -1 |                  |                   | {TCP,TCP6,UDP,UDP6} |           1

# analyze dnstap2.proto;
ANALYZE

This does not change anything, and the query still runs forever!
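(For the record, the column statistics shown in this walk-through come from
the pg_stats view, roughly like this - a sketch, and the output formatting
differs slightly from what psql prints:)

-- sketch: column statistics as quoted above
SELECT attname, n_distinct, most_common_vals, most_common_freqs,
       histogram_bounds, correlation
  FROM pg_stats
  WHERE schemaname = 'dnstap2' AND tablename = 'opcode';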
Let's look closer:

 "oid"      "what"       "nspname"  "relname"  "filename"  "relpages"
------------------------------------------------------------------------------------------
 "9625390"  "SELF"       "dnstap2"  "proto"    "9625390"   0
 "9625394"  "TOAST"      "dnstap2"  "proto"    "9625394"   0
 "9625396"  "TOASTIDX"   "dnstap2"  "proto"    "9625396"   1
 "9625399"  "proto_uniq" "dnstap2"  "proto"    "9625399"   2

# dir 962539[0469]*
-rw-------  1 770  770      0 Feb 11 05:37 9625390
-rw-------  1 770  770  16384 Feb 11 05:41 9625390_fsm
-rw-------  1 770  770      0 Feb 11 05:37 9625390_vm
-rw-------  1 770  770      0 Feb  7 11:59 9625394
-rw-------  1 770  770   8192 Feb  8 04:53 9625396
-rw-------  1 770  770  16384 Feb 11 05:41 9625399

At 05:37 this was the nightly VACUUM ANALYZE (there is no autovacuum here).

Next table: "addr"

 relpages | reltuples
----------+-----------
        0 |         0

[stats are lengthy]

And it does not work.

Next table: "status"

 relpages | reltuples
----------+-----------
        0 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds                                      | correlation
---------+------------+------------------+-------------------+-------------------------------------------------------+-------------
 id      |         -1 |                  |                   | {55,56,57,58,59,60}                                   |           1
 status  |         -1 |                  |                   | {BADCOOKIE,FORMERR,NOERROR,NXDOMAIN,REFUSED,SERVFAIL} |           1

And it does not work.

Okay, let's grab the "opcode" table and see further...

# VACUUM ANALYZE dnstap2.opcode;
VACUUM

 relpages | reltuples
----------+-----------
        0 |         0

KAPUTT! The query now runs forever.

# ANALYZE dnstap2.opcode;
ANALYZE

Nada, the query still runs forever.

# insert into dnstap2.opcode (opcode) Values ('crap');
INSERT 0 1

 relpages | reltuples
----------+-----------
        0 |         0

# analyze dnstap2.opcode;
ANALYZE

 relpages | reltuples
----------+-----------
        1 |         1

Too bad, the query still runs forever. :(

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
---------+------------+------------------+-------------------+------------------+-------------
 id      |         -1 |                  |                   |                  |
 opcode  |         -1 |                  |                   |                  |

Oops. When did that happen?

# vacuum analyze dnstap2.opcode;
VACUUM
# analyze dnstap2.opcode;
ANALYZE
# delete from dnstap2.opcode;
DELETE 1
# analyze dnstap2.opcode;
ANALYZE

 relpages | reltuples
----------+-----------
        1 |         0

 attname | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation
---------+------------+------------------+-------------------+------------------+-------------
 id      |         -1 |                  |                   |                  |
 opcode  |         -1 |                  |                   |                  |

Okay, my query works again!

Friends, I am not clear on WHAT exactly goes wrong here, but my gut feeling
is very strong that something really does go wrong here.

What exactly do I need to do after deleting all rows from tables? ANALYZE,
obviously. What else? VACUUM, apparently, absolutely NOT. But what if it
runs in the interim from a cron schedule? Then I would need to manually
insert some values into the concerned tables, run ANALYZE, delete the
values again, run ANALYZE, and only then could operations resume. This
cannot be the idea.
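(For illustration only, that manual dance would look something like the
following - the dummy value is made up, and this is exactly the kind of
thing that should not be necessary:)

-- sketch of the insert/analyze/delete/analyze workaround described above
INSERT INTO dnstap2.opcode (opcode) VALUES ('dummy');
ANALYZE dnstap2.opcode;
DELETE FROM dnstap2.opcode WHERE opcode = 'dummy';
ANALYZE dnstap2.opcode;
-- per the behaviour described above, this ends with relpages = 1,
-- reltuples = 0, and the query plans correctly again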