On 26/04/13 14:56, Gavin Flower wrote:
On 26/04/13 14:33, Mark Kirkwood wrote:
Recently we encountered the following unhappy sequence of events:
1/ system running happily
2/ batch load into table begins
3/ very quickly (some) preexisting queries on said table go orders of
magnitude slower
4/ database instance becomes unresponsive
5/ application outage
After chasing down a few false leads, we've isolated the cause to the
following:
The accumulating in-progress row changes are such that previously
optimal plans are no longer optimal. This situation will fix itself
when the next autoanalyze happens (and a new plan will be chosen) -
but that cannot occur until the batch load is completed and committed
(approx. 70 seconds). During that time the performance degradation for
queries still using the old plan is enough to cripple the server.
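As an aside, one rough way to watch this window while it is open (a
sketch only, not part of the attached test case) is to compare the
physical size of the table, which grows immediately as the uncommitted
batch extends it, against the analyze timestamps, which stay stale
until the load commits and the stats message is processed:

-- Sketch: run in a third session while the batch load is still uncommitted.
-- The on-disk size grows, but last_analyze / last_autoanalyze do not move.
SELECT pg_size_pretty(pg_relation_size('plan')) AS on_disk_size,
       n_live_tup,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'plan';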
Now that we know what is happening we can work around it (a sketch of
one such workaround follows below). But I'm wondering - is there any
way (or, if not, should there be one) to let Postgres handle this
automatically? I experimented with a quick hack to
src/backend/commands/analyze.c (attached) that lets another session's
ANALYZE see in-progress rows - which works, but a) may cause other
problems and b) does not help autoanalyze, which has to wait for the
COMMIT + stats message.
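For illustration, one workaround along these lines (a sketch only -
the chunk size and predicate here are assumptions, not what the real
application does) is to split the load into smaller committed batches
and run an explicit ANALYZE after each one, so fresh statistics are
available long before the full load would otherwise finish:

-- Sketch of a chunked load (illustrative only):
BEGIN;
INSERT INTO plan
SELECT id + 2000001, typ, current_date + id * '1 seconds'::interval, val
FROM plan
WHERE id <= 500000;    -- hypothetical chunk boundary
COMMIT;
ANALYZE plan;          -- statistics now reflect the committed chunk
-- ...repeat with the next id range until the load is complete...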
I've attached a (synthetic) test case that shows the issue; I'll
reproduce the output below to hopefully make the point obvious (a
rough sketch of the setup appears after the table definition):
Table "public.plan"
Column | Type | Modifiers
--------+-----------------------------+-----------
id | integer | not null
typ | integer | not null
dat | timestamp without time zone |
val | text | not null
Indexes:
"plan_id" UNIQUE, btree (id)
"plan_dat" btree (dat)
"plan_typ" btree (typ)
[Session 1]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using plan_dat on plan  (cost=0.00..265.47 rows=55 width=117) (actual time=0.130..4.409 rows=75 loops=1)
   Index Cond: (dat IS NOT NULL)
   Filter: (typ = 3)
   Rows Removed by Filter: 5960
 Total runtime: 4.440 ms
(5 rows)
[Session 2]
BEGIN;
INSERT INTO plan
SELECT id + 2000001, typ, current_date + id * '1 seconds'::interval, val
FROM plan
;
[Session 1]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using plan_dat on plan  (cost=0.00..551.35 rows=91 width=117) (actual time=0.131..202.699 rows=75 loops=1)
   Index Cond: (dat IS NOT NULL)
   Filter: (typ = 3)
   Rows Removed by Filter: 5960
 Total runtime: 202.729 ms
(5 rows)
[Session 2]
COMMIT;
[Session 1 ... wait for autoanalyze to finish, then]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on plan  (cost=407.87..44991.95 rows=10116 width=117) (actual time=2.692..6.582 rows=75 loops=1)
   Recheck Cond: (typ = 3)
   Filter: (dat IS NOT NULL)
   Rows Removed by Filter: 19925
   ->  Bitmap Index Scan on plan_typ  (cost=0.00..405.34 rows=20346 width=0) (actual time=2.573..2.573 rows=20000 loops=1)
         Index Cond: (typ = 3)
 Total runtime: 6.615 ms
Regards
Mark
Hmm...
You need to specify:
1. version of Postgres
2. Operating system
3. changes to postgresql.conf
4. CPU/RAM etc
5. anything else that might be relevant
While in general you are quite correct, in the above case (particularly
as I've supplied a test case) it should be pretty obvious that any
moderately modern version of Postgres on any supported platform will
exhibit this.
I produced the above test case on Postgres 9.2.4 on Ubuntu 13.04, with
no changes to the default postgresql.conf.
Our actual production server is a 32-CPU box with 512GB of RAM and 16
SAS SSDs, running Postgres 9.2.4 on Ubuntu 12.04. And yes, there are
quite a few changes from the defaults there - and I wasted quite a lot
of time chasing apparent issues with high CPU and RAM usage, changing
various configs to see if they helped, before identifying that the real
issue was in-progress row changes and planner statistics. Also, in the
"real" case with much bigger datasets, the difference between the plan
being optimal and *not* being optimal is a factor of 2000x in elapsed
time instead of a mere 50x!
regards
Mark