On 26/04/13 14:56, Gavin Flower wrote:
On 26/04/13 14:33, Mark Kirkwood wrote:
Recently we encountered the following unhappy sequence of events:
1/ system running happily
2/ batch load into table begins
3/ very quickly (some) preexisting queries on said table go orders of
magnitude slower
4/ database instance becomes unresponsive
5/ application outage
After chasing down a few false leads, we've isolated the cause to the
following:
The accumulating in-progress row changes are such that previously
optimal plans are no longer optimal. This situation will fix itself
when the next autoanalyze happens (and a new plan will be chosen) -
but that cannot occur until the batch load is completed and committed
(approx. 70 seconds). During that time the performance degradation for
queries still using the old plan is enough to cripple the server.
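As an aside, one rough way to watch this window while it is open (a
sketch only, not part of the attached test case) is to compare the
physical size of the table, which grows immediately as the uncommitted
batch extends it, against the analyze timestamps, which stay stale
until the load commits and the stats message is processed:

-- Sketch: run in a third session while the batch load is still uncommitted.
-- The on-disk size grows, but last_analyze / last_autoanalyze do not move.
SELECT pg_size_pretty(pg_relation_size('plan')) AS on_disk_size,
       n_live_tup,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'plan';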
Now that we know what is happening we can work around it (a sketch of
one such workaround follows below). But I'm wondering - is there any
way (or, if not, should there be one) to let Postgres handle this
automatically? I experimented with a quick hack to
src/backend/commands/analyze.c (attached) that lets another session's
ANALYZE see in-progress rows - which works, but a) may cause other
problems and b) does not help autoanalyze, which has to wait for the
COMMIT + stats message.
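For illustration, one workaround along these lines (a sketch only -
the chunk size and predicate here are assumptions, not what the real
application does) is to split the load into smaller committed batches
and run an explicit ANALYZE after each one, so fresh statistics are
available long before the full load would otherwise finish:

-- Sketch of a chunked load (illustrative only):
BEGIN;
INSERT INTO plan
SELECT id + 2000001, typ, current_date + id * '1 seconds'::interval, val
FROM plan
WHERE id <= 500000;    -- hypothetical chunk boundary
COMMIT;
ANALYZE plan;          -- statistics now reflect the committed chunk
-- ...repeat with the next id range until the load is complete...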
I've attached a (synthetic) test case that shows the issue; I'll
reproduce the output below to hopefully make the point obvious (a
rough sketch of the setup appears after the table definition):
Table "public.plan"
Column | Type | Modifiers
--------+-----------------------------+-----------
id | integer | not null
typ | integer | not null
dat | timestamp without time zone |
val | text | not null
Indexes:
"plan_id" UNIQUE, btree (id)
"plan_dat" btree (dat)
"plan_typ" btree (typ)
[Session 1]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using plan_dat on plan  (cost=0.00..265.47 rows=55 width=117) (actual time=0.130..4.409 rows=75 loops=1)
   Index Cond: (dat IS NOT NULL)
   Filter: (typ = 3)
   Rows Removed by Filter: 5960
 Total runtime: 4.440 ms
(5 rows)
[Session 2]
BEGIN;
INSERT INTO plan
SELECT id + 2000001, typ, current_date + id * '1 seconds'::interval, val
FROM plan
;
[Session 1]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using plan_dat on plan  (cost=0.00..551.35 rows=91 width=117) (actual time=0.131..202.699 rows=75 loops=1)
   Index Cond: (dat IS NOT NULL)
   Filter: (typ = 3)
   Rows Removed by Filter: 5960
 Total runtime: 202.729 ms
(5 rows)
[Session 2]
COMMIT;
[Session 1 ... wait for autoanalyze to finish, then]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on plan  (cost=407.87..44991.95 rows=10116 width=117) (actual time=2.692..6.582 rows=75 loops=1)
   Recheck Cond: (typ = 3)
   Filter: (dat IS NOT NULL)
   Rows Removed by Filter: 19925
   ->  Bitmap Index Scan on plan_typ  (cost=0.00..405.34 rows=20346 width=0) (actual time=2.573..2.573 rows=20000 loops=1)
         Index Cond: (typ = 3)
 Total runtime: 6.615 ms
Regards
Mark
Hmm...
You need to specify:
1. version of Postgres
2. Operating system
3. changes to postgresql.conf
4. CPU/RAM etc
5. anything else that might be relevant
While in general you are quite correct, in the above case (particularly
as I've supplied a test case) it should be pretty obvious that any
moderately modern version of Postgres on any supported platform will
exhibit this.
I produced the above test case on Postgres 9.2.4 on Ubuntu 13.04, with
no changes to the default postgresql.conf.
Our actual production server is a 32-CPU box with 512GB of RAM and 16
SAS SSDs, running Postgres 9.2.4 on Ubuntu 12.04. And yes, there are
quite a few changes from the defaults there - and I wasted quite a lot
of time chasing apparent issues with high CPU and RAM usage, changing
various configs to see if they helped, before identifying that the real
issue was in-progress row changes and planner statistics. Also, in the
"real" case with much bigger datasets, the difference between the plan
being optimal and *not* being optimal is a factor of 2000x in elapsed
time instead of a mere 50x!
regards
Mark