Checkpoint execution overrun impact?

drvillo <f.vivoli@xxxxxxxxx> · Fri, 22 Apr 2011 02:21:35 -0700 (PDT)

Hi all

I'm trying to track down the causes of an application crash and reviewing PG
logs I'm seeing this:

2011-04-22 06:00:16 CEST LOG:  checkpoint complete: wrote 140 buffers
(3.4%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.937
s, sync=1.860 s, total=29.906 s
2011-04-22 06:04:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:05:20 CEST LOG:  checkpoint complete: wrote 161 buffers
(3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=32.188
s, sync=0.890 s, total=33.157 s
2011-04-22 06:09:20 CEST ERROR:  update or delete on table "automezzo"
violates foreign key constraint "fk77e102615cffe609" on table "carico"
2011-04-22 06:09:20 CEST DETAIL:  Key (id)=(13237) is still referenced from
table "carico".
2011-04-22 06:09:20 CEST STATEMENT:  delete from Automezzo where id=$1
2011-04-22 06:09:20 CEST ERROR:  current transaction is aborted, commands
ignored until end of transaction block
2011-04-22 06:09:20 CEST STATEMENT:  SELECT key from CONFIGURAZIONE LIMIT 0
2011-04-22 06:09:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:10:15 CEST LOG:  checkpoint complete: wrote 137 buffers
(3.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.406
s, sync=0.781 s, total=28.360 s
2011-04-22 06:14:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:15:14 CEST LOG:  checkpoint complete: wrote 136 buffers
(3.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.078
s, sync=0.797 s, total=27.984 s
2011-04-22 06:19:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:20:20 CEST LOG:  checkpoint complete: wrote 164 buffers
(4.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=32.938
s, sync=0.594 s, total=33.625 s
2011-04-22 06:24:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:25:20 CEST LOG:  checkpoint complete: wrote 160 buffers
(3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=31.875
s, sync=0.953 s, total=32.922 s
2011-04-22 06:29:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:30:14 CEST LOG:  checkpoint complete: wrote 130 buffers
(3.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=25.922
s, sync=0.937 s, total=26.937 s
2011-04-22 06:34:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:35:15 CEST LOG:  checkpoint complete: wrote 132 buffers
(3.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=26.578
s, sync=1.250 s, total=27.984 s
2011-04-22 06:39:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:40:15 CEST LOG:  checkpoint complete: wrote 136 buffers
(3.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.250
s, sync=0.984 s, total=28.453 s
2011-04-22 06:44:47 CEST LOG:  checkpoint starting: time
2011-04-22 06:51:41 CEST LOG:  checkpoint complete: wrote 108 buffers
(2.6%); 0 transaction log file(s) added, 0 removed, 0 recycled;
write=409.007 s, sync=4.672 s, total=414.070 s
2011-04-22 06:55:42 CEST LOG:  could not receive data from client: No
connection could be made because the target machine actively refused it.

<goes on like the above until application restart>

Now, I'm fairly new with PG and checkpoint internals but even after reading
around (PG docs,
http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm and
various other posts from this ml) I haven't been able to figure out the
following:

-given the configuration attached (which is basically a vanilla one) and the
number of buffers written at each execution, are these execution times
normal or above average? 
-in the case of the execution that overruns past the timeout, what are the
implications wrt the client application? 
-AFAIU client connections are basically stalled during checkpoints. Is it
reasonable to infer that the fact that the application blocking on a
getConnection() might be related to checkpoints being executed?
-considering some tuning on the PG side, should I try increasing
checkpoint_timeout and rising checkpoint_completion_target to lessen the
impact of IO on the client or should I shorten the period so there's less
stuff to write? from the number of buffers written on average I'd assume the
first option is the one to go for but I might miss some bit of reasoning
here...

The server both PG and the application server (Tomcat) are running on is a
Stratus ft2300 machine, which I think is setup to do RAID1. I've read about
RAID5 not being a wise setup for disks hosting PG, what about RAID1?

Postgres version is 8.3
http://postgresql.1045698.n5.nabble.com/file/n4332601/postgresql.conf
postgresql.conf 

Any help will be greatly appreciated:) Thanks, and sorry for the long post.

cheers
Francesco

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Checkpoint-execution-overrun-impact-tp4332601p4332601.html
Sent from the PostgreSQL - performance mailing list archive at Nabble.com.

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance