Re: Slow queries / commits, mis-configuration or hardware issues?

"Tomas Vondra" <tv@xxxxxxxx> · Mon, 14 Nov 2011 20:22:52 +0100

On 14 Listopad 2011, 19:16, Cody Caughlan wrote:
> shared_buffers = 3584MB
> wal_buffers = 16MB
> checkpoint_segments = 32
> max_wal_senders = 10
> checkpoint_completion_target = 0.9
> wal_keep_segments = 1024
> maintenance_work_mem = 256MB
> work_mem = 88MB
> shared_buffers = 3584MB
> effective_cache_size = 10GB

Seems reasonable, although I'd bump up the checkpoint_timeout (the 5m is
usually too low).

> The PGDATA dir is a RAID10 on 4 local ("ephemeral" in EC2 speak)
> drives. I ran some dd tests and received the following output:
>
> --- WRITING ---
> root@sql03:/data# time sh -c "dd if=/dev/zero of=/data/tmp/bigfile
> bs=8k count=4000000 && sync"
> 4000000+0 records in
> 4000000+0 records out
> 32768000000 bytes (33 GB) copied, 670.663 s, 48.9 MB/s
>
> real	11m52.199s
> user	0m2.720s
> sys	0m45.330s

This measures sequential write performance (and the same holds for the
read test). We need to know the random I/O performance too - use bonnie++
or similar tool.

Based on the AWS benchmarks I've seen so far, I'd expect about 90 MB/s for
sequential read/writes, and about twice that performance for a 4-drive
RAID10. So while the reads (211 MB/s) seem perfectly OK, the writes
(50MB/s) are rather slow. Have you measured this on an idle system, or
when the db was running?

See for example this:

[1] http://victortrac.com/EC2_Ephemeral_Disks_vs_EBS_Volumes
[2]
http://www.gabrielweinberg.com/blog/2011/05/raid0-ephemeral-storage-on-aws-ec2.html

> I have enabled log_checkpoints and here is a recent sample from the log:
> 2011-11-14 17:39:15 UTC pid:3965 (-0) LOG:  checkpoint starting: time
> 2011-11-14 17:43:49 UTC pid:3965 (-0) LOG:  checkpoint complete: wrote
> 16462 buffers (3.6%); 0 transaction log file(s) added, 0 removed, 9
> recycled; write=269.978 s, sync=4.106 s, total=274.117 s; sync
> files=82, longest=2.943 s, average=0.050 s

Nothing special here - this just says that the checkpoints were timed and
finished on time (the default checkpoint timeout is 5 minutes, with
completion target 0.9 the expected checkpoint time is about 270s). Not a
checkpoint issue, probably.

> I've been collecting random samples from pg_stat_bgwriter:
> https://gist.github.com/4faec2ca9a79ede281e1

Although it's a bit difficult to interpret this (collect the data in
regular intervals - e.g. every hour - and post the differences, please),
but it seems reasonable.

> So given all this information (if you need more just let me know), is
> there something fundamentally wrong or mis-configured? Do I have an
> I/O issue?

Probably - the discrepancy between read/write performance is a bit
suspicious.

Try to watch the I/O performance when this happens, i.e. run "iostat -x"
and watch the output (especially %util, r_await, w_await) and post several
lines of the output.

Tomas

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance