Hi,

On 5.8.2013 17:55, Tasos Petalas wrote:
>
> Seems most of the I/O is caused by SELECT backend processes (READ),
> whereas (WRITE) requests of the wal writer and checkpointer processes
> do not appear as top I/O processes (correct me if I am wrong)
>
> E.g. check the following heavy write process that reports 0% I/O ...!
>
> 14:09:40   769 be/4 enterpri   0.00 B/s  33.65 M/s  0.00 %  0.00 %
>            postgres: wal writer process

That's because the WAL writer does sequential I/O (writes), which is a
perfect match for SAS drives. OTOH the queries do a lot of random
reads, which is a terrible match for spinners.

>     That however still doesn't say which processes are responsible
>     for that. Is that background writer, backends running queries or
>     what? The iotop should give you answer to this (or at least a
>     hint).
>
> It seems most of the I/O is reported from backends running heavy
> concurrent select queries (see iotop attachment in previous email)

Yes, that seems to be the case.

>     Also, how are these volumes defined? Do they use distinct sets of
>     disks? How many disks are used for each volume?
>
> These are LUNs from the SAN (we have dedicated 16 SAS 2.5'' disks in
> RAID-10 topology in Storage)

I do understand these are LUNs from the SAN. I was asking whether there
are separate sets of disks for the data directory (which you mentioned
to be RAID-10) and pg_archives (which you mentioned to be RAID-5).
Although I doubt it'd be possible to use the same disk for two LUNs.

>     > Yes we are using 15K SAS disks in RAID 10. (253-2 dev refers to
>     > sar output for disks)
>
>     OK, so the pg_archives is probably for xlog archive, right?
>
> NO.
> /pg_archives is the target mount_point where we copy archive_logs to
> (archive_command = 'test ! -f /pg_archives/%f && cp %p /pg_archives/%f')

... which is exactly what a WAL archive is. That's why the GUC is
called archive_command.

>     I've checked the conf, and I think you should really consider
>     increasing checkpoint_segments - it's set to 3 (= 64MB) but I
>     think something like 32 (= 512MB) or even more would be more
>     appropriate.
>
> We use EDB dynatune. The actual setting can be found in the file
> (Ticket.Usual.Info.27.07.13.txt) of the initial e-mail --> check the
> "show all;" section.
> Current checkpoint_segments is set to 64.

OK, I'm not familiar with dynatune, and I got confused by the
postgresql.conf that you sent. 64 seems fine to me.

>     I see you've enabled log_checkpoints - can you check your logs
>     how often the checkpoints happen?
>
> This is the output of the checkpoints during peak hours (avg. every
> 2-5 minutes)
>
> 2013-08-02 14:00:20 UTC [767]: [19752]: [0] LOG:  checkpoint complete:
> wrote 55926 buffers (5.3%); 0 transaction log file(s) added, 0
> removed, 41 recycled; write=220.619 s, sync=5.443 s, total=226.152 s;
> sync files=220, longest=1.433 s, average=0.024 s
> 2013-08-02 14:05:14 UTC [767]: [19754]: [0] LOG:  checkpoint complete:
> wrote 109628 buffers (10.5%); 0 transaction log file(s) added, 0
> removed, 31 recycled; write=209.714 s, sync=9.513 s, total=219.252 s;
> sync files=222, longest=3.472 s, average=0.042 s

Meh, seems OK to me. My earlier suggestion was based on the incorrect
number of checkpoint segments ...

>     Also, can you check the pg_stat_bgwriter view? I'd bet the value
>     in checkpoints_timed is very low, compared to checkpoints_req. Or
>     even better, get the values from this view before / after running
>     the batch jobs.
>
> Results during load:
> checkpoints_timed: 12432, checkpoints_req: 3058

Again, seems fine.
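As a side note, a minimal way to take the before/after snapshot
suggested above might look like the query below. This is only a sketch,
assuming a stock pg_stat_bgwriter view (available since PostgreSQL 8.3);
run it once before and once after the batch jobs and compare the deltas:

    SELECT now() AS sampled_at,   -- when this snapshot was taken
           checkpoints_timed,     -- checkpoints started by checkpoint_timeout
           checkpoints_req,       -- checkpoints forced by xlog growth (checkpoint_segments)
           buffers_checkpoint,    -- buffers written by the checkpointer
           buffers_clean,         -- buffers written by the background writer
           buffers_backend        -- buffers written directly by backends
      FROM pg_stat_bgwriter;

A large increase in buffers_backend relative to buffers_checkpoint and
buffers_clean during the run would suggest the backends are having to
write out dirty buffers themselves.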
>     In the afternoon it's a different story - for 253-2 it looks like
>     this:
>
>                   DEV      tps  rd_sec/s  wr_sec/s  await   %util
>     15:50:01 dev253-2  4742.91  33828.98  29156.17  84.84  105.14
>     16:00:01 dev253-2  2781.05  12737.41  18878.52  19.24   80.53
>     16:10:01 dev253-2  3661.51  20950.64  23758.96  36.86   99.03
>     16:20:01 dev253-2  5011.45  32454.33  31895.05  72.75  102.38
>     16:30:01 dev253-2  2638.08  14661.23  17853.16  25.24   75.64
>     16:40:01 dev253-2  1988.95   5764.73  14190.12  45.05   58.80
>     16:50:01 dev253-2  2185.15  88296.73  11806.38   7.46   84.37
>     17:00:01 dev253-2  2031.19  12835.56  12997.34   8.90   82.62
>     17:10:01 dev253-2  4009.24  34288.71  23974.92  38.07  103.01
>     17:20:01 dev253-2  3605.86  26107.83  22457.41  45.76   90.90
>     17:30:01 dev253-2  2550.47   7496.85  18267.07  19.10   65.87
>
> This is when the actual problem arises

Well, then I think it's mostly about the SELECT queries.

>     What I think you could/should do:
>
>     * move pg_xlog to a separate device (not simply a volume on the
>       SAN, sharing disks with the other volumes - that won't give you
>       anything)
>
> Unfortunately we cannot do so at the moment (all available SAN
> resources are assigned to the pg_data directory of the server)
>
>     I'd expect these changes to improve the afternoon peak, as it's
>     doing about 50% writes. However I would not expect this to improve
>     the morning peak, because that's doing a lot of reads (not
>     writes).
>
> The afternoon peak is what we need to troubleshoot (will check if we
> can assign pg_xlog to a different LUN - not an option currently)

OK, understood. It's difficult to predict the gain, and given the iotop
output it might even cause harm.

> Will SSD improve write performance? We are thinking of moving in this
> direction.

It'll certainly improve random I/O in general, which is the main issue
with the SELECT queries. The sequential read/write improvement probably
won't be that significant.

Tomas
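If you want to confirm which SELECT queries are behind the random
reads, one possible follow-up is sketched below. It assumes the
pg_stat_statements extension is installed (it is not mentioned anywhere
in this thread, so treat it as a suggestion) and a server running
PostgreSQL 9.2 or later, since the buffer columns do not exist in older
releases:

    -- statements ranked by blocks that were not found in shared_buffers
    -- (i.e. had to come from the OS cache or the disks)
    SELECT calls,
           shared_blks_read,   -- blocks read from outside shared_buffers
           shared_blks_hit,    -- blocks served from shared_buffers
           query
      FROM pg_stat_statements
     ORDER BY shared_blks_read DESC
     LIMIT 10;

Statements with a high shared_blks_read count are the most likely
drivers of the random read load visible in iotop and sar.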