Hello Alban, Yes I think I'm saturating the HDD IO but the problem is that the only thing that runs on that server is postgres itself. I have two tablespeces one on a SSD software raid 1 another one on a HDD software raid 1. The disk which is saturating is the HDD and the pg_xlog directory is on the SSD. So the problem seems to be that the HDD is saturating while the SSD is not. The server is able to write wal files on the SSD really fast while writing data from the wal files to the HDD is much slower. The question is how do i solve this issue. It must be a way to tell postgres to regulate the application making the INSERT requests even if there is no IO problem on the pg_xlog/SSD partition (but at the same time there is a huge one on the data/HDD partition). Something like "if total pg_xlog size is more than 20 GB" let all INSERT queries wait till the data is commited on the HDD. For the only solution I see is to manually restart the server every time the size of pg_xlog is close to filling the partition. This is definitely not an option. Michal On Friday 04 October 2013 09:01:53 Alban Hertroys wrote: > On Oct 3, 2013, at 23:56, Michal TOMA <mt@xxxxxxxxxx> wrote: > > I have a problem on my pg 9.2.4 setup (OpenSuse 12.2, kernel 3.2.13). > > My pg_xlog directory is growing uncontrolably untill it fills the > > partition. The database is under heavy write load and is spread on two > > tablesapces one on a ssd software raid1 partition and a second one on a > > hdd software raid1 partition. I have no wal archiving enabled nor any > > replication. > > > > I have tried different checkpoint related parameters without any > > noticable improvement. Now I have: > > checkpoint_completion_target = 0.9 > > wal_buffers = 8MB > > checkpoint_segments = 16 > > checkpoint_timeout = 20min > > shared_buffers = 2GB > > log_checkpoints = on > > > > This is what I can see in the log: > > 2013-10-03 13:58:56 CEST LOG: checkpoint starting: xlog > > 2013-10-03 13:59:56 CEST LOG: checkpoint complete: wrote 448 buffers > > (0.2%); 0 transaction log file(s) added, 9 removed, 18 recycled; > > write=39.144 s, sync=21.136 s, total=60.286 s; sync files=380, > > longest=14.517 s, average=0.055 s > > > > 2013-10-03 14:04:07 CEST LOG: checkpoint starting: xlog > > 2013-10-03 15:27:01 CEST LOG: checkpoint complete: wrote 693 buffers > > (0.3%); 0 transaction log file(s) added, 0 removed, 16 recycled; > > write=90.775 s, sync=4883.295 s, total=4974.074 s; sync files=531, > > longest=152.855 s, average=9.196 s 2013-10-03 15:27:01 CEST LOG: > > checkpoint starting: xlog time > > 2013-10-03 19:06:30 CEST LOG: checkpoint complete: wrote 3467 buffers > > (1.3%); 0 transaction log file(s) added, 0 removed, 16 recycled; > > write=122.555 s, sync=13046.077 s, total=13168.637 s; sync files=650, > > longest=234.697 s, average=20.069 s 2013-10-03 19:06:30 CEST LOG: > > checkpoint starting: xlog time > > 2013-10-03 22:30:25 CEST LOG: checkpoint complete: wrote 10198 buffers > > (3.9%); 0 transaction log file(s) added, 216 removed, 33 recycled; > > write=132.229 s, sync=12102.311 s, total=12234.608 s; sync files=667, > > longest=181.374 s, average=18.144 s 2013-10-03 22:30:25 CEST LOG: > > checkpoint starting: xlog time > > I'm not too familiar with checkpoint logging output, but from the looks of > it you're literally spending hours on syncing checkpoints. > > Are those disks on a RAID controller with a failed cache battery or > something? You aren't using RAID-5, are you? > > > ------------------------------------------------------------------------- > >------------------------------------ > > > > When the server is up and running under the usual load I get the > > following results: > > ------------------------------------------------------------------------- > >------------------------------------ 2 seconds per test > > O_DIRECT supported on this platform for open_datasync and open_sync. > > > > Compare file sync methods using one 8kB write: > > (in wal_sync_method preference order, except fdatasync > > is Linux's default) > > open_datasync 0.369 ops/sec > > fdatasync 0.575 ops/sec > > fsync 0.125 ops/sec > > fsync_writethrough n/a > > open_sync 0.222 ops/sec > > > > Compare file sync methods using two 8kB writes: > > (in wal_sync_method preference order, except fdatasync > > is Linux's default) > > open_datasync 0.383 ops/sec > > fdatasync 2.171 ops/sec > > fsync 1.318 ops/sec > > fsync_writethrough n/a > > open_sync 0.929 ops/sec > > > > Compare open_sync with different write sizes: > > (This is designed to compare the cost of writing 16kB > > in different write open_sync sizes.) > > 1 * 16kB open_sync write 0.079 ops/sec > > 2 * 8kB open_sync writes 0.041 ops/sec > > 4 * 4kB open_sync writes 0.194 ops/sec > > 8 * 2kB open_sync writes 0.013 ops/sec > > 16 * 1kB open_sync writes 0.005 ops/sec > > > > Test if fsync on non-write file descriptor is honored: > > (If the times are similar, fsync() can sync data written > > on a different descriptor.) > > write, fsync, close 0.098 ops/sec > > write, close, fsync 0.067 ops/sec > > > > Non-Sync'ed 8kB writes: > > write 0.102 ops/sec > > ------------------------------------------------------------------------- > >------------------------------------ > > Those numbers look bad. > > Are these the SSD's or the software RAID? > > It's almost as if you're saturating your disk I/O bandwidth. What hardware > is involved here? Or is it a kernel limitation, perhaps? > > > I need to tell to the server to limit the amount of wal files in pg_xlog > > somehow whatever the efect on the performance could be. > > I think more's at play here. Unfortunately, if it's not directly related to > the things I mentioned I can't help much. I'm a bit out of my league here > though - I already made lots of assumptions about how to interpret this > data. > > Cheers, > > Alban Hertroys > -- > If you can't see the forest for the trees, > cut the trees and you'll find there is no forest. -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general