Thanks for the quick follow-ups, folks .... please note that I am deliberately running a setup without commit guarantees, so a lot of the conventional advice about not doing small writes isn't applicable, but I do want to understand more about how this pattern affects PG internals even when the I/O is smoothed out.
By "commit consistency off" I am referring to the setting "synchronous_commit = off" in postgresql.conf .... IIRC this should mean semantically that a DB crash may lose data that was reported back to the app as successfully committed, but will result in a consistent state on reboot and recovery. In this case IIUC the "120 commits per second per drive" limit does not apply, and I hope the advice about testing fsync is similarly not applicable to my case. Also, IIUC that settings like commit_siblings and commit_delay should be ignored by PG in my case.
I would be interested in learning what the *in-memory* constraints and costs are on the PG server side of doing a lot of small commits when sync writes are off, e.g. the implications for the locking system, and whether this can affect the long queries on the other tables beyond general resource contention.
The pattern of lots of tiny transactions is semantically correct for the app, and I am using a JDBC prepared statement on the Java side, which I believe the PG driver will turn into a server-side (pre-compiled) prepared statement after enough uses (it does NOT do so on the first few hits). This should in theory be even cheaper than a multi-row INSERT ... VALUES, which is all text and has to be parsed each time.
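For concreteness, the insert path looks roughly like this — a sketch, not our actual code; the table/column names are invented, and the cast to org.postgresql.PGStatement just shows how the server-side prepare threshold (default 5 executions in pgJDBC) could be lowered if those first few text-parsed executions mattered:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Timestamp;
    import org.postgresql.PGStatement;

    public class RawPointWriter {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost:5432/perfdb", "perf", "secret");
            conn.setAutoCommit(true);   // one transaction per data point, as today

            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO perf_raw_2012_01_24 (item_id, ts, value) VALUES (?, ?, ?)");
            // pgJDBC switches to a named server-side prepared statement only after
            // prepareThreshold executions of the same PreparedStatement (default 5).
            ((PGStatement) ps).setPrepareThreshold(1);

            // One hypothetical data point
            ps.setLong(1, 42L);
            ps.setTimestamp(2, new Timestamp(System.currentTimeMillis()));
            ps.setDouble(3, 3.14);
            ps.executeUpdate();         // autocommit: this execute is its own commit

            ps.close();
            conn.close();
        }
    }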
However, if necessary for performance I can bundle the inserts into slightly larger transactions. Write failures are due only to occasional duplicates (same primary key) coming from upstream and are pretty rare, and in practice losing a batch of, say, 100 of these records occasionally is not a big deal in my world (ignoring the sound of cringing DBAs :) so I could afford to bundle the inserts into transactions and just drop a whole bundle if any single write hits a primary key collision.
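If I do go that way, the writer would look something like the sketch below (names invented; it simply discards the whole bundle on the first duplicate key rather than retrying row by row):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;
    import java.util.List;

    public class BatchedWriter {

        static class DataPoint {      // minimal value holder for the sketch
            long itemId;
            Timestamp ts;
            double value;
        }

        // Insert up to ~100 points in one transaction; on a primary key collision
        // (SQLState 23505) roll back and silently drop the whole bundle.
        static void writeBundle(Connection conn, List<DataPoint> bundle) throws SQLException {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO perf_raw_2012_01_24 (item_id, ts, value) VALUES (?, ?, ?)")) {
                for (DataPoint p : bundle) {
                    ps.setLong(1, p.itemId);
                    ps.setTimestamp(2, p.ts);
                    ps.setDouble(3, p.value);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();        // one commit per bundle instead of per row
            } catch (SQLException e) {
                conn.rollback();
                // pgJDBC wraps the server error in a BatchUpdateException; the real
                // SQLState is usually on the chained exception.
                SQLException cause = (e.getNextException() != null) ? e.getNextException() : e;
                if (!"23505".equals(cause.getSQLState())) {
                    throw e;          // only swallow duplicate-key failures
                }
            } finally {
                conn.setAutoCommit(true);
            }
        }
    }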
Storage setup varies by customer, but a typical setup is to take RAID groups of about 5-10TB each (net) from something like an EMC Clariion and slice each group into 1TB LUNs, which become VMware datastores written to simultaneously from multiple hosts. A mid-size Clariion would host perhaps 50-100 of these small LUNs. A customer running a high performance environment might have Fibre Channel disks and RAID-10, but SATA and RAID-5/6 would also be normal, albeit with a substantial write-back cache (maybe 1GB; IIRC a current Clariion SP has 4GB total). Each file on the datastore corresponds to a virtual disk on a VM, and the datastore is formatted with VMFS (a concurrent-writer filesystem that uses SCSI locking to control access to block allocation and directory entries).
The other type of VMware datastore works at the filesystem layer - instead of a shared SAN with iSCSI / FC-AL, the VMware hosts are all pointed at a shared NFS server directory. NetApp is the popular back end for this configuration.
On top of this virtualization, I have PG laid out on two virtual disks - WAL and log files are on the main system partition, and index and table data are on a second partition. Both are formatted with ext3.
One of my larger customers had his SAN guy complain to him that our app was writing more data to the NetApp it was on than all the other apps combined, so I am mindful that the volume is more than some of these systems were planned for :)
Cheers
Dave
On Tue, Jan 24, 2012 at 3:09 PM, Tomas Vondra <tv@xxxxxxxx> wrote:
On 24 Leden 2012, 21:16, Dave Crooke wrote:
> Hi folks
>
> This could be a sheer volume issue, but I thought I would ask the wisdom of
> this forum as to next investigative steps.
>
> ----
>
> We use PostgreSQL 8.4.4 which is bundled with our application as a VMware
> virtual appliance. The bulk of the app's database activity is recording
> performance data points which arrive in fairly large sustained bursts of
> perhaps 10,000 rows a minute at a medium sized customer, each of which are
> logically separate items and being committed as individual transactions
> (JDBC auto-commit mode). Our offshore QA team was assigned to track an
> intermittent issue with speed of some large queries on other tables, and
> they believe based on correlation the two activities may be contending.
>
> The large query is coming off of different tables from the ones being
> written to ... the raw data goes into a table named by day (partitioning is
> all within the app, not PG) e.g. PERF_RAW_2012_01_24 and then there are a
> bunch of rollup statements which run hourly to do the aggregations, e.g.

Each storage device has some basic I/O limits - sequential speed (read/write)
and the maximum number of I/O operations it can handle. For example, a 7.2k
drive can do up to 160MB/s sequential reads/writes, but not more than 120
I/O ops per second. Similarly for other devices - 15k drives can do up to
250 I/Os. SSDs can handle many more I/Os, e.g. an Intel 320 can handle about
8k I/Os.

I have no idea what kind of storage device you're using and what amount of
sequential and random operations it can handle. But my guess is you're hitting
the limit of random I/Os - each commit requires an fsync, and you're doing
10,000 of them per minute, i.e. about 160 per second. If the queries need
to read data from the drive (e.g. randomly), this just adds more I/Os.

> Is there any tweaking we should do on the PG settings, or on the pattern in
> which the app is writing - we currently use 10 writer threads on the Java
> side and they keep PG going pretty good.

The first thing you should do is group the inserts into one transaction.
That'll lower the number of I/Os the database needs to do. Besides that,
you can move the WAL to a separate (physical) device, thus spreading the
I/Os across more drives.

> I considered bundling the writes into larger transactions, will that really
> help much with commit consistency off?

What do you mean by "commit consistency off"?

> Is there some specific "usual suspect" stuff I should look at on the PG
> side to look for efficiency issues such as index lock contention or a poor
> buffer cache hit ratio? Will doing EXPLAIN ANALYSE on the big query be
> informative, and if so, does it need to be done while the write load is
> applied?

The first thing you should do is gather some basic I/O stats.

Run pg_test_fsync (a contrib module) to see how many fsync operations the
I/O subsystem can handle (if it reports more than 500, use "-o" to get it
running for a longer time).

Then gather "vmstat 1" and "iostat -x 1" for a few seconds while the workload
(inserts and queries) is actually running. That should tell you how the
drives are actually utilized.

Post these results to this list.

> Relevant schema and config attached, all comments and advice welcome,
> including general tuning tips and rationale for moving to PG 9.x .... I'm
> well aware this isn't the acme of PG tuning :)

There's a nice page about tuning on the wiki:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

I'd recommend significantly increasing the number of checkpoint segments,
e.g. to 64 (1GB), and setting checkpoint_completion_target to 0.9. This
usually helps write-heavy workloads. And enable log_checkpoints.
Tomas