Re: a heavy duty operation on an "unused" table kills my server

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Sat, 16 Jan 2010 11:59:57 +0800

Eduardo Piombino wrote:

> I think pg is wasting resources, it could be very well taking advantage
> of, if you guys just tell me get better hardware. I mean ... the IO
> subsystem is obviously the bottleneck of my system. But most of the time
> it is on very very light load, actually ALL of the time, unless I do
> some heavy background processing like the original ALTER, or the
> procedure that updates 800.000 rows. What I would consider to be a great
> feature, would be able to tell pgsql, that a certain operation, is not
> time critical, so that it does not try to use all the available IO
> subsystem at once, but rather rationalize its use.

Rate-limiting (or preferably prioritizing) I/O from Pg would be nice.

It's already possible to prioritize I/O from Pg, though, albeit somewhat
clumsily:

  http://wiki.postgresql.org/wiki/Priorities

... as the OS provides I/O priority features. Pg shouldn't have to
re-implement those, only provide more convenient access to them.

( On Windows? Who knows. If you find out how to set I/O priorities on
Windows please extend that article! )

The trouble is that if you have a crappy RAID setup, the OS's I/O
priorities may be ineffective. The OS will do its best to prioritize
anything else over your don't-care-how-long-it-takes backend's query,
but if the RAID controller is queuing requests seconds-deep nothing the
OS does will make any real difference.

To my eternal frustration, there don't seem to be any RAID controllers
that have any concept of I/O priorities. I'd love Linux to be able to
submit requests to different queues within the controller depending on
priority, so low priority requests only got serviced if the
higher-priority queue was empty. AFAIK there isn't really anything like
that out there, though - all the RAID controllers seem to be built for
overall throughput at the expense of request latency to one extent or
another.

So ... your can prioritize I/O in the OS as much as you like, but your
RAID controller may merrily undermine all your work.

Doing it within Pg would suffer from many of the same issues. Pg has no
way to know how deeply the controller is queuing requests and when it's
actually finished a request, so it it's very hard for Pg to rate-limit
it's I/O effectively for low-priority work. It doesn't know how to
strike a balance between sending requests too fast (ruining latency for
higher priority work) and sending far too few (so taking forever for the
low priority work). What's insanely conservative on some hardware is
insanely too much to ask from other hardware. To be sure the controller
is done with a set of writes and ready for another, you'd have to
fsync() and that'd be murderous on performance, completely ruining any
benefits gained from pacing the work.

It's also complicated by the fact that Pg's architecture is very poorly
suited to prioritizing I/O based on query or process. (AFAIK) basically
all writes go through shared_buffers and the bgwriter - neither Pg nor
in fact the OS know what query or what backend created a given set of
block writes.

To be able to effectively prioritize I/O you'd really have to be able to
bypass the bgwriter, instead doing the writes direct from the low
priority backend after ionice()ing or otherwise setting up low OS-level
I/O priorities. Even then, RAID-controller level queuing and buffering
might land up giving most of the I/O bandwidth to the low priority work
anyway.

I guess some kind of dynamic rate-limiting could theoretically also
allow Pg to write at (say) 50% of the device's write capacity at any
given time, but the multiple layers of buffering and the dynamic load
changes in the system would make it incredibly hard to effectively
evaluate what the system's write capacity actually was. You'd probably
have to run a dedicated Pg benchmark to generate some parameters to
calibrate low priority write rates... but they'd still change depending
on the random vs seq I/O mix of other processes and Pg backends on the
system, the amount of fsync() activity, etc etc etc. It's a more
complicated (!) version of the problem of rate-limiting TCP/IP data sending.

( Actually, implementing something akin to TCP/IP connection rate
limiting for allocating I/O write bandwidth in low-priority connections
would be ... fascinating. I'm sure the people who write OS write
schedulers and priority systems like ionice have looked into it and
found reasons why it's not suitable. )

The point of all that rambling: it's not as easy as just adding query
priorities to Pg!

> responsible for making use of the IO subsystem (pg_bg_writer?), to use
> it in a moderately manner. That is ... This operation is not critical,
> please do not trash my system because it is not necessary. Use all the
> delays you would like, go slowly, please, I don't really care if you
> take a month.

Trouble is, that's a rather rare case. Usually you *do* care if it takes
a month vs a week, because you're worried about lock times.

> Or at least, be aware of current status of the IO system.
> If it is being busy, slow down, if it is free, speed up. Of course I
> won't care if it takes less time to complete.

There just isn't the visibility into the OS and hardware level to know
that. Alas. At best you can measure how long it takes for the OS to
return from an I/O request or fsync() ... but all the caching and
buffering and queuing means that bears little relationship to the
capacity of the system.

> Today, one can rationalize use of CPU, with a simple pg_sleep() call.
> It would be nice to have maybe an ALTER table option (for ALTERs) or an
> option in the BEGIN transaction command, that would let me say:
> BEGIN SLOW TRANSACTION;
> or BEGIN TRANSACTION RATIONALIZE_IO;
> indicating that all the IO operations that are going to be performed in
> this transaction, are not time critical, and thus, there is no need to
> put the system in risk of a IO storm, just for a silly set of updates,
> that no one is waiting for.

I'd love that myself - if it could be made to work fairly simply. I'm
not sure it can.

In reality it'd probably have to look more like:

BEGIN SLOW TRANSACTION WITH
  io_max_ops_persec = 5
  io_max_bytes_written_persec = 10000;

where those params would pretty much be "make it up and see what works"
stuff with a bit of benchmark guidance.

Maybe that'd still be useful. If so, you'd need to answer how to
separate such low-priority I/O out so the bgwriter could rate-limit it
separately, or how to bypass the bgwriter for such I/O.

--
Craig Ringer

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance