Eduardo Piombino wrote: > I think pg is wasting resources, it could be very well taking advantage > of, if you guys just tell me get better hardware. I mean ... the IO > subsystem is obviously the bottleneck of my system. But most of the time > it is on very very light load, actually ALL of the time, unless I do > some heavy background processing like the original ALTER, or the > procedure that updates 800.000 rows. What I would consider to be a great > feature, would be able to tell pgsql, that a certain operation, is not > time critical, so that it does not try to use all the available IO > subsystem at once, but rather rationalize its use. Rate-limiting (or preferably prioritizing) I/O from Pg would be nice. It's already possible to prioritize I/O from Pg, though, albeit somewhat clumsily: http://wiki.postgresql.org/wiki/Priorities ... as the OS provides I/O priority features. Pg shouldn't have to re-implement those, only provide more convenient access to them. ( On Windows? Who knows. If you find out how to set I/O priorities on Windows please extend that article! ) The trouble is that if you have a crappy RAID setup, the OS's I/O priorities may be ineffective. The OS will do its best to prioritize anything else over your don't-care-how-long-it-takes backend's query, but if the RAID controller is queuing requests seconds-deep nothing the OS does will make any real difference. To my eternal frustration, there don't seem to be any RAID controllers that have any concept of I/O priorities. I'd love Linux to be able to submit requests to different queues within the controller depending on priority, so low priority requests only got serviced if the higher-priority queue was empty. AFAIK there isn't really anything like that out there, though - all the RAID controllers seem to be built for overall throughput at the expense of request latency to one extent or another. So ... your can prioritize I/O in the OS as much as you like, but your RAID controller may merrily undermine all your work. Doing it within Pg would suffer from many of the same issues. Pg has no way to know how deeply the controller is queuing requests and when it's actually finished a request, so it it's very hard for Pg to rate-limit it's I/O effectively for low-priority work. It doesn't know how to strike a balance between sending requests too fast (ruining latency for higher priority work) and sending far too few (so taking forever for the low priority work). What's insanely conservative on some hardware is insanely too much to ask from other hardware. To be sure the controller is done with a set of writes and ready for another, you'd have to fsync() and that'd be murderous on performance, completely ruining any benefits gained from pacing the work. It's also complicated by the fact that Pg's architecture is very poorly suited to prioritizing I/O based on query or process. (AFAIK) basically all writes go through shared_buffers and the bgwriter - neither Pg nor in fact the OS know what query or what backend created a given set of block writes. To be able to effectively prioritize I/O you'd really have to be able to bypass the bgwriter, instead doing the writes direct from the low priority backend after ionice()ing or otherwise setting up low OS-level I/O priorities. Even then, RAID-controller level queuing and buffering might land up giving most of the I/O bandwidth to the low priority work anyway. I guess some kind of dynamic rate-limiting could theoretically also allow Pg to write at (say) 50% of the device's write capacity at any given time, but the multiple layers of buffering and the dynamic load changes in the system would make it incredibly hard to effectively evaluate what the system's write capacity actually was. You'd probably have to run a dedicated Pg benchmark to generate some parameters to calibrate low priority write rates... but they'd still change depending on the random vs seq I/O mix of other processes and Pg backends on the system, the amount of fsync() activity, etc etc etc. It's a more complicated (!) version of the problem of rate-limiting TCP/IP data sending. ( Actually, implementing something akin to TCP/IP connection rate limiting for allocating I/O write bandwidth in low-priority connections would be ... fascinating. I'm sure the people who write OS write schedulers and priority systems like ionice have looked into it and found reasons why it's not suitable. ) The point of all that rambling: it's not as easy as just adding query priorities to Pg! > responsible for making use of the IO subsystem (pg_bg_writer?), to use > it in a moderately manner. That is ... This operation is not critical, > please do not trash my system because it is not necessary. Use all the > delays you would like, go slowly, please, I don't really care if you > take a month. Trouble is, that's a rather rare case. Usually you *do* care if it takes a month vs a week, because you're worried about lock times. > Or at least, be aware of current status of the IO system. > If it is being busy, slow down, if it is free, speed up. Of course I > won't care if it takes less time to complete. There just isn't the visibility into the OS and hardware level to know that. Alas. At best you can measure how long it takes for the OS to return from an I/O request or fsync() ... but all the caching and buffering and queuing means that bears little relationship to the capacity of the system. > Today, one can rationalize use of CPU, with a simple pg_sleep() call. > It would be nice to have maybe an ALTER table option (for ALTERs) or an > option in the BEGIN transaction command, that would let me say: > BEGIN SLOW TRANSACTION; > or BEGIN TRANSACTION RATIONALIZE_IO; > indicating that all the IO operations that are going to be performed in > this transaction, are not time critical, and thus, there is no need to > put the system in risk of a IO storm, just for a silly set of updates, > that no one is waiting for. I'd love that myself - if it could be made to work fairly simply. I'm not sure it can. In reality it'd probably have to look more like: BEGIN SLOW TRANSACTION WITH io_max_ops_persec = 5 io_max_bytes_written_persec = 10000; where those params would pretty much be "make it up and see what works" stuff with a bit of benchmark guidance. Maybe that'd still be useful. If so, you'd need to answer how to separate such low-priority I/O out so the bgwriter could rate-limit it separately, or how to bypass the bgwriter for such I/O. -- Craig Ringer -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance