Re: a heavy duty operation on an "unused" table kills my server

Craig Ringer <craig@xxxxxxxxxxxxxxxxxxxxx> · Wed, 13 Jan 2010 18:58:43 +0800

On 13/01/2010 3:03 PM, Eduardo Piombino wrote:
> One last question, this IO issue I'm facing, do you think it is just a
> matter of RAID configuration speed, or a matter of queue gluttony (and
> not leaving time for other processes to get into the IO queue in a
> reasonable time)?

Hard to say with the data provided. It's not *just* a matter of a slow 
array, but that might contribute.

Specifically, though, by "slow array" in this case I'm looking at 
latency rather than throughput, particularly read latency under heavy 
write load. Simple write throughput isn't really the issue, though bad 
write throughput can make it fall apart under a lighter load than it 
would otherwise.

High read latencies may not be caused by deep queuing, though that's one 
possible cause. A controller that prioritizes batching sequential writes 
efficiently over serving random reads would cause it too - though 
reducing its queue depth so it can't see as many writes to batch would help.

Let me stress, again, that if you have a decent RAID controller with a
battery-backed cache unit you can enable write caching and most of these
issues just go away. Using an array format with better read/write
concurrency, like RAID 10, may help as well.

Honestly, though, at this point you need to collect data on what the
system is actually doing, what's slowing it down and where. *Then* look
into how to address it. I can't advise you much on that as you're using
Windows, but there must be lots of info on optimising Windows I/O
latencies and throughput on the 'net...
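
As a rough, platform-neutral starting point for collecting that data, a small probe like the one below could be run once while the server is idle and again while the ALTER TABLE is running, to compare read latencies. It is only a sketch under assumptions: the function names are made up for illustration, the 8 kB block size simply mirrors PostgreSQL's default page size, and the file you point it at should be a large scratch file on the same array, not a live database file. Reads served from the OS cache will look unrealistically fast, so the file should be considerably larger than RAM for the numbers to mean much.

```python
import os
import random
import statistics
import time


def sample_read_latencies(path, block_size=8192, samples=200, seed=None):
    """Time `samples` random block-sized reads from `path`.

    Returns a list of per-read durations in seconds.
    """
    rng = random.Random(seed)
    size = os.path.getsize(path)
    if size < block_size:
        raise ValueError("file is smaller than one block")
    max_block = size // block_size - 1
    latencies = []
    # buffering=0 bypasses Python's own buffer; the OS cache may still
    # answer some of these reads without touching the disk.
    with open(path, "rb", buffering=0) as f:
        for _ in range(samples):
            offset = rng.randint(0, max_block) * block_size
            start = time.perf_counter()
            f.seek(offset)
            f.read(block_size)
            latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return median, 95th percentile and max latency in milliseconds."""
    ordered = sorted(latencies)
    p95_index = min(len(ordered) - 1, int(0.95 * len(ordered)))
    return {
        "median_ms": statistics.median(ordered) * 1000.0,
        "p95_ms": ordered[p95_index] * 1000.0,
        "max_ms": ordered[-1] * 1000.0,
    }
```

If the 95th percentile and maximum jump by orders of magnitude during the ALTER while the median barely moves, that would be consistent with the read-latency-under-write-load guess; it would not prove it.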

> Because if it was just a matter of speed, ok, with my actual RAID
> configuration let's say it takes 10 minutes to process the ALTER TABLE
> (leaving no space to other IOs until the ALTER TABLE is done), let's say
> then I put the fastest possible RAID setup, or even remove RAID for the
> sake of speed, and it completes in let's say again, 10 seconds (an unreal
> assumption). But if my table now grows 60 times, I would be facing the
> very same problem again, even with the best RAID configuration.

Only if the issue is one of pure write throughput. I don't think it is. 
You don't care how long the ALTER takes, only how much it impacts other 
users. Reducing the impact on other users so your ALTER can complete in 
its own time without stamping all over other work is the idea.
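
To spell out the arithmetic in the quoted scenario: it only holds if the ALTER's duration scales linearly with table size and inversely with raw array speed, which is the pure-throughput assumption questioned above. A minimal sketch of that assumed model (the function name and the linear scaling are illustrative, not a measured property of any system):

```python
def alter_duration_seconds(base_seconds, speedup, growth):
    """Estimated ALTER TABLE duration under a purely linear throughput model.

    base_seconds: duration on the current array with the current table
    speedup:      how many times faster the new array is (assumed)
    growth:       how many times larger the table has become (assumed)
    """
    return base_seconds * growth / speedup


# 10 minutes today, on an array assumed to be 60x faster: 10 seconds.
print(alter_duration_seconds(600, speedup=60, growth=1))   # 10.0
# Same fast array, table 60x larger: back to 10 minutes.
print(alter_duration_seconds(600, speedup=60, growth=60))  # 600.0
```

The model says nothing about how long other sessions wait for their reads in the meantime, which is the quantity that actually matters here.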

> The problem would seem to be in the way the OS (or hardware, or someone
> else, or all of them) is/are inserting the IO requests into the queue.

It *might* be. There's just not enough information to tell that yet. 
You'll need to do quite a bit more monitoring. I don't have the 
expertise to advise you on what to do and how to do it under Windows.

> What can I do to control the order in which these IO requests are
> finally entered into the queue?

No idea. You probably need to look into I/O priorities on Windows.

Ideally you shouldn't have to, though. If you can keep read latencies at 
sane levels under high write load on your array, you don't *need* to 
mess with this.

Note that I'm still guessing about the issue being high read latencies 
under write load. It fits what you describe, but there isn't enough data 
to be sure, and I don't know how to collect it on Windows.

> What cards do I have to manipulate the order the IO requests are entered
> into the "queue"?
> Can I disable this queue?
> Should I turn disk's IO operation caches off?
> Not use some specific disk/RAID vendor, for instance?

Don't know. Contact your RAID card tech support, Google, search MSDN, etc.

--
Craig Ringer

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)