On Sun, Aug 30, 2009 at 7:38 PM, Greg Stark <gsstark@xxxxxxx> wrote:
> On Sun, Aug 30, 2009 at 11:56 PM, Merlin Moncure <mmoncure@xxxxxxxxx> wrote:
>> 192k written
>> raid 10: six writes
>> raid 5: four writes, one read (but the read and one of the writes is
>> same physical location)
>>
>> now, by 'same physical' location, that may mean that the drive head
>> has to move if the data is not in cache.
>>
>> I realize that many raid 5 implementations tend to suck. That said,
>> raid 5 should offer higher theoretical performance for writing than
>> raid 10, both for sequential and random.
>
> In the above there are two problems.
>
> 1) 192kB is not a random access pattern. Any time you're writing a
> whole raid stripe or more, RAID5 can start performing reasonably,
> but that's not random, that's sequential i/o. The relevant random
> i/o pattern is writing 8kB chunks at random offsets into
> multi-terabyte storage which doesn't fit in cache.
>
> 2) It's not clear, but I think you're saying "but the read and one
> of the writes is same physical location" on the basis that this
> mitigates the costs. In fact it's the worst case. It means that
> after doing the read and calculating the parity block, the drive
> must spin a full rotation before being able to write it back out.
> So instead of an average latency of 1/2 of a rotation you have that
> plus a full rotation, or 3x as much latency before the write can be
> performed as without raid5.
>
> It's not a fault of the implementations; it's a fundamental problem
> with RAID5. Even a spectacular implementation of RAID5 will be
> awful for random access writes. The only saving grace some hardware
> implementations have is huge amounts of battery-backed cache, which
> means they can usually buffer all the writes for long enough that
> the access patterns no longer look random. If you buffer enough
> then you can hope you'll eventually overwrite the whole stripe and
> can write out the new parity without reading the old data. Or,
> failing that, you can perform the reads of the old data when it's
> convenient because you're reading nearby data, effectively turning
> it into sequential i/o.

I agree; that's good analysis. The main point I was making is that if
you have, say, a ten-disk raid 5, a random write doesn't involve all
ten disks, only two... a very common misconception. I also made
another mistake that you didn't catch: you need to read *both* the
data drive and the parity drive before writing, not just the parity
drive.

I wonder if flash SSDs are a better fit for raid 5, since reads are
much cheaper than writes and there is no rotational latency. (Also,
$/GB is different, and so are the failure cases.)

merlin
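
P.S. To make the arithmetic concrete, here is a rough Python sketch of
the cost model above for a single random 8kB write. The 7200 rpm
figure, the assumption that paired reads/writes land on different
spindles in parallel, and the omission of seek time and controller
cache are all mine; treat it as illustrative, not a benchmark:

# Rough model of one random 8kB write; ignores seek time and any
# controller caching. 7200 rpm is an assumed, illustrative figure.
ROTATION_MS = 60000.0 / 7200             # one full rotation: ~8.3 ms
AVG_ROT_LATENCY_MS = ROTATION_MS / 2     # avg wait for target sector

# RAID10: write the block to both disks of a mirror pair. The two
# writes hit different spindles, so they proceed in parallel.
raid10_ios = 2                           # 2 writes
raid10_latency_ms = AVG_ROT_LATENCY_MS   # ~4.2 ms

# RAID5 read-modify-write: read old data and old parity (on two
# different spindles, in parallel), XOR in the new data, then write
# both blocks back. Each write targets the sector just read, so the
# platter must complete a full extra rotation before the head can
# write it.
raid5_ios = 2 + 2                        # 2 reads + 2 writes
raid5_latency_ms = AVG_ROT_LATENCY_MS + ROTATION_MS   # ~12.5 ms

print("raid10: %d i/os, ~%.1f ms" % (raid10_ios, raid10_latency_ms))
print("raid5:  %d i/os, ~%.1f ms (%.0fx the latency)"
      % (raid5_ios, raid5_latency_ms,
         raid5_latency_ms / raid10_latency_ms))

On those assumptions, raid 5 costs twice the physical i/os and roughly
3x the rotational latency per random write, which is Greg's point (2).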