On 28 Feb 08, at 2256, Kenneth Marshall wrote:

> It may be that the software RAID 5 is your problem. Without the
> use of NVRAM for a cache, all of the writes need all 3 disks.
> That will cause quite a bottle-neck.

In general, RAID5 writes require two reads and two writes, independent
of the size of the RAID5 assemblage. To write a given block, you read
the previous contents of the block you are updating and the associated
parity block. You XOR the previous contents with the parity, thus
stripping it out, and then XOR the new contents in. You then write the
new contents to the data block and the updated parity to the parity
block:

    New Parity = Old Parity xor Old Contents xor New Contents

In the absence of NVRAM this requires precisely four disk operations,
two reads followed by two writes.

A naive implementation would, as you imply, use all the spindles. It
would read the contents of the stripe from the spindles not directly
involved in the update, compute the new parity block, and then write
the data block and the new parity. For an N-disk RAID5 assemblage
that's N-2 reads followed by 2 writes: N operations in total. (A short
sketch of both strategies is appended at the end of this message.)

Now, as it happens, for the pathological case of a 3-disk RAID5
assemblage the naive implementation (3 operations) is better than the
more standard read-modify-write implementation (4 operations). I don't
know whether any real-world code is optimised for this corner case,
and I would doubt it: software RAID5 is a performance disaster area at
the best of times unless it can take advantage of intimate knowledge
of the intent log in the filesystem (RAID-Z does this), and three-disk
RAID5 assemblages are a performance disaster area irrespective of
hardware in a failure scenario. The rebuild will involve taking 50% of
the IO bandwidth of the two remaining disks in order to saturate the
new target; rebuild performance --- contrary to intuition --- improves
with larger assemblages, as you can saturate the replacement disk with
less and less of the bandwidth of the surviving spindles.

For a terabyte, 3x500GB SATA drives in a RAID5 group will be blown out
of the water by 4x500GB SATA drives in a RAID 0+1 configuration in
terms of performance and (especially) latency, particularly if the
implementation can do the Solaris trick of not faulting an entire
RAID 0 sub-group when one spindle fails. Rebuild still isn't pretty,
mind you. (A quick capacity/op-count comparison is also appended
below.)

ian
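
A minimal Python sketch of the op-counting above, using toy in-memory
blocks and merely counting the disk operations each strategy would
issue; the function names are illustrative only, not taken from any
real RAID implementation:

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(old_data: bytes, old_parity: bytes, new_data: bytes):
    """Standard small write: 2 reads (old data, old parity) + 2 writes.

    New Parity = Old Parity xor Old Contents xor New Contents
    """
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    return new_parity, 2 + 2            # two reads followed by two writes

def reconstruct_write(other_data, new_data: bytes):
    """Naive small write: read the N-2 untouched data blocks, recompute
    the parity from scratch, then write data and parity: N operations."""
    new_parity = new_data
    for block in other_data:
        new_parity = xor_blocks(new_parity, block)
    return new_parity, len(other_data) + 2

old_data, new_data = b"\x0f" * 4, b"\xf0" * 4
other = [b"\x55" * 4]                   # 3-disk assemblage: one other data block
old_parity = xor_blocks(old_data, other[0])

p1, ops1 = read_modify_write(old_data, old_parity, new_data)
p2, ops2 = reconstruct_write(other, new_data)
assert p1 == p2                         # both strategies agree on the parity
print("read-modify-write:", ops1, "ops; reconstruct-write:", ops2, "ops")

On the 3-disk case this prints 4 ops versus 3, which is the corner
case mentioned above; at 4 disks the two strategies tie, and beyond
that read-modify-write wins.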
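
And a back-of-the-envelope sketch of the comparison in the last
paragraph. The op counts assume no NVRAM write cache, mirrored pairs
for the RAID 0+1 case, and ignore read traffic; the helper functions
and numbers are mine, purely for illustration:

def raid5(disks, size_gb):
    usable = (disks - 1) * size_gb      # one disk's worth goes to parity
    return usable, 4                    # 2 reads + 2 writes per small write

def raid01(disks, size_gb):
    usable = (disks // 2) * size_gb     # half the spindles hold mirror copies
    return usable, 2                    # write the block to both mirror sides

for name, (usable, ops) in (("3x500GB RAID5", raid5(3, 500)),
                            ("4x500GB RAID 0+1", raid01(4, 500))):
    print(name + ":", usable, "GB usable,", ops, "disk ops per small write")

Both layouts give the same usable terabyte, but the RAID5 write path
also serialises the two reads before the two writes, which is where
the extra latency comes from.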