On 18.08.2012 09:09, Stan Hoeppner wrote:
[]
>>>> Output from iostat over the period in which the 4K write was done. Look
>>>> at kB read and kB written:
>>>>
>>>> Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>>>> sdb1      0.60         0.00         1.60          0          8
>>>> sdc1      0.60         0.80         0.80          4          4
>>>> sdd1      0.60         0.00         1.60          0          8
>>>>
>>>> As you can see, a single 4K read, and a few writes. You see a few blocks
>>>> more written than you'd expect because the superblock is updated too.
>>>
>>> I'm no dd expert, but this looks like you're simply writing a 4KB block
>>> to a new stripe, using an offset, but not to an existing stripe, as the
>>> array is in a virgin state. So it doesn't appear this test is going to
>>> trigger RMW. Don't you now need to do another write in the same
>>> stripe to trigger RMW? Maybe I'm just reading this wrong.

What are a "new stripe" and an "existing stripe"? For md raid, all
stripes are equally existing as long as they fall within the device
boundaries, and the rest are non-existing (outside of the device).
Unlike on an SSD, for example, there's no distinction between places
already written and "fresh", unwritten areas: all are treated exactly
the same way.

>> That shouldn't matter, but that is easily checked of course, by writing
>> some random data first, then doing the dd 4K write also with
>> random data somewhere in the same area:
>>
>> # dd if=/dev/urandom bs=1M count=3 of=/dev/md0
>> 3+0 records in
>> 3+0 records out
>> 3145728 bytes (3.1 MB) copied, 0.794494 s, 4.0 MB/s
>>
>> Now the first 6 chunks are filled with random data, let's write 4K
>> somewhere in there:
>>
>> # dd if=/dev/urandom bs=4k count=1 seek=25 of=/dev/md0
>> 1+0 records in
>> 1+0 records out
>> 4096 bytes (4.1 kB) copied, 0.10149 s, 40.4 kB/s
>>
>> Output from iostat over the period in which the 4K write was done:
>>
>> Device:    tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
>> sdb1      0.60         0.00         1.60          0          8
>> sdc1      0.60         0.80         0.80          4          4
>> sdd1      0.60         0.00         1.60          0          8
>
> According to your iostat output, the IO is identical for both tests. So
> either you triggered an RMW in the first test, or you haven't triggered
> an RMW with either test. Your first test shouldn't have triggered RMW.
> The second one should have.

Both tests did exactly the same thing, since in both cases the I/O
requests were the same, and md treats all areas (written and not yet
written) the same.

In this test, there IS an RMW cycle, and the numbers show it clearly.
I'm not sure why md wrote 8 KB to sdb and sdd, and why it wrote the
"extra" 4 KB to sdc. Maybe it is the metadata/superblock update. But it
clearly read data from sdc and wrote new data to all drives.

Assuming that every drive received a 4 KB metadata write and excluding
those, we are left with 4 KB written to sdb, 4 KB read from sdc and
4 KB written to sdd. That is a clear RMW: suppose our new 4 KB went to
sdb, sdc is the second data disk for this location, and sdd holds the
parity. It all works out nicely.

Overall, in order to update parity for a small write, there's no need
to read and rewrite the whole stripe; a small read plus a small write
is sufficient.

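As a rough sketch of the arithmetic behind those iostat numbers (this is not md code; the in-memory "disks", the variable names and the `xor_blocks` helper are all made up for illustration, and the metadata writes are left out): with two data disks and one parity disk, the new parity is simply the XOR of the new block and the one block read from the other data disk.

```python
import os

BLOCK = 4096  # size of the small write, in bytes


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-sized blocks byte by byte."""
    assert len(a) == len(b)
    return bytes(x ^ y for x, y in zip(a, b))


# Hypothetical in-memory stand-ins for one 4 KB slice of a stripe on a
# 3-disk RAID5: two data blocks and their parity.
old_sdb = os.urandom(BLOCK)          # data block about to be replaced
sdc = os.urandom(BLOCK)              # the other data block in the stripe
old_sdd = xor_blocks(old_sdb, sdc)   # parity before the write

# The small write: 4 KB of new data destined for sdb.
new_sdb = os.urandom(BLOCK)

# One 4 KB read (sdc) is enough to compute the new parity ...
new_sdd = xor_blocks(new_sdb, sdc)

# ... followed by two 4 KB writes: new data to sdb, new parity to sdd.
kb_read = {"sdb": 0, "sdc": BLOCK // 1024, "sdd": 0}
kb_written = {"sdb": BLOCK // 1024, "sdc": 0, "sdd": BLOCK // 1024}

# The new parity still lets either data block be rebuilt from the other.
assert xor_blocks(new_sdd, sdc) == new_sdb
assert xor_blocks(new_sdd, new_sdb) == sdc
```

The `kb_read` and `kb_written` tallies are the data and parity traffic only; they match the iostat table once the assumed 4 KB metadata write per drive is subtracted.
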
There are, however, 2 variants of RMW possible, and one can be chosen
over the other based on the number of drives, the amount of data being
written and the amount of data available in the cache. md can either
read the "missing" data blocks and calculate new parity from the new
blocks plus the "missing" ones it just read, or it can read the old
data and the parity block, subtract the data being replaced from the
parity (xor is nice for that), add the new data and write the new
parity back.

When you have an array with a large number of drives and you write
only a small amount, the second approach (reading the old data, which
might even be in the cache already, reading the parity block,
subtracting the old data and adding the new, then writing new data +
new parity) will be used much more often than reading from all the
other components. I guess.

So... a large chunk size is actually good, as it allows large I/Os in
one go. There's a tradeoff of course: the smaller the chunk size, the
better our chances of writing a full stripe without any RMW at all, but
at the same time the I/O size becomes very small too, which is
inefficient from the drive's point of view. So there's a balance, but I
guess on a realistically sized raid5 array (with a good number of
drives, like 5), the I/O size will likely be less than 256 KB (with
64 KB as the minimum realistic chunk size and 4 data drives), so
expecting full-stripe writes is not wise (unless it is streaming some
large data, in which case a 512 KB chunk size, resulting in 2 MB
stripes, will do just as well).

Also, large chunks may have a negative impact on alignment requirements
(i.e., it might be more difficult to fulfill the requirement), but that
is a different story.

Overall, I think 512 KB is quite a good chunk size, even for a raid5
array.

/mjt
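
To illustrate the two parity-update variants and the stripe-size arithmetic discussed above, here is a minimal sketch. It is not how md actually decides between them; the function names (`reconstruct_write`, `read_modify_write`, `stripe_kb`) are invented for this example, and the read counts assume nothing is in the cache.

```python
import os
from functools import reduce

BLOCK = 4096  # bytes per block in this toy model


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-sized blocks together."""
    return bytes(reduce(lambda x, y: x ^ y, col) for col in zip(*blocks))


def reconstruct_write(data, new):
    """Variant 1: read the data blocks that are NOT being overwritten.

    `data` is the list of current data blocks in the stripe, `new` maps
    data-disk index -> replacement block. Returns (new_parity, reads).
    """
    untouched = [blk for i, blk in enumerate(data) if i not in new]
    return xor_blocks(*new.values(), *untouched), len(untouched)


def read_modify_write(data, new, parity):
    """Variant 2: read old data + old parity, xor old out and new in."""
    old = [data[i] for i in new]
    return xor_blocks(parity, *old, *new.values()), len(old) + 1


def stripe_kb(chunk_kb: int, data_drives: int) -> int:
    """Full-stripe size: one chunk per data drive."""
    return chunk_kb * data_drives


def compare(n_data: int):
    """Overwrite one block on an n_data + 1 drive RAID5, both ways."""
    data = [os.urandom(BLOCK) for _ in range(n_data)]
    parity = xor_blocks(*data)
    new = {0: os.urandom(BLOCK)}
    p1, reads1 = reconstruct_write(data, new)
    p2, reads2 = read_modify_write(data, new, parity)
    assert p1 == p2  # both variants must yield the same parity
    return reads1, reads2


print(compare(2))  # 3-drive array: (reads for variant 1, variant 2)
print(compare(4))  # 5-drive array
print(stripe_kb(64, 4), stripe_kb(512, 4))  # stripe sizes in KB
```

In this model a single-block write on 3 drives costs 1 read with the first variant and 2 with the second, while on 5 drives it is 3 reads versus 2, which is why the second variant becomes more attractive as the drive count grows (and more so if the old data is already cached).
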