>>>>> Neil Brown (NB) writes:

 NB> The raid5 code attempts to do this already, though I'm not sure how
 NB> successful it is.  I think it is fairly successful, but not
 NB> completely successful.

hmm, could you tell me which code I should look at?

 NB> There is a trade-off that raid5 has to make.  Waiting longer can
 NB> mean more blocks on the same stripe, and so fewer reads.  But
 NB> waiting longer can also increase latency, which might not be good.

yes, I agree.

 NB> The thing to do would be to put some tracing in to find out exactly
 NB> what is happening for some sample workloads, and then see if
 NB> anything can be improved.

well, the simplest case I tried was this:

  mdadm -C /dev/md0 --level=5 --chunk=8 --raid-disks=3 ...

then open /dev/md0 with O_DIRECT and submit a 16K write. it ended up
doing a few writes and one read. the sequence was:

1) serving the first 4KB of the request -- the stripe is put onto the
   delayed list
2) serving the 2nd 4KB -- again onto the delayed list
3) serving the 3rd 4KB -- we get a fully uptodate stripe, time to
   compute the parity; 3 writes are issued for stripe #0
4) raid5_unplug_device() is called because of those 3 writes and it
   activates delayed stripe #4
5) raid5d() finds stripe #4 and issues a READ

I tend to think this isn't the most optimal way. couldn't we take the
current request into account somehow? something like "keep delayed
stripes off the queue as long as current requests are still being
served AND the stripe cache isn't full".

another similar case is when two processes write to very different
stripes and the low-level requests they make from handle_stripe()
cause delayed stripes to get activated.

thanks, Alex
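
for reference, a minimal userspace sketch of the test case above. it is
only an illustration, and it assumes /dev/md0 was created with the mdadm
line shown, that 4096-byte alignment satisfies the device's O_DIRECT
requirements, and that it is run as root (it overwrites the first 16KB
of the array):

/* open the array with O_DIRECT and submit one 16KB write at offset 0 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE (16 * 1024)   /* single 16KB request, as in the trace */
#define ALIGNMENT  4096          /* assumed O_DIRECT buffer alignment */

int main(void)
{
	void *buf;
	ssize_t ret;
	int fd, err;

	/* O_DIRECT needs an aligned buffer */
	err = posix_memalign(&buf, ALIGNMENT, WRITE_SIZE);
	if (err) {
		fprintf(stderr, "posix_memalign: %s\n", strerror(err));
		return 1;
	}
	memset(buf, 0xab, WRITE_SIZE);

	fd = open("/dev/md0", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open /dev/md0");
		return 1;
	}

	/* one 16KB write, matching the request described in the message */
	ret = write(fd, buf, WRITE_SIZE);
	if (ret != WRITE_SIZE) {
		perror("write");
		close(fd);
		return 1;
	}

	close(fd);
	free(buf);
	return 0;
}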