On Fri, 2006-03-24 at 15:38 +1100, Neil Brown wrote:
> On Thursday March 23, aizvorski@xxxxxxxxx wrote:
> > Neil - Thank you very much for the response.
> >
> > In my tests with identically configured raid0 and raid5 arrays,
> > raid5 initially had much lower throughput during reads. I had
> > assumed that was because raid5 did parity-checking all the time.
> > It turns out that raid5 throughput can get fairly close to raid0
> > throughput if /sys/block/md0/md/stripe_cache_size is set to a very
> > high value, 8192-16384. However the cpu load is still very much
> > higher during raid5 reads. I'm not sure why.
>
> Probably all the memcpys.
> For a raid5 read, the data is DMAed from the device into the
> stripe_cache, and then memcpy is used to move it to the filesystem
> (or other client) buffer. Worse: this memcpy happens on only one CPU,
> so a multiprocessor won't make it go any faster.
>
> It would be possible to bypass the stripe_cache for reads from a
> non-degraded array (I did it for 2.4) but it is somewhat more complex
> in 2.6 and I haven't attempted it yet (there have always been other
> more interesting things to do).
>
> To test if this is the problem you could probably just comment out
> the memcpy (the copy_data in handle_stripe) and see if the reads go
> faster. Obviously you will be getting garbage back, but it should
> give you a reasonably realistic measure of the cost.
>
> NeilBrown

Neil - Thank you again for the suggestion. I did as you said and commented out copy_data(), then ran a number of tests with the modified kernel. The results are in a spreadsheet-importable format at the end of this email (let me know if I should send them in some other way).

In short, disabling copy_data() takes a fairly consistent 20 percentage points off the CPU load under maximum-throughput conditions, which typically accounts for just over half the difference in CPU usage between raid0 and raid5, everything else being equal.

By the way, on the same machine memcpy() benchmarks at ~1GB/s, so if the data is being read at 200MB/s and copied once, that would be about 10% CPU load - perhaps the data actually gets copied twice? That would be consistent.

Anyway, it seems copy_data() is definitely part of the answer, but not the whole answer. In the case of 32MB chunks, something else uses up to 60% of the CPU time. Perhaps some kind of O(n^2) scalability issue in the stripe cache data structures? I'm not positive, but it seems the hit outside copy_data() is particularly large in situations where stripe_cache_active reports large numbers.

How hard would it be to bypass the stripe cache for reads? I would certainly lobby for you to work on that ;) since without it raid5 is only really suitable for database-type workloads, not multimedia-type workloads (again bearing in mind that a full-speed read by itself uses up an entire high-end CPU or more - you can understand why I thought it was calculating parity ;)).

I'll do what I can to help, of course. Let me know what other tests I can run.
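In case anyone wants to check the memcpy figure on their own hardware, a trivial userspace loop along the lines of the sketch below gives a ballpark number (this is only an illustration, not the exact benchmark I ran; the 64MB buffer and 16 iterations are arbitrary):

/* memcpy bandwidth estimate - userspace sketch only.
 * Copies a 64MB buffer 16 times (1GB total) and prints MB/s.
 * Dividing the array's read rate by this figure gives a rough
 * idea of how much of one CPU a single extra copy should cost.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define BUF_SIZE (64 * 1024 * 1024)
#define ITERS    16

int main(void)
{
	char *src = malloc(BUF_SIZE);
	char *dst = malloc(BUF_SIZE);
	struct timeval t0, t1;
	double secs, mb;
	int i;

	if (!src || !dst) {
		perror("malloc");
		return 1;
	}
	/* touch the pages so the timing isn't dominated by page faults */
	memset(src, 0xAA, BUF_SIZE);
	memset(dst, 0x55, BUF_SIZE);

	gettimeofday(&t0, NULL);
	for (i = 0; i < ITERS; i++)
		memcpy(dst, src, BUF_SIZE);
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	mb   = (double)BUF_SIZE * ITERS / (1024.0 * 1024.0);
	printf("memcpy: %.0f MB in %.2f s = %.0f MB/s\n", mb, secs, mb / secs);

	free(src);
	free(dst);
	return 0;
}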
Regards,
--Alex

"raid level"|"num disks"|"chunk size, kB"|"copy_data disabled"|"stripe cache size"|"block read size, MB"|"num concurrent reads"|"throughput, MB/s"|"cpu load, %"
raid5|8|64|N|8192|8|14|186|35
raid0|7|64|-|-|8|14|243|7
raid5|8|64|N|8192|256|1|215|38
raid0|7|64|-|-|256|1|272|7
raid5|8|256|Y|8192|8|14|201|17
raid5|8|256|N|8192|8|14|200|40
raid0|7|256|-|-|8|14|241|4
raid5|8|256|Y|8192|256|1|221|17
raid5|8|256|N|8192|256|1|218|40
raid0|7|256|-|-|256|1|260|6
raid5|8|1024|Y|8192|8|14|207|20
raid5|8|1024|N|8192|8|14|206|40
raid0|7|1024|-|-|8|14|243|5
raid5|8|32768|Y|16384|8|14|227|60
raid5|8|32768|N|16384|8|14|208|80
raid0|7|32768|-|-|8|14|244|15
raid5|8|32768|Y|16384|256|1|212|25
raid5|8|32768|N|16384|256|1|207|45
raid0|7|32768|-|-|256|1|217|10
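P.S. In case it saves anyone a lookup: the stripe cache sizes in the table were set through /sys/block/md0/md/stripe_cache_size. A plain echo from a shell is all it takes; the trivial program below does the same thing ("md0" and the value are only examples - substitute your own array and size):

/* stripe_cache_set.c - write a value into an md array's stripe_cache_size.
 * Equivalent to: echo 16384 > /sys/block/md0/md/stripe_cache_size
 * Needs root. Device name and value default to "md0" / "16384", which
 * are only examples; pass your own on the command line.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
	const char *dev = (argc > 1) ? argv[1] : "md0";
	const char *val = (argc > 2) ? argv[2] : "16384";
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/md/stripe_cache_size", dev);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return 1;
	}
	fprintf(f, "%s\n", val);
	if (fclose(f) != 0) {
		perror(path);
		return 1;
	}
	printf("%s = %s\n", path, val);
	return 0;
}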