> So the responsibility problem is solved here, right?

It is? I'm not sure yet.

> I mean, if there's no resync going on (the case with --assume-clean),
> the rest of the system works as expected, right?

Yes, but the array itself is still dog slow, and a resync shouldn't have
that much impact on performance.

What's more,

  mdadm --create /dev/md0 -l5 -n4 /dev/sd[bcde] -e 1.0

leaves room for tuning but is basically fine, whereas the original case

  mdadm --create /dev/md0 --verbose --metadata=1.0 --homehost=jesus -n4 -c1024 -l5 --bitmap=internal --name tb-storage -ayes /dev/sd[bcde]

is all but unusable, which leaves two prime suspects:

- the bitmap
- the chunk size

Could it be that some cache or other somewhere in the I/O stack (probably
the controller itself) is too small for the 1MB chunks and the disks are
forced to work serially? The Promise has no RAM of course, but maybe it
does have small send / receive buffers.

On the host side the I/O schedulers are set to cfq, which is said to play
well with md-raid, but I can experiment with that as well.

> Note that mkfs now has to do 3x more work, too - since the device is
> 3x (for 4-drive raid5) larger.

Yes, but that just means there are more inode tables to write. It takes
longer, but the speed shouldn't change much.

> Ok. For now I don't see a problem (other than that there IS a problem
> somewhere - obviously). Interrupts are ok. System time (10.1%) in
> second case doesn't look right, but it was 8.1% before...

Too high? Too low?

> Only 2 guesses left.

I'm fine with guesses, thank you :-)
Of course a deus-ex-machina solution (deus == Neil) is nice, too :)

> First, try to disable bitmaps on the raid array

Maybe I did that by accident for the various vmstat data for different
RAID levels I posted previously. At least I forgot to explicitly specify
a bitmap for those tests (see above).

It's my understanding that the bitmap is a raid chunk level journal to
speed up recovery, correct?
Doing that reduces the window during which a second disk can die with
catastrophic consequences -> bitmaps are a good thing, especially on an
array where a full rebuild takes hours. Seeing as the primary purpose of
the raid5 is fault tolerance I could live with a performance penalty, but
why is it *that* slow?

If I put the bitmap on an external drive it will be a lot faster - but
what happens when the bitmap "goes away" (because that disk fails, isn't
accessible, etc.)? Is it goodbye array, or is the worst case a full resync?
How well is the external bitmap supported? (That same consideration kept
me from using external journals for ext3.)

> And second, the whole thing looks pretty much like a more general
> problem discussed here and elsewhere last few days. I mean handling
> of parallel reads and writes - when single write may stall reads
> for quite some time and vice versa.

Any thread names to recommend?

> I see it every day on disks without NCQ/TCQ [...] your disks
> and/or controllers (or the combination) don't even support NCQ

The array of old IDE disks on mixed no-name controllers does well enough,
and NCQ / ncq doesn't even show up in dmesg. Definitely something to
consider but probably not the root cause.

Back to testing ...

C.
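
P.S. On the chunk size suspect, a rough sketch of the arithmetic, nothing
authoritative. With -n4 -l5 each stripe holds three data chunks plus one
parity chunk, so -c1024 puts 3 MB of data in a full stripe, and writes that
don't cover a full stripe generally need extra reads to update parity. The
helper name full_stripe_kib is made up for illustration, and the 64K figure
is only an assumption about what the simpler create command ended up with
(mdadm --detail /dev/md0 shows the real chunk size).

```sh
#!/bin/sh
# Hypothetical helper: data carried by one full RAID5 stripe, in KiB.
# $1 = chunk size in KiB, $2 = number of member disks.
# One chunk per stripe holds parity, so only (disks - 1) chunks hold data.
full_stripe_kib() {
    printf '%s\n' $(( $1 * ($2 - 1) ))
}

# Chunk size from the slow create command (-c1024, 4 disks).
full_stripe_kib 1024 4

# ASSUMPTION: 64 KiB chunk for the create command without -c.
full_stripe_kib 64 4
```

If the assumed 64K is right, the slow array needs 16x more data per write
to fill a stripe than the fast one, which would fit the chunk size theory
but doesn't prove it.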