Michael Schwarz wrote:
More than ever, I am convinced that it is actually a hardware problem, but I am curious for the opinions of both of you on whether the "system" (meaning, I guess, the combination of usb-storage driver and raid) is really doing the best with what it has.
See below, but the short answer is there is probably room for improvement.
My last effort was to switch to a different computer. When I did, the dmesg log (unfortunately not preserved, although I should be able to recreate it) reported that one of the flash drives had bad blocks. Some part of the system eventually decided it was a "dead device" (I believe dmesg indicated that the scsi subsystem said so). The device (it happened to be /dev/sdc) was peremptorily dropped from the system. This appears to be what hung the raid system. (Why these messages never appeared on the other computer is beyond me; obviously there is some difference in how the actual USB controller reports errors, but, as I said, I've never studied USB drivers or hardware. In fact, once you get beyond UARTs you are getting sophisticated for me.) I've built an array of five known-good devices and so far it works swimmingly (at least on the hardware that was better at error reporting). So it seems to me that there is probably nothing actually wrong with the drivers or their interactions, and it leaves me asking only whether there should be some sort of improvement in error reporting/recovery up to userland. If I am right and the scsi system was marking a device as dead, shouldn't the userland read against the md device get an error instead of an indefinite hang?
Let me make sure I have this scenario right... one write process (dd or cp) hangs, but you can still access data on the array, so the devices (all of them?) are working. It would be useful at that point to see if /proc/mdstat shows one device as failed.
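For anyone wanting to script that check, here is a minimal sketch in Python. It relies only on the fact that md marks a failed member with "(F)" after its slot number in /proc/mdstat. The function name failed_members and the SAMPLE text are made up for illustration; the sample is not output from the machine discussed in this thread.

```python
import re

# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE = """\
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sde1[4] sdd1[3] sdc1[2](F) sdb1[1] sda1[0]
      7823360 blocks level 5, 64k chunk, algorithm 2 [5/4] [UU_UU]

unused devices: <none>
"""

def failed_members(mdstat_text):
    """Map each md array name to the list of members flagged (F)."""
    failed = {}
    for line in mdstat_text.splitlines():
        header = re.match(r"(md\w+)\s*:\s*(.*)", line)
        if not header:
            continue  # not an array status line
        array, rest = header.groups()
        failed[array] = []
        for token in rest.split():
            # Members look like "sdc1[2]" optionally followed by flags
            # such as "(F)" for faulty or "(S)" for spare.
            member = re.match(r"([^\[\s]+)\[\d+\]((?:\(\w\))*)$", token)
            if member and "(F)" in member.group(2):
                failed[array].append(member.group(1))
    return failed

print(failed_members(SAMPLE))
```

On a live system the same function could be fed the text read from /proc/mdstat while the write is hung. An empty list for the array at that moment would suggest md never saw the device failure.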
Assuming I have described the behavior correctly, I think there is still a problem in the driver or md somewhere. Hangs should time out and errors should be reported up; if this is caused by a lost write completion, I would hope that would be timed out and reported too. That's my read on it: these "just hangs" cases are probably undetected or mishandled errors, which should be passed up and reported to the application, or retried and completed, or handled in some better way than what you describe.
Bad hardware is a fact of life. If you feel like chasing this more, an understanding of what the hardware did wrong and what the kernel didn't do right would be helpful. Of course, the failure mode may be so rare and the fix so time-consuming that it won't get fixed, but it can get documented.
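Until the kernel side behaves that way, a test program can at least avoid waiting forever itself. The following is only a userland sketch in Python, not a description of anything md or usb-storage does: it runs the read in a daemon thread and gives up after a deadline. The names run_with_timeout and read_block are hypothetical. Note the limitation: a thread stuck inside the kernel stays stuck, so this only lets the caller notice the hang and report it.

```python
import os
import queue
import threading

def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread; raise TimeoutError if it is not done in time."""
    results = queue.Queue(maxsize=1)

    def worker():
        try:
            results.put(("ok", fn()))
        except Exception as exc:  # pass I/O errors up to the caller
            results.put(("error", exc))

    threading.Thread(target=worker, daemon=True).start()
    try:
        status, value = results.get(timeout=timeout_s)
    except queue.Empty:
        # The worker may still be blocked; we only stop waiting for it.
        raise TimeoutError(f"no completion after {timeout_s} s") from None
    if status == "error":
        raise value
    return value

def read_block(path, offset, size):
    """Read size bytes at offset from path (for example an md device node)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, size, offset)
    finally:
        os.close(fd)
```

A call such as run_with_timeout(lambda: read_block("/dev/md0", 0, 4096), 30.0) would then either return data, raise the OSError the kernel reported, or raise TimeoutError after 30 seconds. The device path and the 30 second limit are placeholders.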
Beyond this question which I leave to you (although I'd love to hear your answers/thoughts), I think we can safely say that the problem was hardware (even if hard to find). If either of you would like, I'd be happy to find time this week to recreate the error on my "better" PC and send that along. As for rolling a custom kernel with more message buffer, well, I'm going to be getting into a new device driver in the coming months, so a custom debug kernel is definitely in my future, but I'm not sure when. I must say, the kernel has become a much more complex beastie since 2.2.x! (Although it also appears to be improved and somewhat more organized -- but definitely MUCH larger!) Thank you both so much! I wouldn't even have diagnosed my hardware problem without your prompts. I'm very grateful. Let me know if you'd like those dmesg logs or if you'd just like to let it go!
-- bill davidsen <davidsen@xxxxxxx> CTO TMR Associates, Inc Doing interesting things with small computers since 1979