I'm going to hang on to the hardware. This is a pilot/demo that may lead
to development of a new device, and, if so, I'll be getting back into
device driver writing. Working this problem would be great practice for
that. So I will do it. The only problem is I don't know when! I believe I
can replicate the problem, so I'll find time (perhaps next weekend) to
capture the data of interest.

Mr. Stern: Where might I go for low-level programming information on USB
devices? I'm interested in registers/DMA/packet formats, etc. I've found
info on the USB protocol itself, but I haven't found info on devices.
Obviously I can dig through kernel source, but documents would be nice!

Again, if this is an unreasonable request for you to "do my homework,"
just say so! I won't be offended. I'm sure I can find it myself given
time, but if you happen to have some URLs handy, they'd be appreciated.

YET AGAIN thank you both! You've been of great help.

--
Michael Schwarz

> Michael Schwarz wrote:
>> More than ever, I am convinced that it is actually a hardware problem,
>> but I am curious for the opinions of both of you on whether the
>> "system" (meaning, I guess, the combination of usb-storage driver and
>> raid) is really doing the best with what it has.
>>
>
> See below, but the short answer is there is probably room for
> improvement.
>
>> My last effort was to switch to a different computer. When I did, I
>> got a message in the dmesg log (unfortunately not preserved, although
>> I should be able to recreate it) that one of the flash drives had bad
>> blocks. Some part of the system eventually decided it was a "dead
>> device" (I believe dmesg indicated the scsi subsystem said so). The
>> device (it happened to be /dev/sdc) was peremptorily dropped from the
>> system. This appears to be what hung the raid system.
>>
>> (Why these messages never appeared on the other computer is beyond me;
>> obviously some difference in how the actual USB controller reports
>> errors, but, as I said, I've never studied USB drivers or hardware. In
>> fact, once you get beyond the UARTs you are getting sophisticated to
>> me.)
>>
>> I've built an array of five known-good devices and so far it works
>> swimmingly (at least on the hardware that was better at error
>> reporting).
>>
>> So it seems to me that there is probably nothing actually wrong with
>> the drivers or their interactions, and it leaves me only asking if
>> there should be some sort of improvement in error reporting/recovery
>> up to userland.
>>
>> If I am right and the scsi system was marking a device as dead,
>> shouldn't the userland read against the md device get an error instead
>> of an indefinite hang?
>>
>
> Let me make sure I have this scenario right... one write process (dd or
> cp) hangs, but you can still access data on the array, so the devices
> (all of them?) are working. It would be useful at that point to see if
> /proc/mdstat shows one device as failed.
>
> Given that I have described the behavior, I would think that there is
> still a problem in the driver or md somewhere: hangs should time out,
> errors should be reported up, and if this is caused by a lost write
> completion, I would hope that would be timed out and reported. That's
> my read on it: these "just hangs" cases probably are undetected or
> mishandled errors which should be passed up and reported to the
> application, or retried and completed. Or handled in some better way
> than what you describe.
>
> Bad hardware is a fact of life. If you feel like chasing this more, an
> understanding of what the hardware did wrong and what the kernel didn't
> do right would be helpful. Of course the failure mode may be so rare,
> and the fix so time-consuming, that it won't get fixed, but it can get
> documented.
>
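The /proc/mdstat check suggested above can be made repeatable with a small
script, which may help when capturing data the next time the hang is
reproduced. This is only a sketch: `failed_members` is a hypothetical helper
(not an existing tool), and it assumes the usual /proc/mdstat layout, in which
md marks a faulty member with `(F)` right after its slot number.

```python
import re


def failed_members(mdstat_text):
    """Return {array: [faulty members]} parsed from /proc/mdstat text.

    Assumption: a faulty member appears as e.g. 'sdc1[1](F)' on the
    array's status line. Members carrying additional flags before
    the (F) marker are not handled by this sketch.
    """
    failed = {}
    for line in mdstat_text.splitlines():
        # Array status lines look like: "md0 : active raid5 sdb1[0] ..."
        m = re.match(r"^(md\S*)\s*:\s*(.*)$", line)
        if not m:
            continue
        name, rest = m.groups()
        bad = re.findall(r"([\w-]+)\[\d+\]\(F\)", rest)
        if bad:
            failed[name] = bad
    return failed


if __name__ == "__main__":
    # Reads the live status; only meaningful on a Linux box running md.
    with open("/proc/mdstat") as f:
        print(failed_members(f.read()) or "no members marked (F)")
```

Running this while the dd or cp is hung would show whether md itself has
marked the dropped device as failed, or whether the failure never reached
the md layer at all.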
>> Beyond this question, which I leave to you (although I'd love to hear
>> your answers/thoughts), I think we can safely say that the problem was
>> hardware (even if hard to find). If either of you would like, I'd be
>> happy to find time this week to recreate the error on my "better" PC
>> and send that along.
>>
>> As for rolling a custom kernel with more message buffer, well, I'm
>> going to be getting into a new device driver in the coming months, so
>> a custom debug kernel is definitely in my future, but I'm not sure
>> when.
>>
>> I must say, the kernel has become a much more complex beastie since
>> 2.2.x! (Although it also appears to be improved and somewhat more
>> organized -- but definitely MUCH larger!)
>>
>> Thank you both so much! I wouldn't even have diagnosed my hardware
>> problem without your prompts. I'm very grateful. Let me know if you'd
>> like those dmesg logs or if you'd just like to let it go!
>>
>>
> --
>
> bill davidsen <davidsen@xxxxxxx>
>   CTO TMR Associates, Inc
>   Doing interesting things with small computers since 1979
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html