Francois Barre wrote:
> This is a cross-post (sorry for that), but I don't know yet where the problem comes from.
Alas, we get similar reports about software RAID over SBP-2 now and then on linux1394-devel or -user. I very much suspect sbp2 to be the culprit. One person reported different results with different software RAID levels, but I am too lazy right now to dig the post out of the list archive.
Question to the linux-raid folks: Does md support combining disks that sit on different SCSI host adapters into the same RAID set?
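In other words, with hypothetical device names where /dev/sda and /dev/sdb hang off one FireWire controller and /dev/sdc and /dev/sdd off another, is something like this supposed to work?

	mdadm --create /dev/md0 --level=5 --raid-devices=4 \
	      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1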
> A. The setup:
> VIA EPIA 10k Nehemiah, OHCI controller with VIA chip, 4 SBP-2 attached 250 GB IDE drives
Are these drives' bridges based on a Prolific chip? If yes, check whether you can get a firmware update.
> Vanilla 2.6.15.1 kernel, mdadm 2.2, superblock 0.90, ohci1394+sbp2 built into the kernel (default params: serialize_io=1, ...), raid5 as a module.
I recommend building the FireWire drivers as modules. This enables you to unload and reload them, e.g. to recover from some failures or to try different parameters. However, whether they are statically linked or built as modules has no effect on reliability during data transfers.
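Once they are built as modules, a reload with different sbp2 parameters could look like this (a sketch; stop the array and unmount the file systems first):

	mdadm --stop /dev/md0           # stop the array so the disks are idle
	modprobe -r sbp2                # unload the SBP-2 driver
	modprobe sbp2 serialize_io=1    # reload it with the desired parameters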
> B. The tests:
> Test0: Creating a 4-drive raid5 with 1 drive missing, then copying the 4th drive's content onto the raid5, works great. Stress-testing with multiple drive copies (Test0 + various tests) seems to be OK: very responsive, absolutely no error.
> Test1: Building the raid5 from scratch with all 4 drives (i.e. none missing) causes a lot of 'sbp2: command abort' messages, which block io for seconds; then io starts again.
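For reference, the two tests presumably correspond to mdadm invocations along these lines (device names are hypothetical):

	# Test0: 4-drive raid5 created degraded, with one member missing
	mdadm --create /dev/md0 --level=5 --raid-devices=4 \
	      /dev/sda1 /dev/sdb1 /dev/sdc1 missing
	# Test1: the same array with all four members present
	mdadm --create /dev/md0 --level=5 --raid-devices=4 \
	      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

Note that the second form triggers an initial resync which accesses all four disks concurrently, while the degraded form does not, which may be why only Test1 saturates the bus.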
Are there any other suspicious messages from sbp2, ieee1394, or ohci1394?
> At the end of Test1, the raid5 is not created: one drive is set faulty.
>
> C. The questions:
> How could I run a paranoid/degraded-bandwidth mode? I tried playing with /proc/sys/dev/raid/speed_limit_max, reducing it to well below the highest bandwidth, but it did not have the expected behaviour: io runs at the highest bandwidth for seconds, then stops, then runs again at the highest rate, ...
What about sbp2's max_speed parameter?
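I.e., force a lower bus speed for all SBP-2 devices when loading the driver; if I remember the old sbp2 driver's parameter correctly, 0 = S100, 1 = S200, 2 = S400:

	modprobe -r sbp2
	modprobe sbp2 serialize_io=1 max_speed=0    # throttle everything to S100

Note also that the md limits under /proc/sys/dev/raid/ throttle only resync/recovery, not regular array io, as far as I know.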
> Is there a way to avoid write-back at the sbp2 level? I could not find any way to do so...
What do you mean by that?
> Which kernel version should I use instead? It seems like SCSI on 2.6.15.x is not really trustworthy; should I run 2.6.14.x?
"aborting sbp2 command" issues have been reported for quite a long time now. Especially for Linux 2.6, although 2.4's sbp2 isn't fundamentally different. I don't think 2.6.14.x would make a difference to 2.6.15.x with this particular problem.
BTW, I'm hoping to get some spare time in February to work on this particular problem. I have never used software RAID over sbp2 myself and don't intend to any time soon, but I see what I suspect to be the same type of failures with a 1394a disk and with a 1394b JBOD device (or hardware "R"AID-0).
In the case of my 1394a disk, the failures vanish either with serialize_io=1 (this was not required with an older kernel; I don't remember which one) or, curiously enough, with "gap count optimization". As I wrote an hour ago on linux1394-user, gap count optimization is a performance tuning of the FireWire bus and is not yet implemented in the kernel. You can apply it manually with

	echo p 0x00450000 | 1394commander

for a single external device, or

	echo p 0x004a0000 | 1394commander

if 4 external devices are daisy-chained. Run the command after all disks have been connected and switched on; otherwise it may inhibit access to newly added devices. www.linux1394.org has a link to 1394commander.
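In case you wonder about the magic numbers: "p 0x..." makes 1394commander send a PHY configuration packet. If I read the IEEE 1394-1995 packet format correctly, the quadlet decodes as

	quadlet    = 00 rrrrrr R T gggggg 0000000000000000
	0x00450000 -> R=0, T=1, gap_cnt = 0b000101 = 5
	0x004a0000 -> R=0, T=1, gap_cnt = 0b001010 = 10

where T=1 marks the gap_cnt field as valid and gap_cnt is the new gap count for the bus.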
--
Stefan Richter
-=====-=-==- ---= ====-
http://arcgraph.de/sr/