Francois Barre wrote:
> This is a cross-post (sorry for that), but I don't know yet where the problem comes from.
Alas, we get similar reports about software RAID over SBP-2 now and then on linux1394-devel or -user. I very much suspect sbp2 to be the culprit. One person reported different results with different software RAID levels, but I am too lazy right now to dig the post out of the list archive.
Question to the linux-raid folks: Does md support combining disks that sit on different SCSI host adapters into the same RAID set?
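In other words, with hypothetical device names where /dev/sda and /dev/sdb hang off one FireWire controller and /dev/sdc and /dev/sdd off another, is something like this supposed to work?

	mdadm --create /dev/md0 --level=5 --raid-devices=4 \
	      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1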
> A. The setup:
> VIA EPIA 10k Nehemiah, OHCI controller with VIA chip, 4 SBP-2 attached 250 GB IDE drives
Are these drives' bridges based on a Prolific chip? If yes, check whether you can get a firmware update.
> Vanilla 2.6.15.1 kernel, mdadm 2.2, superblock 0.90, ohci1394+sbp2 built into the kernel (default params: serialize_io=1, ...), raid5 as a module.
I recommend building the FireWire drivers as modules. This enables you to unload and reload them, e.g. to recover from some failures or to try different parameters. However, whether they are statically linked or built as modules has no effect on reliability during data transfers.
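Once they are built as modules, a reload with different sbp2 parameters could look like this (a sketch; stop the array and unmount the file systems first):

	mdadm --stop /dev/md0           # stop the array so the disks are idle
	modprobe -r sbp2                # unload the SBP-2 driver
	modprobe sbp2 serialize_io=1    # reload it with the desired parameters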
> B. The tests:
> Test0: Creating a 4-drive raid5 with 1 drive missing, then copying the 4th drive's content onto the raid5, works great. Stress-testing with multiple drive copies (Test0 + various tests) seems to be OK: very responsive, absolutely no error.
> Test1: Building the raid5 from scratch with all 4 drives (i.e. none missing) causes a lot of 'sbp2: command abort' messages, which block io for seconds; then io starts again.
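For reference, the two tests presumably correspond to mdadm invocations along these lines (device names are hypothetical):

	# Test0: 4-drive raid5 created degraded, with one member missing
	mdadm --create /dev/md0 --level=5 --raid-devices=4 \
	      /dev/sda1 /dev/sdb1 /dev/sdc1 missing
	# Test1: the same array with all four members present
	mdadm --create /dev/md0 --level=5 --raid-devices=4 \
	      /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

Note that the second form triggers an initial resync which accesses all four disks concurrently, while the degraded form does not, which may be why only Test1 saturates the bus.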
Are there any other suspicious messages from sbp2, ieee1394, or ohci1394?
> At the end of Test1, the raid5 is not created: one drive is set faulty.
>
> C. The questions:
> How could I run a paranoid/degraded-bandwidth mode? I tried playing with /proc/sys/dev/raid/speed_limit_max, reducing it to well below the highest bandwidth, but it did not have the expected behaviour: io runs at the highest bandwidth for seconds, then stops, then runs again at the highest rate, ...
What about sbp2's max_speed parameter?
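I.e., force a lower bus speed for all SBP-2 devices when loading the driver; if I remember the old sbp2 driver's parameter correctly, 0 = S100, 1 = S200, 2 = S400:

	modprobe -r sbp2
	modprobe sbp2 serialize_io=1 max_speed=0    # throttle everything to S100

Note also that the md limits under /proc/sys/dev/raid/ throttle only resync/recovery, not regular array io, as far as I know.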
> Is there a way to avoid write-back at the sbp2 level? I could not find any way to do so...
What do you mean by that?
> Which kernel version should I use instead? It seems like SCSI on 2.6.15.x is not really trustworthy; should I run 2.6.14.x?
"aborting sbp2 command" issues have been reported for quite a long time now. Especially for Linux 2.6, although 2.4's sbp2 isn't fundamentally different. I don't think 2.6.14.x would make a difference to 2.6.15.x with this particular problem.
BTW, I'm hoping to get some spare time in February to work on this particular problem. I have never used software RAID over sbp2 myself and don't intend to any time soon, but I see what I suspect to be the same type of failures with a 1394a disk and with a 1394b JBOD device (or hardware "R"AID-0).
In the case of my 1394a disk, the failures vanish either with serialize_io=1 (this was not required with an older kernel; I don't remember which one) or, curiously enough, with "gap count optimization". As I wrote an hour ago on linux1394-user, gap count optimization is a performance tuning of the FireWire bus and is not yet implemented in the kernel. You can apply it manually with

	echo p 0x00450000 | 1394commander

for a single external device, or

	echo p 0x004a0000 | 1394commander

if 4 external devices are daisy-chained. Run the command after all disks have been connected and switched on; otherwise it may inhibit access to newly added devices. www.linux1394.org has a link to 1394commander.
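In case you wonder about the magic numbers: "p 0x..." makes 1394commander send a PHY configuration packet. If I read the IEEE 1394-1995 packet format correctly, the quadlet decodes as

	quadlet    = 00 rrrrrr R T gggggg 0000000000000000
	0x00450000 -> R=0, T=1, gap_cnt = 0b000101 = 5
	0x004a0000 -> R=0, T=1, gap_cnt = 0b001010 = 10

where T=1 marks the gap_cnt field as valid and gap_cnt is the new gap count for the bus.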
--
Stefan Richter
-=====-=-==- ---= ====-
http://arcgraph.de/sr/