Re: [Bug 9734] New: I/O error when inserting a second firewire sata disk

Stefan Richter <stefanr@xxxxxxxxxxxxxxxxx> · Sat, 19 Jan 2008 14:20:19 +0100 (CET)

I did a quick and simple test now:

  1.) switch on 1st disk (sdd)

Jan 19 13:46:18 stein sd 88:0:0:0: [sdd] Attached SCSI disk
Jan 19 13:46:18 stein sd 88:0:0:0: Attached scsi generic sg3 type 14

  2.) start "dd if=/dev/sdd of=/dev/null"

  3.) switch on 2nd disk (sde)

Jan 19 13:48:11 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871464
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983933
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983934
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983935
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983936
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983937
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983938
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983939
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983940
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983941
Jan 19 13:48:11 stein Buffer I/O error on device sdd, logical block 983942
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871712
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871720
Jan 19 13:48:11 stein sd 88:0:0:0: [sdd] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK,SUGGEST_OK
Jan 19 13:48:11 stein end_request: I/O error, dev sdd, sector 7871464
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:11 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:12 stein firewire_sbp2: failed to reconnect to fw3.0
Jan 19 13:48:12 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)
Jan 19 13:48:26 stein firewire_sbp2: orb reply timed out, rcode=0x11
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein scsi89 : SBP-2 IEEE-1394
Jan 19 13:48:27 stein firewire_core: created new fw device fw4 (6 config rom retries, S800)
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: logged in to fw4.0 LUN 0000 (0 retries)
Jan 19 13:48:27 stein scsi 89:0:0:0: Direct-Access-RBC HDS72404 0KLAT80          KFAO PQ: 0 ANSI: 4
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] 1562845488 512-byte hardware sectors (800177 MB)
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Write Protect is off
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Mode Sense: 11 00 00 00
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 19 13:48:27 stein sde: sde1
Jan 19 13:48:27 stein sd 89:0:0:0: [sde] Attached SCSI disk
Jan 19 13:48:27 stein sd 89:0:0:0: Attached scsi generic sg4 type 14
Jan 19 13:48:27 stein firewire_sbp2: error status: 0:9
Jan 19 13:48:27 stein firewire_sbp2: failed to reconnect to fw3.0
Jan 19 13:48:28 stein firewire_sbp2: logged in to fw3.0 LUN 0000 (0 retries)

At that point dd aborted:

dd: reading `/dev/sdd': Input/output error
7871464+0 records in
7871464+0 records out
4030189568 bytes (4.0 GB) copied, 57.6538 s, 69.9 MB/s

sdd was accessible again after the login at 13:48:28, though.  As you
can see, the SCSI stack did not take the disk offline after the I/O
errors.  Still, the desired behavior is that dd simply sits in I/O wait
state while the SBP-2 transport is busy, rather than failing with errors.

Both disks used in this test are based on the OXFW912 bridge.  The
"error status: 0:9" before reconnect is typical for them, at least as I
have set them up right now (1394b card, 1394b hub, disks on the hub).
The status means "request complete, function rejected".  It has always
been recoverable here when I switched on the 2nd disk with the 1st disk
connected but idle.

Perhaps fw-sbp2 should do what sbp2 does:  Put the scsi_device (in
sbp2: the Scsi_Host) into blocked state between the bus reset event and
successful reconnection.  It took us a while to get this right in sbp2
because blocking the host is prone to deadlocks.  That is why I was so
far content with fw-sbp2 not using that scheme.

However, the first thing I shall try is to insert a "return
SCSI_MLQUEUE_HOST_BUSY" early in sbp2_scsi_queuecommand, depending on a
check of the logical unit's generation.

A variation of this problem occurs when the bus reset and reconnect
phase happens while the SCSI stack is in the middle of INQUIRY or READ
CAPACITY.  During these commands, the SCSI stack is very quick to take
a device offline.  We need to do something there, because this
situation is not too unusual (e.g. when several FireWire devices are
being powered up together).

I will proceed to experiment with this as spare time permits.  Of course
any advice on how to best interact with the SCSI core while the
transport is busy would be appreciated.
-- 
Stefan Richter
-=====-==--- ---= =--==
http://arcgraph.de/sr/
