On Wed, 2021-09-15 at 13:55 -0400, Douglas Gilbert wrote: > On 2021-09-15 10:43 a.m., James Bottomley wrote: > > On Wed, 2021-09-15 at 08:04 -0500, wenxiong@xxxxxxxxxxxxxxxxxx > > wrote: > > > From: Wen Xiong <wenxiong@xxxxxxxxxxxxxxxxxx> > > > > > > Setting scsi logging level with error=3, we saw some errors from > > > enclosues: > > > > > > [108017.360833] ses 0:0:9:0: tag#641 Done: NEEDS_RETRY Result: > > > hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=0s > > > [108017.360838] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c > > > 01 01 > > > 00 20 00 > > > [108017.427778] ses 0:0:9:0: Power-on or device reset occurred > > > [108017.427784] ses 0:0:9:0: tag#641 Done: SUCCESS Result: > > > hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s > > > [108017.427788] ses 0:0:9:0: tag#641 CDB: Receive Diagnostic 1c > > > 01 01 > > > 00 20 00 > > > [108017.427791] ses 0:0:9:0: tag#641 Sense Key : Unit Attention > > > [current] > > > [108017.427793] ses 0:0:9:0: tag#641 Add. Sense: Bus device reset > > > function occurred > > > [108017.427801] ses 0:0:9:0: Failed to get diagnostic page 0x1 > > > [108017.427804] ses 0:0:9:0: Failed to bind enclosure -19 > > > [108017.427895] ses 0:0:10:0: Attached Enclosure device > > > [108017.427942] ses 0:0:10:0: Attached scsi generic sg18 type 13 > > > > > > As Martin's suggestion, the patch checks to retry on NOT_READY as > > > well as UNIT_ATTENTION with ASC 0x29. > > > > This looks fine to me. I think the reason expecting_cc_ua doesn't > > work for you is that you're getting > 1 reset per > > command. expecting_cc_ua automatically resets after eating the > > first unit attention. > > Rather that simply consuming UAs, what do you think of a fixed length > FIFO, say 8 entries, that holds the asc,ascq of the last 8 UAs > together with a timestamp of when it was received (with a boot time > epoch). Then allow the user space to see that FIFO via sysfs (e.g. > under /sys/class/scsi_device/<hctl>). Remembering the previous UA may > also be useful for the mid-level UA processing. For example after a > firmware upgrade, there may be UAs for both INQUIRY data change and > device reset. I don't really see how it would be useful. On the available debugging information (which is a bit scant) it looks like a single device problem (as usual) ... possibly not clearing the condition after the REQUEST_SENSE because I can't believe another reset went over the bus immediately after the first UA/CC was received. > Also the first device reset after a reboot (power cycle) is expected, > having one later, for example when part of a disk is mounted, is a > bit more suspicious. It's definitely hugely suspicious, but tying it to ASC/ASQC 0x29/0x00 should limit the fallout considerably. James