Re: sd 6:0:0:0: [sdb] Unaligned partial completion

James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> · Mon, 11 Jun 2018 14:40:51 -0700

On Mon, 2018-06-11 at 12:20 -0400, Douglas Gilbert wrote:
> On 2018-06-11 12:07 PM, Ted Cabeen wrote:
> > I'm seeing a similar behavior on my system, but across multiple
> > devices on a SAS drive array (front bays on a Supermicro-based
> > system with onboard mpt3sas card). 
> > The Sense Key here doesn't show a medium error, and the multiple-
> > drive behavior makes me think it's more likely either a controller
> > or cable problem. Interestingly, the issue only shows up under
> > heavy load (specifically a ZFS scrub).
> > 
> > During my next downtime window, I'm going to try to re-create the
> > problem while capturing a blktrace.  Any other things to try at
> > that time, or a filter-mask I should apply?
> > 
> > [Wed Jun  6 14:30:19 2018] blk_update_request: I/O error, dev sdn, sector 1757633640
> > [Wed Jun  6 14:37:10 2018] sd 15:0:5:0: unaligned partial completion avoided (xfer_cnt=3072, sector_sz=4096)
> > [Wed Jun  6 14:37:10 2018] sd 15:0:5:0: [sdr] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> > [Wed Jun  6 14:37:10 2018] sd 15:0:5:0: [sdr] Sense Key : Aborted Command [current] [descriptor]
> > [Wed Jun  6 14:37:10 2018] sd 15:0:5:0: [sdr] Add. Sense: Nak received
> > [Wed Jun  6 14:37:10 2018] sd 15:0:5:0: [sdr] CDB: Read(10) 28 00 07 8a c1 ca 00 00 01 00
> > [Wed Jun  6 14:37:10 2018] blk_update_request: I/O error, dev sdr, sector 1012272720
> > [Wed Jun  6 15:20:43 2018] sd 15:0:8:0: unaligned partial completion avoided (xfer_cnt=52224, sector_sz=4096)
> > [Wed Jun  6 15:20:43 2018] sd 15:0:8:0: [sdu] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> > [Wed Jun  6 15:20:43 2018] sd 15:0:8:0: [sdu] Sense Key : Aborted Command [current] [descriptor]
> > [Wed Jun  6 15:20:43 2018] sd 15:0:8:0: [sdu] Add. Sense: Nak received
> > [Wed Jun  6 15:20:43 2018] sd 15:0:8:0: [sdu] CDB: Read(10) 28 00 12 ab dc 52 00 00 19 00
> > [Wed Jun  6 15:20:43 2018] blk_update_request: I/O error, dev sdu, sector 2506023568
> > [Wed Jun  6 15:46:20 2018] sd 15:0:2:0: unaligned partial completion avoided (xfer_cnt=11264, sector_sz=4096)
> > [Wed Jun  6 15:46:20 2018] sd 15:0:2:0: [sdo] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> > [Wed Jun  6 15:46:20 2018] sd 15:0:2:0: [sdo] Sense Key : Aborted Command [current] [descriptor]
> > [Wed Jun  6 15:46:20 2018] sd 15:0:2:0: [sdo] Add. Sense: Nak received
> > [Wed Jun  6 15:46:20 2018] sd 15:0:2:0: [sdo] CDB: Read(10) 28 00 40 a8 ef b5 00 00 03 00
> > [Wed Jun  6 15:46:20 2018] blk_update_request: I/O error, dev sdo, sector 8678505896
> > 
> 
> I have also seen Aborted Command sense when doing heavy testing on
> one or more SAS disks behind a SAS expander. I put it down to a
> temporary lack of paths available (on the link between the host's HBA
> and the expander) when one of those SAS disks tries to get a
> connection back to the host with the data (data-in transfer) from an
> earlier READ command.
> 
> In my code (ddpt and sg_dd) I treat it as a "retry" type error and in
> my experience that works. IOW a follow-up READ with the same
> parameters is successful.

We do treat ABORTED_COMMAND as a retry.  However, each retry ticks down
the retry count (usually 3), and the command fails if the condition
still occurs once that count is used up.  How long does this condition
persist?  If it's long-lived, we could treat it as ADD_TO_MLQUEUE,
which would mean we'd retry until the timeout condition was reached.
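
To make the difference between the two policies concrete, here is a
rough user-space sketch in Python.  It is not the kernel's code: the
names bounded_retry, requeue_until_timeout and FlakyLink are made up
for illustration, and counting "one initial attempt plus 3 retries" is
an assumption about how the budget is spent.  It only models the idea
that a fixed retry count gives up after a few attempts, while
requeueing keeps trying until a deadline passes.

```python
from typing import Callable

MAX_RETRIES = 3  # "usually 3" per the message; illustrative constant only


class FlakyLink:
    """Hypothetical device model: the first `bad_attempts` reads are aborted."""

    def __init__(self, bad_attempts: int) -> None:
        self.bad_attempts = bad_attempts
        self.attempts = 0

    def read(self) -> bool:
        """Return True on success, False for an aborted command."""
        self.attempts += 1
        return self.attempts > self.bad_attempts


def bounded_retry(issue: Callable[[], bool],
                  max_retries: int = MAX_RETRIES) -> bool:
    """Issue once, then retry until the retry budget is used up."""
    for _ in range(1 + max_retries):  # assumption: initial attempt + retries
        if issue():
            return True
    return False


def requeue_until_timeout(issue: Callable[[], bool],
                          clock: Callable[[], float],
                          timeout: float) -> bool:
    """Keep reissuing the command until `timeout` time units have elapsed."""
    deadline = clock() + timeout
    while clock() < deadline:
        if issue():
            return True
    return False


if __name__ == "__main__":
    # A link that aborts the first 5 reads outlasts the bounded policy...
    link = FlakyLink(bad_attempts=5)
    print("bounded:", bounded_retry(link.read), "after", link.attempts)

    # ...but not the deadline-based one (fake clock: 1 time unit per attempt).
    link = FlakyLink(bad_attempts=5)
    ok = requeue_until_timeout(link.read, lambda: float(link.attempts), 30.0)
    print("requeue:", ok, "after", link.attempts)
```

Whether the second policy is the better one depends on the answer to
the question above: it only helps if the condition clears well within
the command timeout.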

James
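
For anyone cross-checking the quoted log, the numbers can be decoded
with a few lines of Python.  This is only a sketch: decode_read10 and
summarize are hypothetical helper names, the READ(10) layout used is
the standard one (opcode 0x28, 4-byte big-endian LBA in bytes 2-5,
2-byte transfer length in bytes 7-8), and the "aligned" test is a plain
modulo, which simplifies whatever the sd driver actually does.  It
assumes the sector number in the blk_update_request line is in 512-byte
units.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, blocks) from a READ(10) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is accepted
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, blocks


def summarize(cdb_hex: str, xfer_cnt: int, sector_sz: int) -> dict:
    """Relate a CDB to the xfer_cnt/sector_sz values from the log."""
    lba, blocks = decode_read10(cdb_hex)
    return {
        "lba": lba,
        "blocks": blocks,
        "sector_512": lba * sector_sz // 512,  # block-layer sector number
        "requested_bytes": blocks * sector_sz,
        "aligned": xfer_cnt % sector_sz == 0,  # simplified alignment test
    }


if __name__ == "__main__":
    # (CDB, xfer_cnt, sector_sz) triples copied from the quoted log.
    samples = [
        ("28 00 07 8a c1 ca 00 00 01 00", 3072, 4096),   # sdr
        ("28 00 12 ab dc 52 00 00 19 00", 52224, 4096),  # sdu
        ("28 00 40 a8 ef b5 00 00 03 00", 11264, 4096),  # sdo
    ]
    for cdb, xfer_cnt, sector_sz in samples:
        print(summarize(cdb, xfer_cnt, sector_sz))
```

With these inputs the computed sector_512 values come out as
1012272720, 2506023568 and 8678505896, matching the I/O error lines for
sdr, sdu and sdo.  In each case xfer_cnt is smaller than the requested
byte count and is not a multiple of the 4096-byte sector size.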