On Tue, Mar 7, 2017 at 3:58 AM, Jack Wang <xjtuwjp@xxxxxxxxx> wrote:
> 2017-03-07 1:54 GMT+01:00 V <viswesh.vichu@xxxxxxxxx>:
>> Hi,
>>
>> I am reaching out regarding a SCSI error handler issue which we see
>> in our lab systems.
>>
>> Environment: Ubuntu 4.4.0-31-generic
>>
>> Issue: Write IOs are getting stuck in our systems randomly. All drives
>> seen with the issue are SATA drives.
>> Root cause: the SCSI error handler is either not woken up, or it is
>> woken up but unable to flush commands, so the IOs stay stuck (the
>> requests are never flushed out and the host is not restarted).
>>
>> We have seen 6 instances so far.
>>
>> This is what we see:
>> 1) As part of testing our drives, we start a lot of reads/writes,
>> along with some user-space utilities that check drive health.
>> 2) In our recent testing, we see a lot of commands going through the
>> "abort scheduled" path. In between, for one of the commands, the
>> error handler is not woken up, and the majority of the writing
>> processes stall.
>> 3) We are still trying to figure out what is causing this many
>> aborts. Is it usual? Are there other debug options I could enable to
>> get more information? We know these are user-space commands that are
>> being aborted, but we are not sure which exact command it is yet.
>> 4) I see the "abort scheduled" messages on almost all drives in all
>> systems, hence I don't believe it is a drive issue.
>> 5) I checked the host_failed and host_busy variables; both are set to
>> 1 in system 1. The 2nd system is still alive; I haven't taken a crash
>> dump yet.
>> 6) All drives seen with the issue are SATA drives.
>> 7) I have attached udevadm info for a drive that failed in system 2.
>>
>> Thanks in advance,
>> Viswesh
>>
>>
>> Logs
>> -------
>> System 1
>>
>> [371546.605736] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
>> [371546.606727] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
>> [371546.607721] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
>> [371546.618113] sd 9:0:0:0: [sg19] tag#20 abort scheduled
>> [371546.624039] sd 9:0:0:0: [sg19] tag#20 aborting command
>> [371546.624045] sd 9:0:0:0: [sg19] tag#20 cmd abort failed
>> [371546.624051] scsi host9: Waking error handler thread
>> [371546.624060] scsi host9: scsi_eh_9: waking up 0/1/1
>> [371546.624081] sd 9:0:0:0: [sg19] tag#20 scsi_eh_9: flush finish cmd
>> [371546.624095] scsi host9: waking up host to restart
>> [371546.624098] scsi host9: scsi_eh_9: sleeping
>>
>>
>> [371546.635314] sd 8:0:0:0: [sg17] tag#13 abort scheduled
>> [371546.640031] sd 8:0:0:0: [sg17] tag#13 aborting command
>> [371546.640037] sd 8:0:0:0: [sg17] tag#13 cmd abort failed
>> [371546.640043] scsi host8: Waking error handler thread
>> [371546.640078] scsi host8: scsi_eh_8: waking up 0/1/1
>> [371546.640098] sd 8:0:0:0: [sg17] tag#13 scsi_eh_8: flush finish cmd
>> [371546.640113] scsi host8: waking up host to restart
>> [371546.640117] scsi host8: scsi_eh_8: sleeping
>>
>> [371546.657269] sd 6:0:0:0: [sg12] tag#1 abort scheduled
>> [371546.664034] sd 6:0:0:0: [sg12] tag#1 aborting command
>> [371546.664040] sd 6:0:0:0: [sg12] tag#1 cmd abort failed
>>
>> Here we can see that the error handler is not woken up, and the
>> entire IO subsystem goes for a toss.
>>
>> [371777.594510] INFO: task md2_raid1:508 blocked for more than 120 seconds.
>> [371777.594571] Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
>> [371777.594613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [371777.594665] md2_raid1 D ffff883fed2afc78 0 508 2 0x00000000
>> [371777.594673] ffff883fed2afc78 ffff881fef0e8000 ffff883fed21c4c0 ffff883fed2b0000
>> [371777.594678] ffff883ff236e298 ffff883ff236e000 ffff883ff236e018 ffff883ff236e018
>>
>> By enabling more SCSI logging (in a different system), including LL,
>> we get some more info (I have enabled the full SCSI log levels for
>> all categories and it is still running):
>>
>>
>> System 2
>>
>> [405197.171574] sd 5:0:0:0: [sg3] sg_write: count=88
>> [405197.171577] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
>> [405197.171581] sd 5:0:0:0: [sg3] sg_common_write: scsi opcode=0x85, cmd_size=16
>> [405197.171583] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=512
>> [405197.171605] sd 5:0:0:0: [sg3] sg_link_reserve: size=512
>> [405197.171623] sd 5:0:0:0: [sg3] sg_poll: res=0x104
>> [405197.172648] sd 5:0:0:0: [sg3] sg_cmd_done: pack_id=0, res=0x0
>> [405197.172701] sd 5:0:0:0: [sg3] sg_poll: res=0x145
>> [405197.172710] sd 5:0:0:0: [sg3] sg_read: count=88
>> [405197.172716] sd 5:0:0:0: [sg3] sg_finish_rem_req: res_used=1
>> [405197.172721] sd 5:0:0:0: [sg3] sg_unlink_reserve: req->k_use_sg=1
>> [405197.172756] sd 5:0:0:0: [sg3] sg_write: count=88
>> [405197.172760] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
>> [405197.172764] sd 5:0:0:0: [sg3] sg_common_write: scsi opcode=0x85, cmd_size=16
>> [405197.172767] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=512
>> [405197.172774] sd 5:0:0:0: [sg3] sg_link_reserve: size=512
>> [405197.172793] sd 5:0:0:0: [sg3] sg_poll: res=0x104
>> [405197.173806] sd 5:0:0:0: [sg3] sg_cmd_done: pack_id=0, res=0x0
>> [405197.173829] sd 5:0:0:0: [sg3] sg_poll: res=0x145
>> [405197.173836] sd 5:0:0:0: [sg3] sg_read: count=88
>> [405197.173839] sd 5:0:0:0: [sg3] sg_finish_rem_req: res_used=1
>> [405197.173843] sd 5:0:0:0: [sg3] sg_unlink_reserve: req->k_use_sg=1
>> [405197.173893] sd 5:0:0:0: [sg3] sg_write: count=88
>> [405197.173896] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
>> [405197.173900] sd 5:0:0:0: [sg3] sg_common_write: scsi opcode=0x85, cmd_size=16
>> [405197.173903] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=0
>> [405197.173921] sd 5:0:0:0: [sg3] sg_poll: res=0x104
>>
>> [405197.174082] sd 5:0:0:0: [sg3] tag#5 abort scheduled
>> [405197.179372] sd 5:0:0:0: [sg3] tag#5 aborting command
>> [405197.179378] sd 5:0:0:0: [sg3] tag#5 cmd abort failed
>>
>> [405257.225584] sd 5:0:0:0: [sg3] sg_poll: res=0x104
>> [405257.225599] sd 5:0:0:0: [sg3] sg_release
>> [405257.225617] sd 5:0:0:0: [sg3] sg_open: flags=0x8002
>>
>> The error handler thread is not woken up here, or at least I don't
>> see a flush command trace (even if it has already been woken up).
>>
>> [405349.082863] INFO: task md2_raid1:484 blocked for more than 120 seconds
>> [405349.082922] Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
>> [405349.082964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [405349.083016] md2_raid1 D ffff881feca2bc78 0 484 2 0x00000000
>
> Can you reproduce without md raid1? I suppose it shouldn't be related.

I believe so. It is not related.

> You mentioned it's only with SATA, are they all the same HDDs or from
> different vendors?
>

They are from two different vendors (SanDisk and HGST), though both are
now under the same umbrella. Is there anything evident from the logs? I
could collect some /proc entries if you need them.

> Regards,
> Jack
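P.S. For reference, my current mental model of why the thread might stay
asleep: as far as I can tell from reading the 4.4-era error handler code,
the wake-up only happens once every command still outstanding on the host
has been handed to error recovery, i.e. when host_busy == host_failed.
That is also why I compared host_failed and host_busy on system 1; both
read 1 there, so either the wake-up was missed or the thread is stuck
after waking. Below is a small user-space sketch of just that gate
(fake_host and maybe_wake_eh are made-up names for illustration, not the
kernel code):

    /*
     * Toy model of the error handler wake-up gate (my assumption from
     * reading drivers/scsi/scsi_error.c on 4.4; names are made up).
     * The thread is only woken once every outstanding command on the
     * host has been marked failed (host_busy == host_failed).  If one
     * in-flight command never completes and is never marked failed,
     * the thread keeps sleeping and nothing gets flushed.
     */
    #include <stdio.h>

    struct fake_host {
            int host_busy;          /* commands currently outstanding     */
            int host_failed;        /* commands queued for error recovery */
    };

    static void maybe_wake_eh(const struct fake_host *h)
    {
            if (h->host_busy == h->host_failed)
                    printf("Waking error handler thread (busy=%d failed=%d)\n",
                           h->host_busy, h->host_failed);
            else
                    printf("error handler stays asleep (busy=%d failed=%d)\n",
                           h->host_busy, h->host_failed);
    }

    int main(void)
    {
            struct fake_host h = { .host_busy = 2, .host_failed = 1 };

            maybe_wake_eh(&h);      /* one command still in flight: no wake-up */

            h.host_failed = 2;      /* the last outstanding command also fails */
            maybe_wake_eh(&h);      /* now busy == failed: thread is woken     */
            return 0;
    }

If that model is right, the interesting question for the stuck hosts is
which command is still being counted as busy but never made it into error
recovery, which is what I am trying to narrow down next.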