Hi, I am reaching out regarding a SCSI error handler issue that we are seeing on our lab systems.

Environment: Ubuntu 14.04, kernel 4.4.0-31-generic

Issue: Write IOs randomly get stuck on our systems. All drives seen with the issue are SATA drives.

Suspected root cause: the SCSI error handler is not woken up, or it is up but unable to flush commands, and hence the IOs get stuck (the requests are never flushed out and the host is not restarted).

We have seen 6 instances so far. This is what we observe:

1) As part of testing our drives, we start a lot of reads/writes, along with some user-space utilities that check drive health.

2) In our recent testing, we see a lot of commands going through the "abort scheduled" path. At some point, for one of the commands, the error handler is not woken up, and the majority of the writing processes stall.

3) We are still trying to figure out what is causing this many aborts. Is this usual? Are there other debug options I could enable to get more information? We know these are user-space commands being aborted, but we are not yet sure exactly which command.

4) I see the "abort scheduled" messages on almost all drives in all systems, so I don't believe it is a drive issue.

5) I checked the host_failed and host_busy variables; both are set to 1 on system 1. System 2 is still alive; we haven't taken a crash dump yet.

6) All drives seen with the issue are SATA drives.

7) I have attached the udevadm info of a drive that failed on system 2.

Thanks in advance,
Viswesh

Logs
-------

System 1

[371546.605736] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
[371546.606727] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
[371546.607721] sd 9:0:0:0: scsi_block_when_processing_errors: rtn: 1
[371546.618113] sd 9:0:0:0: [sg19] tag#20 abort scheduled
[371546.624039] sd 9:0:0:0: [sg19] tag#20 aborting command
[371546.624045] sd 9:0:0:0: [sg19] tag#20 cmd abort failed
[371546.624051] scsi host9: Waking error handler thread
[371546.624060] scsi host9: scsi_eh_9: waking up 0/1/1
[371546.624081] sd 9:0:0:0: [sg19] tag#20 scsi_eh_9: flush finish cmd
[371546.624095] scsi host9: waking up host to restart
[371546.624098] scsi host9: scsi_eh_9: sleeping
[371546.635314] sd 8:0:0:0: [sg17] tag#13 abort scheduled
[371546.640031] sd 8:0:0:0: [sg17] tag#13 aborting command
[371546.640037] sd 8:0:0:0: [sg17] tag#13 cmd abort failed
[371546.640043] scsi host8: Waking error handler thread
[371546.640078] scsi host8: scsi_eh_8: waking up 0/1/1
[371546.640098] sd 8:0:0:0: [sg17] tag#13 scsi_eh_8: flush finish cmd
[371546.640113] scsi host8: waking up host to restart
[371546.640117] scsi host8: scsi_eh_8: sleeping
[371546.657269] sd 6:0:0:0: [sg12] tag#1 abort scheduled
[371546.664034] sd 6:0:0:0: [sg12] tag#1 aborting command
[371546.664040] sd 6:0:0:0: [sg12] tag#1 cmd abort failed

Here we can see that for sd 6:0:0:0 (tag#1) the abort fails but the error handler is never woken up (contrast hosts 9 and 8 above, where "Waking error handler thread" follows the failed abort), and the entire IO subsystem goes for a toss.
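For reference, the gate that decides whether the handler thread gets a wakeup is scsi_eh_wakeup() in drivers/scsi/scsi_error.c. Lightly paraphrased from the v4.4-era source (tracepoint omitted; please check against your exact tree):

	/* drivers/scsi/scsi_error.c, v4.4-era (paraphrased).
	 * Called with shost->host_lock held: the handler thread is only
	 * woken once every command still outstanding on the host has
	 * been marked failed. */
	void scsi_eh_wakeup(struct Scsi_Host *shost)
	{
		if (atomic_read(&shost->host_busy) == shost->host_failed) {
			wake_up_process(shost->ehandler);
			SCSI_LOG_ERROR_RECOVERY(5, shost_printk(KERN_INFO, shost,
				"Waking error handler thread\n"));
		}
	}

If host_busy was still above host_failed at the moment the failed abort for host6 escalated, the wake is skipped and has to happen later, as the remaining commands complete; the fact that we end up stuck with host_busy == host_failed == 1 suggests that later wake never arrives. That is part of why we enabled the extra logging on system 2 below.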
About four minutes later, the hung-task watchdog fires:

[371777.594510] INFO: task md2_raid1:508 blocked for more than 120 seconds.
[371777.594571] Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
[371777.594613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[371777.594665] md2_raid1 D ffff883fed2afc78 0 508 2 0x00000000
[371777.594673] ffff883fed2afc78 ffff881fef0e8000 ffff883fed21c4c0 ffff883fed2b0000
[371777.594678] ffff883ff236e298 ffff883ff236e000 ffff883ff236e018 ffff883ff236e018

By enabling more SCSI logging on a different system, including the LL (low-level) categories, we get some more information. (I have enabled the full SCSI log levels for all categories, and that system is still running.)

System 2

[405197.171574] sd 5:0:0:0: [sg3] sg_write: count=88
[405197.171577] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
[405197.171581] sd 5:0:0:0: [sg3] sg_common_write: scsi opcode=0x85, cmd_size=16
[405197.171583] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=512
[405197.171605] sd 5:0:0:0: [sg3] sg_link_reserve: size=512
[405197.171623] sd 5:0:0:0: [sg3] sg_poll: res=0x104
[405197.172648] sd 5:0:0:0: [sg3] sg_cmd_done: pack_id=0, res=0x0
[405197.172701] sd 5:0:0:0: [sg3] sg_poll: res=0x145
[405197.172710] sd 5:0:0:0: [sg3] sg_read: count=88
[405197.172716] sd 5:0:0:0: [sg3] sg_finish_rem_req: res_used=1
[405197.172721] sd 5:0:0:0: [sg3] sg_unlink_reserve: req->k_use_sg=1
[405197.172756] sd 5:0:0:0: [sg3] sg_write: count=88
[405197.172760] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
[405197.172764] sd 5:0:0:0: [sg3] sg_common_write: scsi opcode=0x85, cmd_size=16
[405197.172767] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=512
[405197.172774] sd 5:0:0:0: [sg3] sg_link_reserve: size=512
[405197.172793] sd 5:0:0:0: [sg3] sg_poll: res=0x104
[405197.173806] sd 5:0:0:0: [sg3] sg_cmd_done: pack_id=0, res=0x0
[405197.173829] sd 5:0:0:0: [sg3] sg_poll: res=0x145
[405197.173836] sd 5:0:0:0: [sg3] sg_read: count=88
[405197.173839] sd 5:0:0:0: [sg3] sg_finish_rem_req: res_used=1
[405197.173843] sd 5:0:0:0: [sg3] sg_unlink_reserve: req->k_use_sg=1
[405197.173893] sd 5:0:0:0: [sg3] sg_write: count=88
[405197.173896] sd 5:0:0:0: scsi_block_when_processing_errors: rtn: 1
[405197.173900] sd 5:0:0:0: [sg3] sg_common_write: scsi opcode=0x85, cmd_size=16
[405197.173903] sd 5:0:0:0: [sg3] sg_start_req: dxfer_len=0
[405197.173921] sd 5:0:0:0: [sg3] sg_poll: res=0x104
[405197.174082] sd 5:0:0:0: [sg3] tag#5 abort scheduled
[405197.179372] sd 5:0:0:0: [sg3] tag#5 aborting command
[405197.179378] sd 5:0:0:0: [sg3] tag#5 cmd abort failed
[405257.225584] sd 5:0:0:0: [sg3] sg_poll: res=0x104
[405257.225599] sd 5:0:0:0: [sg3] sg_release
[405257.225617] sd 5:0:0:0: [sg3] sg_open: flags=0x8002

The error handler thread is not woken up here either; or, if it was already awake, I do not see any "flush finish cmd" trace.

[405349.082863] INFO: task md2_raid1:484 blocked for more than 120 seconds.
[405349.082922] Not tainted 4.4.0-31-generic #50~14.04.1-Ubuntu
[405349.082964] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[405349.083016] md2_raid1 D ffff881feca2bc78 0 484 2 0x00000000
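For what it is worth, the trace above pins down the shape of the user-space commands from item 3: count=88 is sizeof(struct sg_io_hdr) on 64-bit, written to /dev/sg3 via the sg write()/poll()/read() interface; opcode 0x85 with cmd_size=16 is ATA PASS-THROUGH(16); and dxfer_len=512 matches a one-sector SMART-style read, typical of drive-health utilities. Below is a minimal sketch of that pattern; the device node and the SMART READ DATA task file are my assumptions, only the header size, opcode, and transfer length come from the log:

	/* Sketch of the sg v3 write/poll/read pattern seen in the trace.
	 * Assumed: /dev/sg3 and that 0x85 wraps ATA SMART READ DATA. */
	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <scsi/sg.h>

	int main(void)
	{
		unsigned char cdb[16] = { 0x85 };	/* ATA PASS-THROUGH (16) */
		unsigned char data[512];
		unsigned char sense[32];
		struct sg_io_hdr hdr;
		struct pollfd pfd;

		int fd = open("/dev/sg3", O_RDWR);	/* assumed device node */
		if (fd < 0) { perror("open"); return 1; }

		cdb[1] = 4 << 1;	/* protocol: PIO data-in */
		cdb[2] = 0x0e;		/* T_DIR=in, BYT_BLOK=1, T_LENGTH=sector count */
		cdb[4] = 0xd0;		/* features: SMART READ DATA */
		cdb[6] = 1;		/* sector count: 1 (512 bytes) */
		cdb[10] = 0x4f;		/* LBA mid: SMART signature */
		cdb[12] = 0xc2;		/* LBA high: SMART signature */
		cdb[14] = 0xb0;		/* ATA command: SMART */

		memset(&hdr, 0, sizeof(hdr));	/* sizeof(hdr) == 88 on 64-bit */
		hdr.interface_id = 'S';
		hdr.cmd_len = sizeof(cdb);	/* cmd_size=16 in the log */
		hdr.cmdp = cdb;
		hdr.dxfer_direction = SG_DXFER_FROM_DEV;
		hdr.dxfer_len = sizeof(data);	/* dxfer_len=512 in the log */
		hdr.dxferp = data;
		hdr.mx_sb_len = sizeof(sense);
		hdr.sbp = sense;
		hdr.pack_id = 0;		/* pack_id=0 in the log */
		hdr.timeout = 5000;		/* ms; the abort fires on expiry */

		/* sg_write: count=88 */
		if (write(fd, &hdr, sizeof(hdr)) != sizeof(hdr)) {
			perror("write"); return 1;
		}

		/* sg_poll: wait for the response to become readable */
		pfd.fd = fd;
		pfd.events = POLLIN;
		poll(&pfd, 1, -1);

		/* sg_read: count=88 */
		if (read(fd, &hdr, sizeof(hdr)) != sizeof(hdr)) {
			perror("read"); return 1;
		}
		printf("status=0x%x host_status=0x%x driver_status=0x%x\n",
		       hdr.status, hdr.host_status, hdr.driver_status);
		close(fd);
		return 0;
	}

If a command issued this way times out while the error handler is stuck, the poll() above never returns, which matches the 60-second gap between "cmd abort failed" and the next sg_poll/sg_release in the System 2 trace.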
Attachment: udevadm_sdd_130.2 (binary data)