Hang in __scsi_execute / scsi_block_when_processing_errors

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi linux-scsi list,

I recently encountered the following kernel hang (i.e. all accessing processes got stuck in interruptible IO wait) when an SSD attached to an AHCI controller failed catastrophically. That disk was part of a software RAID which got taken down by that and only recovered after a reboot. I captured all kernel call stacks of the stuck threads and the only ones I think are related to disk access are these two:

[15311630.338000] task:smart-exporter state:D stack: 0 pid:1519832 ppid: 1 flags:0x00000080
[15311630.347127] Call Trace:
[15311630.350345] __schedule+0x457/0x860
[15311630.354722] schedule+0x46/0xb0
[15311630.358740] scsi_block_when_processing_errors+0xb7/0xc0 [scsi_mod]
[15311630.365882] ? wait_woken+0x70/0x70
[15311630.370267] sd_print_result+0x2b18/0x3500 [sd_mod]
[15311630.376034] __blkdev_get+0x22c/0x620
[15311630.380580] blkdev_get+0x40/0xb0
[15311630.384753] ? bd_acquire+0xc0/0xc0
[15311630.389103] do_dentry_open+0x14b/0x360
[15311630.393796] path_openat+0xa57/0x1010
[15311630.398327] ? perf_mux_hrtimer_restart+0x3f/0xb0
[15311630.403912] ? perf_event_update_userpage+0xd8/0x120
[15311630.409696] do_filp_open+0x91/0x100
[15311630.413872] ? perf_mux_hrtimer_handler+0x130/0x370
[15311630.419367] ? perf_event_addr_filters_sync+0x80/0x80
[15311630.425006] ? __check_object_size+0x136/0x150
[15311630.430035] do_sys_openat2+0x222/0x2e0
[15311630.434675] do_sys_open+0x44/0x80
[15311630.438665] do_syscall_64+0x33/0x40
[15311630.442814] entry_SYSCALL_64_after_hwframe+0x44/0xa9

That is smart-exporter (https://github.com/lorenz/smart-exporter) getting stuck even though it sets a timeout of 2s on its SCSI operation.

[15311639.315986] task:bash state:D stack: 0 pid:809422 ppid:809421 flags:0x00004084
[15311639.325262] Call Trace:
[15311639.328583] __schedule+0x457/0x860
[15311639.332880] ? lock_timer_base+0x61/0x80
[15311639.337625] schedule+0x46/0xb0
[15311639.341372] schedule_timeout+0x199/0x2a0
[15311639.345979] ? __next_timer_interrupt+0x110/0x110
[15311639.351275] io_schedule_timeout+0x19/0x40
[15311639.355990] wait_for_completion_io_timeout+0x8b/0x100
[15311639.361724] blk_execute_rq+0x69/0xa0
[15311639.365982] __scsi_execute+0xe8/0x250 [scsi_mod]
[15311639.371286] sd_print_result+0x2f6b/0x3500 [sd_mod]
[15311639.376764] sd_print_result+0x33c7/0x3500 [sd_mod]
[15311639.382235] sd_print_result+0x34aa/0x3500 [sd_mod]
[15311639.387718] device_release_driver_internal+0xf6/0x1d0
[15311639.393454] bus_remove_device+0xdb/0x140
[15311639.398049] device_del+0x18b/0x420
[15311639.402128] ? ata_std_error_handler+0xd0/0x720 [libata]
[15311639.408030] ? attribute_container_device_trigger+0xb4/0xf0
[15311639.414206] __scsi_remove_device+0x118/0x150 [scsi_mod]
[15311639.420164] scsi_remove_device+0x22/0xc0 [scsi_mod]
[15311639.425731] scsi_remove_device+0x7f/0xc0 [scsi_mod]
[15311639.431307] kernfs_fop_write_iter+0x128/0x1c0
[15311639.436360] new_sync_write+0x11f/0x1b0
[15311639.440835] vfs_write+0x1c0/0x280
[15311639.444823] ksys_write+0x5f/0xe0
[15311639.448770] do_syscall_64+0x33/0x40
[15311639.452926] entry_SYSCALL_64_after_hwframe+0x44/0xa9

This is me trying to manually release the SCSI device (i.e. echo 1 > /sys/.../scsi_device/.../delete). I sadly don't have call stacks from before trying to do this.

Kernel log excerpt (tons of those in a very short time):
[15307375.115712] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[15307375.122465] ata3.00: irq_stat 0x40000001
[15307375.126716] ata3.00: failed command: FLUSH CACHE
[15307375.131628] ata3.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 23 res 51/04:00:00:00:00/00:00:00:00:00/a0 Emask 0x1 (device error)
[15307375.146278] ata3.00: status: { DRDY ERR }
[15307375.150596] ata3.00: error: { ABRT }
[15307375.154691] ata3.00: failed to read native max address (err_mask=0x1) [15307375.161442] ata3.00: HPA support seems broken, skipping HPA handling [15307375.168099] ata3.00: supports DRM functions and may not be fully accessible
[15307375.175377] ata3.00: failed to enable AA (error_mask=0x1)
[15307375.181122] ata3.00: NCQ Send/Recv Log not supported
[15307375.186435] ata3.00: failed to get Identify Device Data, Emask 0x1
[15307375.186467] ata3.00: ATA Identify Device Log not supported
[15307375.192273] ata3.00: Security Log not supported
[15307375.197244] ata3.00: supports DRM functions and may not be fully accessible
[15307375.204513] ata3.00: failed to enable AA (error_mask=0x1)
[15307375.210227] ata3.00: NCQ Send/Recv Log not supported
[15307375.215504] ata3.00: failed to get Identify Device Data, Emask 0x1
[15307375.215521] ata3.00: ATA Identify Device Log not supported
[15307375.221311] ata3.00: Security Log not supported
[15307375.226131] ata3.00: configured for UDMA/133 (device error ignored)
[15307375.233011] ata3.00: device reported invalid CHS sector 0
[15307375.239165] ata3: EH complete

[15307375.912022] sd 3:0:0:0: [sdc] tag#18 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s [15307375.922363] sd 3:0:0:0: [sdc] tag#18 Sense Key : Illegal Request [current] [15307375.930014] sd 3:0:0:0: [sdc] tag#18 Add. Sense: Unaligned write command [15307375.937707] sd 3:0:0:0: [sdc] tag#18 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00 [15307375.946906] blk_update_request: I/O error, dev sdc, sector 0 op 0x0:(READ) flags 0x800 phys_seg 0 prio class 0 [15307375.957992] zio pool=hdd8x3 vdev=/dev/disk/by-id/ata-Samsung_SSD_850_EVO_120GB_S21UNXAGC03019B-part1 error=5 type=5 offset=0 size=0 flags=100480
[15307375.972083] ata3: EH complete
[15307376.742718] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[15307376.749882] ata3.00: irq_stat 0x40000001
[15307376.754518] ata3.00: failed command: WRITE DMA
[15307376.759749] ata3.00: cmd ca/00:08:e0:5b:17/00:00:00:00:00/e0 tag 2 dma 4096 out res 51/04:08:e0:5b:17/00:00:00:00:00/e0 Emask 0x1 (device error)
[15307376.776310] ata3.00: status: { DRDY ERR }
[15307376.781371] ata3.00: error: { ABRT }
[15307376.785817] ata3.00: supports DRM functions and may not be fully accessible
[15307376.793544] ata3.00: failed to enable AA (error_mask=0x1)
[15307376.799834] ata3.00: NCQ Send/Recv Log not supported
[15307376.805610] ata3.00: failed to get Identify Device Data, Emask 0x1
[15307376.805630] ata3.00: ATA Identify Device Log not supported
[15307376.811841] ata3.00: Security Log not supported
[15307376.817221] ata3.00: supports DRM functions and may not be fully accessible
[15307376.824942] ata3.00: failed to enable AA (error_mask=0x1)
[15307376.831139] ata3.00: NCQ Send/Recv Log not supported
[15307376.836839] ata3.00: failed to get Identify Device Data, Emask 0x1
[15307376.836859] ata3.00: ATA Identify Device Log not supported
[15307376.843070] ata3.00: Security Log not supported
[15307376.848308] ata3.00: configured for UDMA/133 (device error ignored)

Any idea why and where this locked up instead of either dropping the disk entirely (it could no longer be identified after all) or just rejected all IO (then the software RAID would've just thrown it out instead of hanging)?

Regards,
Lorenz





[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux