SCSI layer RPM deadlock debug suggestion

Hi guys,

We're experiencing a deadlock between trying to remove a SATA device and doing a rescan in scsi_rescan_device().

I'm just looking for a suggestion on how to solve it.

The background is that the host (hisi sas v3 hw) uses the SAS SCSI transport and supports runtime PM (RPM). In the testcase, the host and disks are first suspended. We then run fio on the disk to make them active again, and immediately hard reset the disk link, which causes the disk to be disconnected (please don't ask why ...).

We find that there is a conflict between the rescan and the device removal code, resulting in a deadlock:

[ 607.429281] Call trace:
[ 607.433083] __switch_to+0x164/0x1d4
[ 607.437596] __schedule+0x8f8/0x1450
[ 607.441183] schedule+0x7c/0x110
[ 607.444422] blk_queue_enter+0x290/0x490
[ 607.448358] blk_mq_alloc_request+0x50/0xb4
[ 607.452547] blk_get_request+0x38/0x80
[ 607.456305] __scsi_execute+0x6c/0x1c4
[ 607.460064] scsi_vpd_inquiry+0x88/0xf0
[ 607.463908] scsi_get_vpd_buf+0x68/0xb0
[ 607.467752] scsi_attach_vpd+0x58/0x170
[ 607.471596] scsi_rescan_device+0x40/0xac
[ 607.475612] ata_scsi_dev_rescan+0xb4/0x14c
[ 607.479802] process_one_work+0x29c/0x6fc
[ 607.483819] worker_thread+0x80/0x470
[ 607.487489] kthread+0x15c/0x170
[ 607.490727] ret_from_fork+0x10/0x18

sas_phy_event_worker [libsas]
[ 607.529831] Call trace:
[ 607.532312] __switch_to+0x164/0x1d4
[ 607.535900] __schedule+0x8f8/0x1450
[ 607.539484] schedule+0x7c/0x110
[ 607.542724] schedule_preempt_disabled+0x30/0x4c
[ 607.547345] __mutex_lock+0x308/0x8b0
[ 607.551016] mutex_lock_nested+0x44/0x70
[ 607.554947] device_del+0x4c/0x450
[ 607.558341] __scsi_remove_device+0x11c/0x14c
[ 607.562702] scsi_remove_target+0x1bc/0x240
[ 607.566891] sas_rphy_remove+0x90/0x94
[ 607.570649] sas_rphy_delete+0x24/0x40
[ 607.574388] sas_destruct_devices+0x64/0xa0 [libsas]
[ 607.579359] sas_deform_port+0x178/0x1bc [libsas]
[ 607.584069] sas_phye_loss_of_signal+0x28/0x34 [libsas]
[ 607.589298] sas_phy_event_worker+0x34/0x50 [libsas]
[ 607.594268] process_one_work+0x29c/0x6fc
[ 607.598284] worker_thread+0x80/0x470
[ 607.601955] kthread+0x15c/0x170
[ 607.605193] ret_from_fork+0x10/0x18
[ 607.608845] INFO: task fio:3382 blocked for more than 121

The rescan holds the device lock of sdev->sdev_gendev, taken in scsi_rescan_device(), while the removal code wants to grab the same lock in device_del(), called from __scsi_remove_device().

However, the rescan will not release the lock until the blk_queue_enter() call in the first trace above succeeds. That can happen in two ways:

- the queue is dying. That would not happen until after the device_del() in __scsi_remove_device(), which is itself blocked on the lock, so it is not going to happen.

- q->pm_only falls to 0. This would happen when scsi_runtime_resume() -> sdev_runtime_resume() -> blk_post_runtime_resume(err = 0) -> blk_set_runtime_active() is called. However, I find that the err argument in my case is -5 (-EIO), which comes from sdev_runtime_resume() -> pm->runtime_resume (= sd_resume()). That sd_resume() -> sd_start_stop_device() call fails because the disk is no longer attached, so the device goes into the runtime PM error state:

$:more /sys/devices/pci0000:b4/0000:b4:04.0/host3/port-3:0/end_device-3:0/target3:0:0/3:0:0:0/power/runtime_status
error

Removing commit e27829dc92e5 ("scsi: serialize ->rescan against ->remove") solves this issue for me, but that is there for a reason.
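
To spell out the reasoning above, here is a toy user-space model in Python. It is not kernel code: Model, run(), try_remove() and the other names are made up for illustration, and it only encodes the conditions described in this mail (the rescan waits until pm_only is 0 or the queue is dying; the removal needs the device lock; the queue only becomes dying once the removal gets past device_del()).

```python
from dataclasses import dataclass


@dataclass
class Model:
    """Toy state for the rescan/remove interaction (hypothetical, not kernel code)."""
    rescan_holds_device_lock: bool  # rescan path took the device lock
    pm_only: int = 1                # stands in for q->pm_only
    queue_dying: bool = False       # only set once the removal has run
    removed: bool = False


def post_runtime_resume(m: Model, err: int) -> None:
    # As described above: pm_only only drops when the resume succeeded.
    if err == 0 and m.pm_only > 0:
        m.pm_only -= 1


def rescan_can_enter_queue(m: Model) -> bool:
    # The two ways out of the wait in blk_queue_enter() listed above.
    return m.queue_dying or m.pm_only == 0


def try_remove(m: Model) -> bool:
    # The removal needs the device lock; the queue is marked dying afterwards.
    if m.rescan_holds_device_lock:
        return False
    m.removed = True
    m.queue_dying = True
    return True


def run(serialized: bool, resume_err: int) -> str:
    m = Model(rescan_holds_device_lock=serialized)
    post_runtime_resume(m, resume_err)
    if rescan_can_enter_queue(m):
        m.rescan_holds_device_lock = False  # rescan finishes and drops the lock
        try_remove(m)
        return "ok"
    # Rescan is stuck; its only other way out is the removal killing the queue.
    if try_remove(m) and rescan_can_enter_queue(m):
        return "ok"
    return "deadlock"
```

In this model, run(serialized=True, resume_err=-5) is the case I am hitting, while run(serialized=False, resume_err=-5) corresponds to the rescan not holding the lock, as with the commit above removed. It says nothing about the race that commit was added to close.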

Any suggestion on how to fix this deadlock?

Thanks,
John


