Tetsuo, thank you very much for your comments. They clarified one of the
points I was misunderstanding.

On Sep 30, 2022 / 20:06, Tetsuo Handa wrote:
> On 2022/09/30 9:19, Shinichiro Kawasaki wrote:
> > Kernel v6.0-rc7 triggered the WARN "possible circular locking dependency
> > detected" at blktests test case block/011 for NVME devices [1]. The trigger
> > commit was c0feea594e05 ("workqueue: don't skip lockdep work dependency in
> > cancel_work_sync()"). The commit was backported to v5.15.71 stable kernel and
> > the WARN is observed with the stable kernel also. It looks that the possible
> > deadlock scenario has been hidden for long time and the commit revealed it now.
>
> Right.
>
> > I tried to understand the deadlock scenario, but can not tell if it is for real,
> > or false-positive. Help to understand the scenario will be appreciated. The
> > lockdep splat mentions three locks (ctrl->namespaces_rwsem, dev->shutdown_lock
> > and q->timeout_work).
>
> Right.
>
> This is a circular locking problem which involves three locks.
>
>   (A) (work_completion)(&q->timeout_work) --> &dev->shutdown_lock
>
>   (B) &dev->shutdown_lock --> &ctrl->namespaces_rwsem
>
>   (C) &ctrl->namespaces_rwsem --> (work_completion)(&q->timeout_work)
>
> (A) and (B) happens due to
>
>   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
>
>   blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
>     blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
>       bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
>         blk_mq_rq_timed_out() {
>           req->q->mq_ops->timeout(req) == nvme_timeout(req) {
>             nvme_dev_disable() {
>               mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
>               nvme_start_freeze(&dev->ctrl) {
>                 down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>                 up_read(&ctrl->namespaces_rwsem);
>               }
>               mutex_unlock(&dev->shutdown_lock);
>             }
>           }
>         }
>       }
>     }
>   }
>
> (C) happens due to
>
>   nvme_sync_queues(ctrl) {
>     nvme_sync_io_queues(ctrl) {
>       down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>       /*** Question comes here. Please check the last block in this mail. ***/
>       blk_sync_queue(ns->queue) {
>         cancel_work_sync(&q->timeout_work) {
>           __flush_work(&q->timeout_work, true) {
>             // Might wait for completion of blk_mq_timeout_work() with ctrl->namespaces_rwsem held.
>           }
>         }
>       }
>       up_read(&ctrl->namespaces_rwsem);
>     }
>   }
>
> > In the related function call paths, it looks that
> > ctrl->namespaces_rwsem is locked only for read, so the deadlock scenario does
> > not look possible for me.
>
> Right.
>
> > (Maybe I'm misunderstanding something...)
>
> Although ctrl->namespaces_rwsem is a rw-sem and is held for read in these paths,
> there are paths which hold ctrl->namespaces_rwsem for write. If someone is trying
> to hold that rw-sem for write, these down_read(&ctrl->namespaces_rwsem) will be blocked.

Ah, that's the point. If a write lock request comes in parallel, between the
two read locks, then the read locks can cause a circular deadlock.

> That is, it also depends on whether
>   down_read(&ctrl->namespaces_rwsem);
> and
>   down_write(&ctrl->namespaces_rwsem);
> might run in parallel.
>
> Question: Does someone else call down_write(&ctrl->namespaces_rwsem) immediately after
> someone's down_read(&ctrl->namespaces_rwsem) succeeded?
>
> nvme_alloc_ns()/nvme_ns_remove()/nvme_remove_invalid_namespaces()/nvme_remove_namespaces() calls
> down_write(&ctrl->namespaces_rwsem), and some of these are called from work callback functions
> which might run in parallel with other work callback functions?
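
Before answering, let me restate the rw-sem behavior you describe (a pending
writer blocking later readers) with a small, purely hypothetical userspace
sketch. This is pthreads, not NVMe or block layer code; the rwlock kind,
thread names and timings are all invented for illustration. It only shows the
generic effect: once a writer is queued behind an existing reader, a later
reader blocks as well, which is what would let the timeout path's down_read()
get stuck.

/*
 * Hypothetical userspace sketch (not NVMe or block layer code): with a
 * writer-preferring rwlock, a reader arriving while a writer is already
 * queued gets blocked, even though the lock is only held for read at
 * that moment.  Thread names, timings and the rwlock kind are made up.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t rwsem;

static void *writer(void *arg)
{
    sleep(1);                          /* arrive after the first reader */
    printf("writer: waiting for the write lock\n");
    pthread_rwlock_wrlock(&rwsem);     /* blocks: a reader holds the lock */
    printf("writer: got the write lock\n");
    pthread_rwlock_unlock(&rwsem);
    return NULL;
}

static void *late_reader(void *arg)
{
    struct timespec ts;

    sleep(2);                          /* arrive after the writer queued */
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += 2;
    printf("late reader: trying the read lock\n");
    if (pthread_rwlock_timedrdlock(&rwsem, &ts) == ETIMEDOUT)
        printf("late reader: blocked behind the pending writer\n");
    else
        pthread_rwlock_unlock(&rwsem);
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_t w, r;

    pthread_rwlockattr_init(&attr);
    /* glibc extension: do not let new readers overtake a waiting writer */
    pthread_rwlockattr_setkind_np(&attr,
            PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&rwsem, &attr);

    pthread_rwlock_rdlock(&rwsem);     /* first reader, like the sync path */
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, late_reader, NULL);

    sleep(5);                          /* keep the first read lock held */
    pthread_rwlock_unlock(&rwsem);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

Built with something like "gcc -pthread", the late reader's timedrdlock times
out while the writer is still pending, and everything completes once the
first read lock is dropped.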

Though I'm new to the NVME code, let me try to answer your question based on
my understanding. (Comments from NVME experts are more valuable, so please
chime in.)

The NVME host driver manages dev->ctrl.state so that conflicting works do not
run at the same time. Scan work, delete work and reset work therefore do not
run in parallel in most cases. So, in most cases, the write lock taken by the
scan/delete work does not happen in parallel with the reset work, which calls
nvme_sync_queues() and takes the read lock.

Having said that, I can still think of a rare scenario [1] in which the write
lock is taken after the read lock in the reset work context. So I think the
answer to your question is "Yes". From this point of view, the deadlock
scenario is very rare but still looks possible, in theory.

BTW, I came up with another question while reading the code. I found that
nvme_reset_work() calls nvme_dev_disable() before nvme_sync_queues(). So I
think the NVME controller is already disabled when the reset work calls
nvme_sync_queues(). Then another nvme_dev_disable() call in the I/O timeout
path does not result in nvme_start_freeze() and the second read lock. From
this point of view, the deadlock scenario does not look possible. If this
guess is correct, why does the lockdep splat show the deadlock scenario??

In addition, I found that nvme_reset_prepare() and nvme_suspend() also call
nvme_sync_queues(). These functions are not in the lockdep splat, but may
have similar deadlock scenarios. I will spend some more time on them
tomorrow.

[1]

TASK A (scan work)                          TASK B (reset work)
--------------------------------------------------------------------------------
nvme_sysfs_rescan()
  nvme_queue_scan()
    check status is LIVE
    queue scan work
nvme_scan_work()
  check status is LIVE
                                            nvme_sysfs_reset()
                                              nvme_reset_ctrl()
                                                change status to RESETTING
                                                queue reset work
                                            nvme_reset_work()
                                              check status is RESETTING
                                              nvme_sync_queues()
                                                read lock ctrl->namespaces_rwsem
  write lock ctrl->namespaces_rwsem

--
Shin'ichiro Kawasaki
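
P.S. To double-check my reading of scenario [1], here is one more tiny
hypothetical userspace sketch (again not kernel code; the state names, sleeps
and messages are invented). It only illustrates why the state check alone
does not close the window: the scan side samples the state while it is LIVE,
the reset side then moves it to RESETTING, and both sides still go on to take
the namespaces lock afterwards.

/*
 * Hypothetical sketch of the check-then-act window in scenario [1]
 * (not kernel code; state names, sleeps and messages are invented).
 * The scan side samples the state while it is LIVE, the reset side
 * then moves it to RESETTING, and both sides still proceed afterwards.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

enum ctrl_state { LIVE, RESETTING };

static _Atomic int state = LIVE;

static void *scan_side(void *arg)
{
    if (atomic_load(&state) != LIVE)   /* "check status is LIVE" */
        return NULL;
    sleep(2);                          /* the state changes meanwhile */
    printf("scan side: passed the LIVE check earlier, now taking the write lock (state=%d)\n",
           atomic_load(&state));
    return NULL;
}

static void *reset_side(void *arg)
{
    sleep(1);                          /* runs after the scan side's check */
    atomic_store(&state, RESETTING);   /* "change status to RESETTING" */
    printf("reset side: state is RESETTING, taking the read lock in the sync path\n");
    return NULL;
}

int main(void)
{
    pthread_t scan, reset;

    pthread_create(&scan, NULL, scan_side, NULL);
    pthread_create(&reset, NULL, reset_side, NULL);
    pthread_join(scan, NULL);
    pthread_join(reset, NULL);
    return 0;
}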