Tetsuo, thank you very much for your comments. They clarified one of the
points I was misunderstanding.

On Sep 30, 2022 / 20:06, Tetsuo Handa wrote:
> On 2022/09/30 9:19, Shinichiro Kawasaki wrote:
> > Kernel v6.0-rc7 triggered the WARN "possible circular locking dependency
> > detected" at blktests test case block/011 for NVME devices [1]. The trigger
> > commit was c0feea594e05 ("workqueue: don't skip lockdep work dependency in
> > cancel_work_sync()"). The commit was backported to v5.15.71 stable kernel and
> > the WARN is observed with the stable kernel also. It looks that the possible
> > deadlock scenario has been hidden for long time and the commit revealed it now.
>
> Right.
>
> > I tried to understand the deadlock scenario, but can not tell if it is for real,
> > or false-positive. Help to understand the scenario will be appreciated. The
> > lockdep splat mentions three locks (ctrl->namespaces_rwsem, dev->shutdown_lock
> > and q->timeout_work).
>
> Right.
>
> This is a circular locking problem which involves three locks.
>
>   (A) (work_completion)(&q->timeout_work) --> &dev->shutdown_lock
>
>   (B) &dev->shutdown_lock --> &ctrl->namespaces_rwsem
>
>   (C) &ctrl->namespaces_rwsem --> (work_completion)(&q->timeout_work)
>
> (A) and (B) happens due to
>
>   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
>
>   blk_mq_timeout_work(work) { // Is blocking __flush_work() from cancel_work_sync().
>     blk_mq_queue_tag_busy_iter(blk_mq_check_expired) {
>       bt_for_each(blk_mq_check_expired) == blk_mq_check_expired() {
>         blk_mq_rq_timed_out() {
>           req->q->mq_ops->timeout(req) == nvme_timeout(req) {
>             nvme_dev_disable() {
>               mutex_lock(&dev->shutdown_lock); // Holds dev->shutdown_lock
>               nvme_start_freeze(&dev->ctrl) {
>                 down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>                 up_read(&ctrl->namespaces_rwsem);
>               }
>               mutex_unlock(&dev->shutdown_lock);
>             }
>           }
>         }
>       }
>     }
>   }
>
> (C) happens due to
>
>   nvme_sync_queues(ctrl) {
>     nvme_sync_io_queues(ctrl) {
>       down_read(&ctrl->namespaces_rwsem); // Holds ctrl->namespaces_rwsem which might block
>       /*** Question comes here. Please check the last block in this mail. ***/
>       blk_sync_queue(ns->queue) {
>         cancel_work_sync(&q->timeout_work) {
>           __flush_work(&q->timeout_work, true) {
>             // Might wait for completion of blk_mq_timeout_work() with ctrl->namespaces_rwsem held.
>           }
>         }
>       }
>       up_read(&ctrl->namespaces_rwsem);
>     }
>   }
>
> > In the related function call paths, it looks that
> > ctrl->namespaces_rwsem is locked only for read, so the deadlock scenario does
> > not look possible for me.
>
> Right.
>
> > (Maybe I'm misunderstanding something...)
>
> Although ctrl->namespaces_rwsem is a rw-sem and is held for read in these paths,
> there are paths which hold ctrl->namespaces_rwsem for write. If someone is trying
> to hold that rw-sem for write, these down_read(&ctrl->namespaces_rwsem) will be blocked.

Ah, that's the point. If a write lock request comes in parallel, between the
two read locks, then the read locks can cause a circular deadlock.

> That is, it also depends on whether
>   down_read(&ctrl->namespaces_rwsem);
> and
>   down_write(&ctrl->namespaces_rwsem);
> might run in parallel.
>
> Question: Does someone else call down_write(&ctrl->namespaces_rwsem) immediately after
> someone's down_read(&ctrl->namespaces_rwsem) succeeded?
>
> nvme_alloc_ns()/nvme_ns_remove()/nvme_remove_invalid_namespaces()/nvme_remove_namespaces() calls
> down_write(&ctrl->namespaces_rwsem), and some of these are called from work callback functions
> which might run in parallel with other work callback functions?
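
Before answering, let me restate the rw-sem behavior you describe (a pending
writer blocking later readers) with a small, purely hypothetical userspace
sketch. This is pthreads, not NVMe or block layer code; the rwlock kind,
thread names and timings are all invented for illustration. It only shows the
generic effect: once a writer is queued behind an existing reader, a later
reader blocks as well, which is what would let the timeout path's down_read()
get stuck.

/*
 * Hypothetical userspace sketch (not NVMe or block layer code): with a
 * writer-preferring rwlock, a reader arriving while a writer is already
 * queued gets blocked, even though the lock is only held for read at
 * that moment.  Thread names, timings and the rwlock kind are made up.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t rwsem;

static void *writer(void *arg)
{
    sleep(1);                          /* arrive after the first reader */
    printf("writer: waiting for the write lock\n");
    pthread_rwlock_wrlock(&rwsem);     /* blocks: a reader holds the lock */
    printf("writer: got the write lock\n");
    pthread_rwlock_unlock(&rwsem);
    return NULL;
}

static void *late_reader(void *arg)
{
    struct timespec ts;

    sleep(2);                          /* arrive after the writer queued */
    clock_gettime(CLOCK_REALTIME, &ts);
    ts.tv_sec += 2;
    printf("late reader: trying the read lock\n");
    if (pthread_rwlock_timedrdlock(&rwsem, &ts) == ETIMEDOUT)
        printf("late reader: blocked behind the pending writer\n");
    else
        pthread_rwlock_unlock(&rwsem);
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_t w, r;

    pthread_rwlockattr_init(&attr);
    /* glibc extension: do not let new readers overtake a waiting writer */
    pthread_rwlockattr_setkind_np(&attr,
            PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&rwsem, &attr);

    pthread_rwlock_rdlock(&rwsem);     /* first reader, like the sync path */
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, late_reader, NULL);

    sleep(5);                          /* keep the first read lock held */
    pthread_rwlock_unlock(&rwsem);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}

Built with something like "gcc -pthread", the late reader's timedrdlock times
out while the writer is still pending, and everything completes once the
first read lock is dropped.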

Though I'm new to the NVME code, let me try to answer your question based on
my understanding. (Comments from NVME experts are more valuable, so please
chime in.)

The NVME host driver manages dev->ctrl.state so that conflicting works do not
run at the same time. Scan work, delete work and reset work therefore do not
run in parallel in most cases. So, in most cases, the write lock taken by the
scan/delete work does not happen in parallel with the reset work, which calls
nvme_sync_queues() and takes the read lock.

Having said that, I can still think of a rare scenario [1] in which the write
lock is taken after the read lock in the reset work context. So I think the
answer to your question is "Yes". From this point of view, the deadlock
scenario is very rare but still looks possible, in theory.

BTW, I came up with another question while reading the code. I found that
nvme_reset_work() calls nvme_dev_disable() before nvme_sync_queues(). So I
think the NVME controller is already disabled when the reset work calls
nvme_sync_queues(). Then another nvme_dev_disable() call in the I/O timeout
path does not result in nvme_start_freeze() and the second read lock. From
this point of view, the deadlock scenario does not look possible. If this
guess is correct, why does the lockdep splat show the deadlock scenario??

In addition, I found that nvme_reset_prepare() and nvme_suspend() also call
nvme_sync_queues(). These functions are not in the lockdep splat, but may
have similar deadlock scenarios. I will spend some more time on them
tomorrow.

[1]

TASK A (scan work)                          TASK B (reset work)
--------------------------------------------------------------------------------
nvme_sysfs_rescan()
  nvme_queue_scan()
    check status is LIVE
    queue scan work
nvme_scan_work()
  check status is LIVE
                                            nvme_sysfs_reset()
                                              nvme_reset_ctrl()
                                                change status to RESETTING
                                                queue reset work
                                            nvme_reset_work()
                                              check status is RESETTING
                                              nvme_sync_queues()
                                                read lock ctrl->namespaces_rwsem
  write lock ctrl->namespaces_rwsem

--
Shin'ichiro Kawasaki
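
P.S. To double-check my reading of scenario [1], here is one more tiny
hypothetical userspace sketch (again not kernel code; the state names, sleeps
and messages are invented). It only illustrates why the state check alone
does not close the window: the scan side samples the state while it is LIVE,
the reset side then moves it to RESETTING, and both sides still go on to take
the namespaces lock afterwards.

/*
 * Hypothetical sketch of the check-then-act window in scenario [1]
 * (not kernel code; state names, sleeps and messages are invented).
 * The scan side samples the state while it is LIVE, the reset side
 * then moves it to RESETTING, and both sides still proceed afterwards.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

enum ctrl_state { LIVE, RESETTING };

static _Atomic int state = LIVE;

static void *scan_side(void *arg)
{
    if (atomic_load(&state) != LIVE)   /* "check status is LIVE" */
        return NULL;
    sleep(2);                          /* the state changes meanwhile */
    printf("scan side: passed the LIVE check earlier, now taking the write lock (state=%d)\n",
           atomic_load(&state));
    return NULL;
}

static void *reset_side(void *arg)
{
    sleep(1);                          /* runs after the scan side's check */
    atomic_store(&state, RESETTING);   /* "change status to RESETTING" */
    printf("reset side: state is RESETTING, taking the read lock in the sync path\n");
    return NULL;
}

int main(void)
{
    pthread_t scan, reset;

    pthread_create(&scan, NULL, scan_side, NULL);
    pthread_create(&reset, NULL, reset_side, NULL);
    pthread_join(scan, NULL);
    pthread_join(reset, NULL);
    return 0;
}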