Re: [PATCH] bcache: remove unnecessary flush_workqueue


Sorry for the late reply; I had to set up my testing environment. After
several rounds of fio random-write and reboot tests, this patch works
well. Here are the test results.

1. I successfully reproduced the deadlock warning by reverting commit
7e865eba00a3 ("bcache: fix potential deadlock in cached_def_free()").
[  105.448074] ======================================================
[  105.449634] WARNING: possible circular locking dependency detected
[  105.451185] 5.4.191-MANJARO #1 Not tainted
[  105.452221] ------------------------------------------------------
[  105.453742] kworker/9:15/1272 is trying to acquire lock:
[  105.455042] ffff9e9ef1c44348 ((wq_completion)bcache_writeback_wq){+.+.}, at: flush_workqueue+0x8a/0x550
[  105.457482]
[  105.457482] but task is already holding lock:
[  105.458926] ffffbaba01203e68 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x1c3/0x5a0
[  105.461568]
[  105.461568] which lock already depends on the new lock.
[  105.461568]
[  105.463810]
[  105.463810] the existing dependency chain (in reverse order) is:
[  105.465570]
[  105.465570] -> #1 ((work_completion)(&cl->work)#2){+.+.}:
[  105.467230]        process_one_work+0x21a/0x5a0
[  105.468449]        worker_thread+0x52/0x3c0
[  105.469653]        kthread+0x132/0x150
[  105.470695]        ret_from_fork+0x3a/0x50
[  105.471649]
[  105.471649] -> #0 ((wq_completion)bcache_writeback_wq){+.+.}:
[  105.473453]        __lock_acquire+0x105b/0x1c50
[  105.474678]        lock_acquire+0xc4/0x1b0
[  105.476007]        flush_workqueue+0xad/0x550
[  105.477160]        drain_workqueue+0xb6/0x170
[  105.478237]        destroy_workqueue+0x36/0x290
[  105.479328]        cached_dev_free+0x45/0x1e0 [bcache]
[  105.480595]        process_one_work+0x243/0x5a0
[  105.481714]        worker_thread+0x52/0x3c0
[  105.482687]        kthread+0x132/0x150
[  105.483667]        ret_from_fork+0x3a/0x50
[  105.484707]
[  105.484707] other info that might help us debug this:
[  105.484707]
[  105.486856]  Possible unsafe locking scenario:
[  105.486856]
[  105.488314]        CPU0                    CPU1
[  105.489336]        ----                    ----
[  105.490438]   lock((work_completion)(&cl->work)#2);
[  105.491587]                                lock((wq_completion)bcache_writeback_wq);
[  105.493554]                                lock((work_completion)(&cl->work)#2);
[  105.495360]   lock((wq_completion)bcache_writeback_wq);
[  105.496878]
[  105.496878]  *** DEADLOCK ***
[  105.496878]
[  105.498303] 2 locks held by kworker/9:15/1272:
[  105.499373]  #0: ffff9e9f16c21148 ((wq_completion)events){+.+.}, at: process_one_work+0x1c3/0x5a0
[  105.501514]  #1: ffffbaba01203e68 ((work_completion)(&cl->work)#2){+.+.}, at: process_one_work+0x1c3/0x5a0
[  105.504221]
[  105.504221] stack backtrace:
[  105.505271] CPU: 9 PID: 1272 Comm: kworker/9:15 Not tainted 5.4.191-MANJARO #1
[  105.507069] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020
[  105.509485] Workqueue: events cached_dev_free [bcache]
[  105.510805] Call Trace:
[  105.511471]  dump_stack+0x8b/0xbd
[  105.512327]  check_noncircular+0x198/0x1b0
[  105.513462]  __lock_acquire+0x105b/0x1c50
[  105.514675]  lock_acquire+0xc4/0x1b0
[  105.515557]  ? flush_workqueue+0x8a/0x550
[  105.516535]  flush_workqueue+0xad/0x550
[  105.517587]  ? flush_workqueue+0x8a/0x550
[  105.518530]  ? drain_workqueue+0xb6/0x170
[  105.519518]  drain_workqueue+0xb6/0x170
[  105.520572]  destroy_workqueue+0x36/0x290
[  105.521658]  cached_dev_free+0x45/0x1e0 [bcache]
[  105.522879]  process_one_work+0x243/0x5a0
[  105.523854]  worker_thread+0x52/0x3c0
[  105.524734]  ? process_one_work+0x5a0/0x5a0
[  105.525768]  kthread+0x132/0x150
[  105.526594]  ? __kthread_bind_mask+0x60/0x60
[  105.527706]  ret_from_fork+0x3a/0x50
[  105.571043] bcache: bcache_device_free() bcache0 stopped
[  105.730801] bcache: cache_set_free() Cache set 74f01341-0881-4c77-a49b-39fafcabb99e unregistered
[  105.730819] bcache: bcache_reboot() All devices stopped
[  105.930055] reboot: Restarting system
[  105.930996] reboot: machine restart
[  105.932182] ACPI MEMORY or I/O RESET_REG.

2. After applying commit 7e865eba00a3 and this patch, the warning no longer appeared.
[   58.296057] bcache: bcache_reboot() Stopping all devices:
[   58.337730] bcache: bcache_device_free() bcache0 stopped
[   58.484177] bcache: cache_set_free() Cache set 74f01341-0881-4c77-a49b-39fafcabb99e unregistered
[   58.484202] bcache: bcache_reboot() All devices stopped
[   58.731929] reboot: Restarting system
[   58.732845] reboot: machine restart
[   58.734044] ACPI MEMORY or I/O RESET_REG.

Coly Li <colyli@xxxxxxx> wrote on Fri, 13 May 2022 at 23:17:
>
> On 4/9/22 9:17 AM, Li Lei wrote:
> > Coly Li <colyli@xxxxxxx> wrote on Fri, 8 Apr 2022 at 00:54:
> >> On 3/27/22 3:20 PM, Li Lei wrote:
> >>> All pending work items will be drained by destroy_workqueue(), so there
> >>> is no need to call flush_workqueue() explicitly.
> >>>
> >>> Signed-off-by: Li Lei <lilei@xxxxxxxxxxxxxxx>
> >>> ---
> >>>    drivers/md/bcache/writeback.c | 5 ++---
> >>>    1 file changed, 2 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
> >>> index 9ee0005874cd..2a6d9f39a9b1 100644
> >>> --- a/drivers/md/bcache/writeback.c
> >>> +++ b/drivers/md/bcache/writeback.c
> >>> @@ -793,10 +793,9 @@ static int bch_writeback_thread(void *arg)
> >>>                }
> >>>        }
> >>>
> >>> -     if (dc->writeback_write_wq) {
> >>> -             flush_workqueue(dc->writeback_write_wq);
> >>> +     if (dc->writeback_write_wq)
> >>>                destroy_workqueue(dc->writeback_write_wq);
> >>> -     }
> >>> +
> >>>        cached_dev_put(dc);
> >>>        wait_for_kthread_stop();
> >>>
> >> The above code is from commit 7e865eba00a3 ("bcache: fix potential
> >> deadlock in cached_def_free()"). I explicitly added the extra
> >> flush_workqueue() because of workqueue_sysfs_unregister(wq) in
> >> destroy_workqueue().
> >>
> >> workqueue_sysfs_unregister() is not simple, because it calls
> >> device_unregister() and the code path is long. During reboot I was
> >> not sure whether an extra deadlock might be introduced, so the safe
> >> way was to call flush_workqueue() first to wait for all work items
> >> to finish, and then destroy the workqueue.
> >>
> >> About 3 years have passed, and now I am totally OK with your change
> >> above. But could you please test your patch with lockdep enabled and
> >> confirm that no lock warning is observed? Then I'd like to add it to
> >> my test directory.
> >>
> > OK, I will test this scenario.
>
>
> Any progress?
>
>
> Coly Li
>
