On 2022/03/15 17:44, Christoph Hellwig wrote:
> On Mon, Mar 14, 2022 at 04:23:18PM +0100, Jan Kara wrote:
>> Honestly, the anon inode trick makes the code pretty much unreadable. As
>> far as I remember Christoph was not in principle against using task_work.
>> He just wanted the task_work API to be improved so that it is easier to
>> use.
>
> This whole patch is awful. And no, I don't think task_work_add really has
> any business here. We need to properly unwind the mess instead of piling
> things higher and higher.

How do you plan to unwind the mess?

On 2022/03/15 0:23, Jan Kara wrote:
> On Mon 14-03-22 00:15:22, Tetsuo Handa wrote:
>> syzbot is reporting a circular locking problem at __loop_clr_fd() [1],
>> because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
>> kworker") calls destroy_workqueue() with disk->open_mutex held.
>>
>> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
>> tried to fix this problem by deferring __loop_clr_fd() from lo_release()
>> to a WQ context, but that commit was reverted because there are userspace
>> programs which depend on __loop_clr_fd() completing by the moment close()
>> returns to user mode.
>>
>> This time, we try to fix this problem by deferring __loop_clr_fd() from
>> lo_release() to a task work context. Since task_work_add() is not
>> exported and Christoph does not want to export it, this patch uses the
>> anonymous inode interface as a wrapper for calling task_work_add() via
>> fput().
>>
>> Although Jan found two bugs in /bin/mount, one of which was fixed in the
>> util-linux package [2], I found that not holding lo->lo_mutex in
>> lo_open() causes occasional I/O errors [3] (due to the
>>
>>     if (lo->lo_state != Lo_bound)
>>             return BLK_STS_IOERR;
>>
>> check in loop_queue_rq()) when systemd-udevd opens a loop device and
>> reads from it before loop_configure() updates lo->lo_state to Lo_bound.
>
> Thanks for the detailed explanation here.
> Finally, I can see what is the problem with the open path. A few
> questions here:
>
> 1) Can you please remind me why do we need to also remove the lo_mutex
>    from lo_open()? The original lockdep problem was only for
>    __loop_clr_fd(), wasn't it?

Right, but from the locking dependency chain perspective, holding
lo->lo_mutex in lo_open()/lo_release() is treated as if
lo_open()/lo_release() waits for a flush of I/O with lo->lo_mutex held,
even if it is not lo_open()/lo_release() which actually flushes I/O with
lo->lo_mutex held. We need to avoid holding lo->lo_mutex in
lo_open()/lo_release() while still avoiding the problems caused by not
waiting for loop_configure()/__loop_clr_fd().

> 2) Cannot we just move the generation of DISK_EVENT_MEDIA_CHANGE event
>    after the moment the loop device is configured? That way systemd won't
>    start probing the new loop device until it is fully configured.

Do you mean

--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1024,7 +1024,6 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 		goto out_unlock;
 	}
 
-	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 	set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);
 
 	INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
@@ -1066,6 +1065,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	wmb();
 
 	lo->lo_state = Lo_bound;
+	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;

? Maybe possible, but it looks strange. Such a change would defer only the
open()->read() sequence triggered via kobject_uevent(KOBJ_CHANGE). My
patch defers all open() calls from userspace, like we did until 5.11.

> Because the less hacks with task_work we have the better.
>
> Also I don't think the deadlock is really fixed when you wait for
> __loop_clr_fd() during open. It is maybe just hidden from lockdep.
Since lo_post_open() is called with no locks held, how can waiting for
__loop_clr_fd() after lo_open() cause a deadlock? The reason for choosing
a task_work context is to be able to synchronize without any other locks
held.

Until 5.11, lo_open() and loop_configure() were serialized using
loop_ctl_mutex. In 5.12, lo_open() and loop_configure() started using
lo->lo_mutex, and serialization among multiple loop devices was fixed in
5.13 using loop_validate_mutex. Therefore, we basically had been waiting
for loop_configure() and __loop_clr_fd() during open all along. My patch
just moves the wait for loop_configure() and __loop_clr_fd() in open
outside of system_transition_mutex/1 and disk->open_mutex.

> But the fundamental problem of the cycle syzbot constructed is that
> flushing of IO to a loop device may end up depending on open of that
> loop device to finish.

How can my patch make flushing of I/O to a loop device depend on open of
that loop device? It runs in a task_work context.

> So if pending __loop_clr_fd() is a problem for somebody (is it?), we
> still need to let open finish and deal with waiting for __loop_clr_fd()
> in other places where that actually causes problem (and make sure
> Lo_rundown is either treated as Lo_unbound or wait for Lo_rundown to
> transition to Lo_unbound).

lo_post_open() lets open finish and then deals with waiting for
__loop_clr_fd().
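
For reference, the anonymous inode wrapper I described boils down to
something like the sketch below. This is illustrative only, not the actual
patch; names such as loop_rundown_release() and loop_schedule_rundown()
are mine. The point is that fput() on the last reference queues the final
__fput() via task_work_add(), so the file's ->release() callback (and
therefore __loop_clr_fd()) runs from the task work context when the task
returns to user mode, with no locks held:

  #include <linux/anon_inodes.h>
  #include <linux/file.h>

  static int loop_rundown_release(struct inode *inode, struct file *file)
  {
  	struct loop_device *lo = file->private_data;

  	/* Runs from task_work on return to user mode; no locks held. */
  	__loop_clr_fd(lo);
  	return 0;
  }

  static const struct file_operations loop_rundown_fops = {
  	.release = loop_rundown_release,
  };

  static void loop_schedule_rundown(struct loop_device *lo)
  {
  	struct file *file;

  	file = anon_inode_getfile("loop-rundown", &loop_rundown_fops,
  				  lo, O_CLOEXEC);
  	if (!IS_ERR(file))
  		fput(file);	/* last ref: defers ->release() to task_work */
  }

That is the whole trick: no export of task_work_add() is needed, because
fput() already does the task_work_add() for us.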