Re: [PATCH v2] loop: don't hold lo->lo_mutex from lo_open() and lo_release()

Jan Kara <jack@xxxxxxx> · Wed, 16 Mar 2022 13:26:14 +0100

On Tue 15-03-22 19:57:25, Tetsuo Handa wrote:
> On 2022/03/15 17:44, Christoph Hellwig wrote:
> On 2022/03/15 0:23, Jan Kara wrote:
> > On Mon 14-03-22 00:15:22, Tetsuo Handa wrote:
> >> syzbot is reporting circular locking problem at __loop_clr_fd() [1], for
> >> commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of kworker")
> >> is calling destroy_workqueue() with disk->open_mutex held.
> >>
> >> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
> >> tried to fix this problem by deferring __loop_clr_fd() from lo_release()
> >> to a WQ context, but that commit was reverted because there are userspace
> >> programs which depend on that __loop_clr_fd() completes by the moment
> >> close() returns to user mode.
> >>
> >> This time, we try to fix this problem by deferring __loop_clr_fd() from
> >> lo_release() to a task work context. Since task_work_add() is not exported
> >> but Christoph does not want to export it, this patch uses anonymous inode
> >> interface as a wrapper for calling task_work_add() via fput().
> >>
> >> Although Jan has found two bugs in /bin/mount where one of these was fixed
> >> in util-linux package [2], I found that not holding lo->lo_mutex from
> >> lo_open() causes occasional I/O error [3] (due to
> >>
> >> 	if (lo->lo_state != Lo_bound)
> >> 		return BLK_STS_IOERR;
> >>
> >> check in loop_queue_rq()) when systemd-udevd opens a loop device and
> >> reads from it before loop_configure() updates lo->lo_state to Lo_bound.
> > 
> > Thanks for detailed explanation here. Finally, I can see what is the
> > problem with the open path. A few questions here:
> > 
> > 1) Can you please remind me why do we need to also remove the lo_mutex from
> > lo_open()? The original lockdep problem was only for __loop_clr_fd(),
> > wasn't it?
> 
> Right, but from locking dependency chain perspective, holding
> lo->lo_mutex from lo_open()/lo_release() is treated as if
> lo_open()/lo_release() waits for flush of I/O with lo->lo_mutex held,
> even if it is not lo_open()/lo_release() which actually flushes I/O with
> lo->lo_mutex held.
> 
> We need to avoid holding lo->lo_mutex from lo_open()/lo_release() while
> avoiding problems caused by not waiting for loop_configure()/__loop_clr_fd().

I thought the whole point of these patches is to move waiting for IO from
under lo->lo_mutex and disk->open_mutex? And if we do that, there's no
problem with open needing either of these mutexes? What am I missing?

Anyway, I have no problem with removing lo->lo_mutex from lo_open() as e.g.
Christoph done it. I just disliked the task work magic you did there...

> > 2) Cannot we just move the generation of DISK_EVENT_MEDIA_CHANGE event
> > after the moment the loop device is configured? That way systemd won't
> > start probing the new loop device until it is fully configured.
> 
> Do you mean
> 
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1024,7 +1024,6 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>                 goto out_unlock;
>         }
> 
> -       disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
>         set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);
> 
>         INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
> @@ -1066,6 +1065,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
>         wmb();
> 
>         lo->lo_state = Lo_bound;
> +       disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
>         if (part_shift)
>                 lo->lo_flags |= LO_FLAGS_PARTSCAN;
>         partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
> 
> ? Maybe possible, but looks strange.

Yes, something like that. But disk_force_media_change() has some
invalidation in it as well and that may need to stay in the original place
- needs a bit more thought & validation.

> Such change will defer only open()->read() sequence triggered via
> kobject_uevent(KOBJ_CHANGE). My patch defers all open() from userspace,
> like we did until 5.11.

Correct. But EIO error is the correct return if you access unbound loop
device. It is only after we tell userspace the device exists (i.e., after
we send the event), when we should better make sure the IO succeeds.

> > Because the less hacks with task_work we have the better.
> > 
> > Also I don't think the deadlock is really fixed when you wait for
> > __loop_clr_fd() during open. It is maybe just hidden from lockdep.
> 
> Since lo_post_open() is called with no locks held, how can waiting
> for __loop_clr_fd() after lo_open() can cause deadlock? The reason to
> choose task_work context is to synchronize without any other locks held.

Sorry, I was wrong here. It will not cause deadlock. But in-kernel openers
of the block device like swsusp_check() in kernel/power/swap.c (using
blkdev_get_by_dev()) will be now operating on a loop device without waiting
for __loop_clr_fd(). Which is I guess fine but the inconsistency between
in-kernel and userspace bdev openers would be a bit weird. Anyway,
Christophs patches look like a cleaner way forward to me...

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR