Re: [PATCH v2 2/2] loop: use task_work for autoclear operation

On Tue 11-01-22 00:08:56, Tetsuo Handa wrote:
> On 2022/01/10 22:42, Jan Kara wrote:
> > a) We didn't fully establish a real deadlock scenario from the lockdep
> > report, did we? The lockdep report involved suspend locks, some locks on
> > accessing files in /proc, etc., and it was not clear whether it all
> > reflects a real deadlock possibility or just the fact that lockdep
> > tracking is rather coarse-grained at times. Now, the lockdep report is
> > unpleasant and the loop device locking was ugly anyway, so your async
> > change made sense, but once lockdep is silenced we should really
> > establish whether there is a real deadlock and more work is needed or not.
> 
> Not /proc files but the /sys/power/resume file.
> Here is a reproducer, but I can't test whether we can trigger a real deadlock.
> 
> ----------------------------------------
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <sys/ioctl.h>
> #include <linux/loop.h>
> #include <sys/sendfile.h>
> 
> int main(int argc, char *argv[])
> {
> 	// Run as root on a kernel with lockdep (CONFIG_PROVE_LOCKING) enabled.
> 	const int file_fd = open("testfile", O_RDWR | O_CREAT, 0600);
> 	ftruncate(file_fd, 1048576);
> 	char filename[128] = { };
> 	// Access mode 3 is the Linux-specific "ioctl-only" open mode.
> 	const int loop_num = ioctl(open("/dev/loop-control", 3), LOOP_CTL_GET_FREE, 0);
> 	snprintf(filename, sizeof(filename) - 1, "/dev/loop%d", loop_num);
> 	const int loop_fd_1 = open(filename, O_RDWR);
> 	ioctl(loop_fd_1, LOOP_SET_FD, file_fd); // Attach the backing file.
> 	const int loop_fd_2 = open(filename, O_RDWR); // Hold an extra reference.
> 	ioctl(loop_fd_1, LOOP_CLR_FD, 0); // Deferred while loop_fd_2 is open.
> 	const int sysfs_fd = open("/sys/power/resume", O_RDWR);
> 	// Build lock dependencies between the sysfs file, the backing file
> 	// and the loop device.
> 	sendfile(file_fd, sysfs_fd, 0, 1048576);
> 	sendfile(loop_fd_2, file_fd, 0, 1048576);
> 	write(sysfs_fd, "700", 3); // Hex dev_t 0x700 = (7, 0), i.e. /dev/loop0.
> 	close(loop_fd_2);
> 	close(loop_fd_1); // Last reference dropped; lockdep complains here.
> 	close(file_fd);
> 	return 0;
> }
> ----------------------------------------

Thanks for the reproducer. I will try to simplify it even more and figure
out whether there is a real deadlock potential in the lockdep complaint or
not...

> > b) Unless we have a realistic plan of moving *all* blk_mq_freeze_queue()
> > calls from under open_mutex in the loop driver, moving stuff "where
> > possible" from under open_mutex is IMO just muddying the water.
> 
> As far as I know, lo_open() and lo_release() are the only functions in the
> loop driver that are called with disk->open_mutex held. And only lo_release()
> calls blk_mq_freeze_queue() with disk->open_mutex held.
> 
> The blk_mq_freeze_queue() called from __loop_clr_fd() from lo_release() (I
> mean, the autoclear part) can be done without disk->open_mutex held if we
> call __loop_clr_fd() from task work context. And I think the
> blk_mq_freeze_queue() called directly from lo_release() (I mean, the
> "if (lo->lo_state == Lo_bound) { }" part) can be done in the same manner.
> 
> Therefore, I think this is not a "partial move" but a "complete move".

OK, fair point.
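
For reference, here is a minimal sketch of the task-work deferral being
discussed. It assumes a "struct callback_head rundown_work" member added to
struct loop_device; loop_schedule_rundown() is the name used in the patch,
the rest is illustrative:

----------------------------------------
#include <linux/task_work.h>

static void loop_rundown_workfn(struct callback_head *callback)
{
	struct loop_device *lo = container_of(callback, struct loop_device,
					      rundown_work);

	/*
	 * Runs when the closing task returns to userspace, with no
	 * disk->open_mutex held, so blk_mq_freeze_queue() is safe here.
	 */
	__loop_clr_fd(lo);
}

static void loop_schedule_rundown(struct loop_device *lo)
{
	/* Called from lo_release() with disk->open_mutex held. */
	init_task_work(&lo->rundown_work, loop_rundown_workfn);
	if (task_work_add(current, &lo->rundown_work, TWA_RESUME)) {
		/* current is exiting; a workqueue fallback would be needed. */
	}
}
----------------------------------------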

> >>>                               I mean using task work in
> >>> loop_schedule_rundown() makes a lot of sense because the loop
> >>>
> >>> while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >>>
> >>> will not fail because of dangling work in the workqueue after umount ->
> >>> __loop_clr_fd().
> >>
> >> Using task work from lo_release() is for handling the close() => umount()
> >> sequence. Using task work in lo_open() is for handling the close() =>
> >> open() sequence.
> >> The mount in
> >>
> >>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >>
> >> fails unless lo_open() waits for __loop_clr_fd() to complete.
> > 
> > Why? If we use task work, we are guaranteed the loop device is cleaned up
> > before umount returns unless some other process also has the loop device
> > open.
> 
> Since systemd-udevd opens this loop device asynchronously,
> __loop_clr_fd() from lo_release() is called not by mount or umount but by
> systemd-udevd. This means that we are not guaranteed that the loop device
> is cleaned up before umount returns.

OK, understood. Thanks for the explanation.
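
So, as I understand it, the interleaving looks roughly like this:

----------------------------------------
mount:          open /dev/loop0, LOOP_SET_FD, mount(2), close
systemd-udevd:  open /dev/loop0 in response to the uevent  <- extra reference
umount:         umount(2), close -> lo_refcnt still > 0, rundown deferred
systemd-udevd:  close -> last reference, lo_release() -> __loop_clr_fd()
----------------------------------------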

...
> >>>                           What your waiting in lo_open() achieves is only
> >>> that if __loop_clr_fd() from systemd-udevd happens to run at the same time
> >>> as lo_open() from mount, then we won't see EBUSY.
> >>
> >> My waiting in lo_open() is there to fix a regression where
> >>
> >>   while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> >>
> >> started to fail.
> > 
> > OK, so you claim this loop fails even when using task work in
> > __loop_clr_fd(), correct?
> 
> Correct. If we run this loop with
> https://lkml.kernel.org/r/9eff2034-2f32-54a3-e476-d0f609ab49c0@xxxxxxxxxxxxxxxxxxx ,
> we get errors like the ones below.
> 
> ----------------------------------------
> root@fuzz:/mnt# while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
> mount: /mnt/isofs: can't read superblock on /dev/loop0.
> umount: isofs/: not mounted.
> ----------------------------------------

OK, so I was still wondering why this happens, and now, after poking a bit
more into util-linux, I think I understand. As you say, someone
(systemd-udevd in our case) has /dev/loop0 open when umount isofs/ runs, so
/dev/loop0 stays alive after umount. Then mount runs, sees /dev/loop0 is
still attached to isofs.iso, and wants to reuse it for the mount, but before
it gets to opening the loop device in mnt_context_setup_loopdev(), the loop
device is already cleaned up. The open still succeeds, but because the
backing file is detached, mount(2) fails as soon as it tries to do any IO.

I don't think we can fully solve this race in the kernel and IMO it is
futile to try - depending on when exactly systemd-udevd decides to close
/dev/loop0 and on how exactly mount decides to implement its loop device
reuse, different strategies will or will not work. But there is one subtlety
we do need to solve - when __loop_clr_fd() is run outside of
disk->open_mutex, it can happen that the next lo_open() runs before
__loop_clr_fd(). Thus we can hand userspace a file descriptor for a loop
device that is half torn down and will be cleaned up at some uncontrollable
point in the future, which can lead to interesting consequences if the
device is also in use at that time.

Perhaps we can fix these problems by checking lo_refcnt in __loop_clr_fd()
once we grab lo_mutex, and thus make sure the device should still be
destroyed?
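
Roughly like this (a sketch only; the real __loop_clr_fd() takes more
arguments and the teardown itself is omitted):

----------------------------------------
static void __loop_clr_fd(struct loop_device *lo)
{
	mutex_lock(&lo->lo_mutex);
	/*
	 * If somebody re-opened the device after the rundown was queued,
	 * the device should not be destroyed anymore; bail out.
	 */
	if (atomic_read(&lo->lo_refcnt) > 0) {
		mutex_unlock(&lo->lo_mutex);
		return;
	}
	lo->lo_state = Lo_rundown;
	mutex_unlock(&lo->lo_mutex);

	/*
	 * ... the actual teardown: blk_mq_freeze_queue(), detach the
	 * backing file, reset the state to Lo_unbound, ...
	 */
}
----------------------------------------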

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR


