On 2022/01/10 22:42, Jan Kara wrote:
> a) We didn't fully establish a real deadlock scenario from the lockdep
> report, did we? The lockdep report involved suspend locks, some locks on
> accessing files in /proc etc. and it was not clear whether it all reflects
> a real deadlock possibility or just a fact that lockdep tracking is rather
> coarse-grained at times. Now lockdep report is unpleasant and loop device
> locking was ugly anyway so your async change made sense but once lockdep is
> silenced we should really establish whether there is real deadlock and more
> work is needed or not.

Not /proc files but the /sys/power/resume file. Here is a reproducer, but I
was not able to confirm whether it can trigger a real deadlock.

----------------------------------------
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>
#include <sys/sendfile.h>

int main(int argc, char *argv[])
{
        /* Create a 1 MB backing file. */
        const int file_fd = open("testfile", O_RDWR | O_CREAT, 0600);
        ftruncate(file_fd, 1048576);
        /* Grab a free loop device and attach the backing file to it. */
        char filename[128] = { };
        const int loop_num = ioctl(open("/dev/loop-control", 3), LOOP_CTL_GET_FREE, 0);
        snprintf(filename, sizeof(filename) - 1, "/dev/loop%d", loop_num);
        const int loop_fd_1 = open(filename, O_RDWR);
        ioctl(loop_fd_1, LOOP_SET_FD, file_fd);
        /* Open the loop device again, then request detach via the first fd. */
        const int loop_fd_2 = open(filename, O_RDWR);
        ioctl(loop_fd_1, LOOP_CLR_FD, 0);
        /* Mix in /sys/power/resume accesses so that lockdep records the dependencies. */
        const int sysfs_fd = open("/sys/power/resume", O_RDWR);
        sendfile(file_fd, sysfs_fd, 0, 1048576);
        sendfile(loop_fd_2, file_fd, 0, 1048576);
        write(sysfs_fd, "700", 3);
        close(loop_fd_2);
        close(loop_fd_1); // Lockdep complains.
        close(file_fd);
        return 0;
}
----------------------------------------

Since so many locks are taken while disk->open_mutex is held, disk->open_mutex
has become something of a dependency aggregation hub.
----------------------------------------
ffffffff852c9920 OPS: 963 FD: 315 BD: 4 +.+.: &disk->open_mutex
 -> [ffffffffa01c0a48] sd_ref_mutex
 -> [ffffffff831bda60] fs_reclaim
 -> [ffffffff8529b260] &n->list_lock
 -> [ffffffff83330988] depot_lock
 -> [ffffffff8312bb08] tk_core.seq.seqcount
 -> [ffffffff83d85000] &rq->__lock
 -> [ffffffff852ca120] &obj_hash[i].lock
 -> [ffffffff852c8e60] &hctx->lock
 -> [ffffffff852c8e00] &x->wait#20
 -> [ffffffff85283d60] &base->lock
 -> [ffffffff85283ce0] (&timer.timer)
 -> [ffff88811aa44b60] lock#2
 -> [ffffffff85291860] &____s->seqcount
 -> [ffffffff852c8c00] &q->sysfs_dir_lock
 -> [ffffffff852c8160] &bdev->bd_size_lock
 -> [ffffffff831bc378] free_vmap_area_lock
 -> [ffffffff831bc318] vmap_area_lock
 -> [ffffffff852a8c00] &xa->xa_lock#3
 -> [ffff88811aa44390] lock#6
 -> [ffffffff852a8bf0] &mapping->private_lock
 -> [ffffffff852c9f20] &dd->lock
 -> [ffffffff8528fb40] &folio_wait_table[i]
 -> [ffffffff82ef6c18] (console_sem).lock
 -> [ffffffff8330b708] &sb->s_type->i_lock_key#3
 -> [ffffffff852a5720] &s->s_inode_list_lock
 -> [ffffffff831a9508] pcpu_alloc_mutex
 -> [ffffffff8544e5c0] &x->wait#9
 -> [ffffffff8543b160] &k->list_lock
 -> [ffff88811aa459c0] lock#3
 -> [ffffffff852adbc0] &root->kernfs_rwsem
 -> [ffffffff83356b50] bus_type_sem
 -> [ffffffff8321e358] sysfs_symlink_target_lock
 -> [ffffffff8544e020] &dev->power.lock
 -> [ffffffff833d75e8] dpm_list_mtx
 -> [ffffffff833d4b78] req_lock
 -> [ffffffff83d80860] &p->pi_lock
 -> [ffffffff8544e420] &x->wait#10
 -> [ffffffff8543b140] &k->k_lock
 -> [ffffffff852c9960] subsys mutex#47
 -> [ffffffff852c9980] &xa->xa_lock#5
 -> [ffffffff82e14698] inode_hash_lock
 -> [ffffffff831bc538] purge_vmap_area_lock
 -> [ffffffff852900a0] &lruvec->lru_lock
 -> [ffffffff83328c98] pool_lock
 -> [ffffffff834052c8] sr_ref_mutex
 -> [ffffffff852c9a20] &ev->block_mutex
 -> [ffffffff852c9a00] &ev->lock
 -> [ffffffff831e6ed8] sb_lock
 -> [ffffffff8544fae0] &cd->lock
 -> [ffffffff82e14818] bdev_lock
 -> [ffffffffa03d10c0] &lo->lo_mutex
 -> [ffffffff83d86320] &lock->wait_lock
 -> [ffffffff83195368] lock#5
 -> [ffffffff85290d60] &wb->list_lock
----------------------------------------

> b) Unless we have a realistic plan of moving *all* blk_mq_freeze_queue()
> calls from under open_mutex in loop driver, moving stuff "where possible"
> from under open_mutex is IMO just muddying water.

As far as I know, lo_open() and lo_release() are the only functions in the
loop driver that are called with disk->open_mutex held, and only lo_release()
calls blk_mq_freeze_queue() with disk->open_mutex held. The
blk_mq_freeze_queue() reached via __loop_clr_fd() from lo_release() (that is,
the autoclear part) can be done without disk->open_mutex held if we call
__loop_clr_fd() from task work context. And I think the blk_mq_freeze_queue()
called directly from lo_release() (that is, the "if (lo->lo_state == Lo_bound)
{ }" part) can be handled in the same manner. Therefore, I think this is not a
"partial move" but a "complete move".
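To illustrate that direction, here is a rough sketch of deferring the rundown
via the task_work API so that __loop_clr_fd() runs after lo_release() has
returned and disk->open_mutex has been dropped. Note that "rundown_work" and
loop_schedule_rundown() are made-up names for this illustration, the call to
__loop_clr_fd() is simplified, and a real patch would need a fallback (e.g. a
workqueue) for the case where task_work_add() fails:

----------------------------------------
#include <linux/task_work.h>

/* Sketch only: "rundown_work" would be a struct callback_head added to
 * struct loop_device. */
static void loop_rundown_workfn(struct callback_head *cb)
{
        struct loop_device *lo = container_of(cb, struct loop_device,
                                              rundown_work);

        /* Runs when the closing task returns to user space;
         * disk->open_mutex is no longer held here. */
        __loop_clr_fd(lo);
}

static void loop_schedule_rundown(struct loop_device *lo)
{
        init_task_work(&lo->rundown_work, loop_rundown_workfn);
        /* TWA_RESUME: run the callback on the way back to user space. */
        if (task_work_add(current, &lo->rundown_work, TWA_RESUME)) {
                /* current is exiting and cannot run task work; a real
                 * implementation would fall back to a workqueue here. */
        }
}
----------------------------------------

With something like this, blk_mq_freeze_queue() still runs in the context of
the task that did the final close(), but only after open_mutex has been
released, which is why I call it a "complete move".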
>
>>> I mean using task work in
>>> loop_schedule_rundown() makes a lot of sense because the loop
>>>
>>> while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>>>
>>> will not fail because of dangling work in the workqueue after umount ->
>>> __loop_clr_fd().
>>
>> Using task work from lo_release() is for handling close() => umount() sequence.
>> Using task work in lo_open() is for handling close() => open() sequence.
>> The mount in
>>
>> while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>>
>> fails unless lo_open() waits for __loop_clr_fd() to complete.
>
> Why? If we use task work, we are guaranteed the loop device is cleaned up
> before umount returns unless some other process also has the loop device
> open.

Since systemd-udevd opens this loop device asynchronously, __loop_clr_fd()
from lo_release() is called by systemd-udevd, not by mount or umount. This
means we are not guaranteed that the loop device is cleaned up before umount
returns.

>
>>> But when other processes like systemd-udevd start to mess
>>> with the loop device, then you have no control whether the following mount
>>> will see isofs.iso busy or not - it depends on when systemd-udevd decides
>>> to close the loop device.
>>
>> Right. But regarding that part the main problem is that the script is not
>> checking for errors. What we are expected to do is to restore the barrier
>> which existed before commit 322c4293ecc58110 ("loop: make autoclear
>> operation asynchronous").
>
> OK, can you explain what you exactly mean by the barrier?

/bin/mount opens the loop device and reads from it in order to read the
superblock, for guessing/verifying the filesystem type of the backing file.
disk->open_mutex used to block open() of the loop device until the autoclear
triggered by close() from systemd-udevd completed. My patch is intended to
restore/mimic this behavior without holding disk->open_mutex.

> Because in my
> understanding your patch only makes one race somewhat less likely.

Yes, this barrier only makes one race less likely. But we need to try to avoid
hitting that race, because /bin/mount fails when we hit it while reading the
superblock of the backing file.

>
>>
>>> What your waiting in lo_open() achieves is only
>>> that if __loop_clr_fd() from systemd-udevd happens to run at the same time
>>> as lo_open() from mount, then we won't see EBUSY.
>>
>> My waiting in lo_open() is to fix a regression that
>>
>> while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
>>
>> started to fail.
>
> OK, so you claim this loop fails even with using task work in __loop_clr_fd(),
> correct?

Correct. If we run this loop with
https://lkml.kernel.org/r/9eff2034-2f32-54a3-e476-d0f609ab49c0@xxxxxxxxxxxxxxxxxxx
applied, we get errors like the ones below.

----------------------------------------
root@fuzz:/mnt# while :; do mount -o loop,ro isofs.iso isofs/; umount isofs/; done
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
mount: /mnt/isofs: can't read superblock on /dev/loop0.
umount: isofs/: not mounted.
----------------------------------------

----------------------------------------
[ 172.907151] loop0: detected capacity change from 0 to 1992704
[ 173.034450] LoadPin: kernel-module pinning-ignored obj="/usr/lib/modules/5.16.0-rc8-next-20220107+/kernel/fs/isofs/isofs.ko" pid=6124 cmdline="/sbin/modprobe -q -- fs-iso9660"
[ 173.130665] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 173.132227] ISO 9660 Extensions: RRIP_1991A
[ 173.271904] loop0: detected capacity change from 0 to 1992704
[ 173.303460] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 173.304261] ISO 9660 Extensions: RRIP_1991A
[ 173.431772] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 173.435583] ISO 9660 Extensions: RRIP_1991A
[ 173.559952] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 173.561014] ISO 9660 Extensions: RRIP_1991A
[ 173.673018] I/O error, dev loop0, sector 1992576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 173.683062] I/O error, dev loop0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[ 173.685547] EXT4-fs (loop0): unable to read superblock
[ 173.763214] loop0: detected capacity change from 0 to 1992704
[ 173.792790] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 173.802959] ISO 9660 Extensions: RRIP_1991A
[ 173.918140] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 173.920018] ISO 9660 Extensions: RRIP_1991A
[ 173.991678] loop0: detected capacity change from 0 to 1992704
[ 174.019262] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.020914] ISO 9660 Extensions: RRIP_1991A
[ 174.111092] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.111862] ISO 9660 Extensions: RRIP_1991A
[ 174.210594] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.211624] ISO 9660 Extensions: RRIP_1991A
[ 174.310668] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.312334] ISO 9660 Extensions: RRIP_1991A
[ 174.424918] loop0: detected capacity change from 0 to 1992704
[ 174.455070] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.455944] ISO 9660 Extensions: RRIP_1991A
[ 174.540379] loop0: detected capacity change from 0 to 1992704
[ 174.576673] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.578195] ISO 9660 Extensions: RRIP_1991A
[ 174.655124] I/O error, dev loop0, sector 1992576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 174.665039] I/O error, dev loop0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[ 174.667435] EXT4-fs (loop0): unable to read superblock
[ 174.745809] loop0: detected capacity change from 0 to 1992704
[ 174.783917] ISO 9660 Extensions: Microsoft Joliet Level 3
[ 174.789040] ISO 9660 Extensions: RRIP_1991A
[ 174.888963] I/O error, dev loop0, sector 1992576 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 174.896654] I/O error, dev loop0, sector 2 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[ 174.898893] EXT4-fs (loop0): unable to read superblock
----------------------------------------

> And does it fail even without systemd-udev?

I don't have an environment without systemd-udevd. But it probably won't fail
if there is nobody who opens the loop device asynchronously.

>
>>> But IMO that is not worth
>>> the complexity in lo_open() because if systemd-udevd happens to close the
>>> loop device a millisecond later, you will get EBUSY anyway (and you would
>>> get it even in the past). Or am I missing something?
>>
>> Excuse me, but lo_open() does not return EBUSY?
>> What operation are you talking about?
>
> I didn't mean EBUSY from lo_open() but EBUSY from LOOP_SET_FD ioctl(2). But
> maybe I misunderstood the failure. Where exactly does mount(1) fail?
> Because with the options you mention it should do something like:
> ioctl(LOOP_CTL_GET_FREE) -> get free loop dev
> ioctl(LOOP_SET_FD) -> bind isofs.iso to free loop dev
> mount(loop_dev, isofs/)
>
> and I thought it is the second syscall that fails.

/bin/mount is failing at the read() of the superblock after open() of the loop
device.
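To make that concrete, the sequence looks roughly like the sketch below. This
is only an approximation for illustration (error handling trimmed, and the
superblock probe shown as an explicit pread()), not the actual util-linux
code:

----------------------------------------
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <linux/loop.h>

int main(void)
{
        char loopdev[64];
        char sb[4096];

        /* Step 1: ask loop-control for a free loop device. */
        const int ctl_fd = open("/dev/loop-control", O_RDWR);
        const int loop_num = ioctl(ctl_fd, LOOP_CTL_GET_FREE);
        snprintf(loopdev, sizeof(loopdev), "/dev/loop%d", loop_num);

        /* Step 2: bind the backing file to the loop device. */
        const int file_fd = open("isofs.iso", O_RDONLY);
        const int loop_fd = open(loopdev, O_RDWR);
        if (ioctl(loop_fd, LOOP_SET_FD, file_fd))
                perror("LOOP_SET_FD");

        /*
         * Step 3: read the superblock in order to guess/verify the
         * filesystem type.  This is the read that fails with "can't read
         * superblock" while a previous autoclear of the loop device is
         * still in progress.
         */
        if (pread(loop_fd, sb, sizeof(sb), 0) != (ssize_t) sizeof(sb))
                perror("superblock read");

        /* Step 4: the actual mount(2) call. */
        if (mount(loopdev, "isofs/", "iso9660", MS_RDONLY, NULL))
                perror("mount");
        return 0;
}
----------------------------------------

That is, LOOP_SET_FD itself succeeds in the failing runs above; it is the
subsequent read of the superblock that returns an I/O error.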