On Wed 12-01-22 14:16:15, Jan Kara wrote:
> On Tue 11-01-22 00:08:56, Tetsuo Handa wrote:
> > On 2022/01/10 22:42, Jan Kara wrote:
> > > a) We didn't fully establish a real deadlock scenario from the lockdep
> > > report, did we? The lockdep report involved suspend locks, some locks on
> > > accessing files in /proc etc. and it was not clear whether it all reflects
> > > a real deadlock possibility or just a fact that lockdep tracking is rather
> > > coarse-grained at times. Now lockdep report is unpleasant and loop device
> > > locking was ugly anyway so your async change made sense but once lockdep is
> > > silenced we should really establish whether there is real deadlock and more
> > > work is needed or not.
> > 
> > Not /proc files but /sys/power/resume file.
> > Here is a reproducer but I can't test whether we can trigger a real deadlock.
> > 
> > ----------------------------------------
> > #include <stdio.h>
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <unistd.h>
> > #include <sys/ioctl.h>
> > #include <linux/loop.h>
> > #include <sys/sendfile.h>
> > 
> > int main(int argc, char *argv[])
> > {
> > 	const int file_fd = open("testfile", O_RDWR | O_CREAT, 0600);
> > 	ftruncate(file_fd, 1048576);
> > 	char filename[128] = { };
> > 	const int loop_num = ioctl(open("/dev/loop-control", 3), LOOP_CTL_GET_FREE, 0);
> > 	snprintf(filename, sizeof(filename) - 1, "/dev/loop%d", loop_num);
> > 	const int loop_fd_1 = open(filename, O_RDWR);
> > 	ioctl(loop_fd_1, LOOP_SET_FD, file_fd);
> > 	const int loop_fd_2 = open(filename, O_RDWR);
> > 	ioctl(loop_fd_1, LOOP_CLR_FD, 0);
> > 	const int sysfs_fd = open("/sys/power/resume", O_RDWR);
> > 	sendfile(file_fd, sysfs_fd, 0, 1048576);
> > 	sendfile(loop_fd_2, file_fd, 0, 1048576);
> > 	write(sysfs_fd, "700", 3);
> > 	close(loop_fd_2);
> > 	close(loop_fd_1); // Lockdep complains.
> > 	close(file_fd);
> > 	return 0;
> > }
> > ----------------------------------------
> 
> Thanks for the reproducer.
> I will try to simplify it even more and figure
> out whether there is a real deadlock potential in the lockdep complaint or
> not...

OK, so I think I understand the lockdep complaint better. Lockdep
essentially complains about the following scenario:

blkdev_put()
  lock disk->open_mutex
  lo_release
    __loop_clr_fd()
        |
        | wait for IO to complete
        v
loop worker
  write to backing file
    sb_start_write(file)
        |
        | wait for fs with backing file to unfreeze
        v
process holding fs frozen
  freeze_super()
        |
        | wait for remaining writers on the fs so that fs can be frozen
        v
sendfile()
  sb_start_write()
  do_splice_direct()
        |
        | blocked on read from /sys/power/resume, waiting for kernfs file
        | lock
        v
write() "/dev/loop0" (in whatever form) to /sys/power/resume
  calls blkdev_get_by_dev("/dev/loop0")
    lock disk->open_mutex => deadlock

So there are three other ingredients to this locking loop besides loop
device behavior:

1) sysfs files which serialize reads & writes
2) sendfile which holds freeze protection while reading data to write
3) "resume" file behavior opening bdev from write to sysfs file

We cannot sensibly do anything about 1) AFAICT. You cannot get a coherent
view of a sysfs file while it is changing.

We could actually change 2) (to only hold freeze protection while splicing
from the pipe) but that would fix the problem only for sendfile(2). E.g.
splice(2) may also block waiting for data to become available in the pipe
while holding freeze protection. Changing that would require some surgery
in our splice infrastructure and at this point I'm not sure whether we
would not introduce other locking problems due to pipe_lock lock ordering.

For 3), we could consider stuff like not resuming from a loop device or
postponing resume to a workqueue but it all looks ugly.

Maybe the most disputable thing in this locking chain is splicing from
sysfs files. That does not seem terribly useful and, due to the special
locking and behavior of sysfs files, it allows for creating interesting
lock dependencies.
OTOH maybe there is someone out there who (possibly inadvertently through
some library) ends up using splice on sysfs files, so the chances of
userspace breakage, if we disable splice for sysfs, would be
non-negligible. Hum, tough.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR