On 2022/03/15 17:44, Christoph Hellwig wrote:
> On Mon, Mar 14, 2022 at 04:23:18PM +0100, Jan Kara wrote:
>> Honestly, the anon inode trick makes the code pretty much unreadable. As
>> far as I remember Christoph was not in principle against using task_work.
>> He just wanted the task_work API to be improved so that it is easier to
>> use.
>
> This whole patch is awful. And no, I don't think task_work_add really has
> any business here. We need to properly unwind the mess instead of piling
> things higher and higher.

How do you plan to unwind the mess?

On 2022/03/15 0:23, Jan Kara wrote:
> On Mon 14-03-22 00:15:22, Tetsuo Handa wrote:
>> syzbot is reporting a circular locking problem at __loop_clr_fd() [1],
>> because commit 87579e9b7d8dc36e ("loop: use worker per cgroup instead of
>> kworker") calls destroy_workqueue() with disk->open_mutex held.
>>
>> Commit 322c4293ecc58110 ("loop: make autoclear operation asynchronous")
>> tried to fix this problem by deferring __loop_clr_fd() from lo_release()
>> to a WQ context, but that commit was reverted because there are userspace
>> programs which depend on __loop_clr_fd() completing by the moment close()
>> returns to user mode.
>>
>> This time, we try to fix this problem by deferring __loop_clr_fd() from
>> lo_release() to a task work context. Since task_work_add() is not
>> exported and Christoph does not want to export it, this patch uses the
>> anonymous inode interface as a wrapper for calling task_work_add() via
>> fput().
>>
>> Although Jan found two bugs in /bin/mount, one of which was fixed in the
>> util-linux package [2], I found that not holding lo->lo_mutex in
>> lo_open() causes occasional I/O errors [3] (due to the
>>
>>     if (lo->lo_state != Lo_bound)
>>             return BLK_STS_IOERR;
>>
>> check in loop_queue_rq()) when systemd-udevd opens a loop device and
>> reads from it before loop_configure() updates lo->lo_state to Lo_bound.
>
> Thanks for the detailed explanation here.
> Finally, I can see what is the problem with the open path. A few
> questions here:
>
> 1) Can you please remind me why do we need to also remove the lo_mutex
>    from lo_open()? The original lockdep problem was only for
>    __loop_clr_fd(), wasn't it?

Right, but from the locking dependency chain perspective, holding
lo->lo_mutex in lo_open()/lo_release() is treated as if
lo_open()/lo_release() waits for a flush of I/O with lo->lo_mutex held,
even if it is not lo_open()/lo_release() which actually flushes I/O with
lo->lo_mutex held. We need to avoid holding lo->lo_mutex in
lo_open()/lo_release() while still avoiding the problems caused by not
waiting for loop_configure()/__loop_clr_fd().

> 2) Cannot we just move the generation of DISK_EVENT_MEDIA_CHANGE event
>    after the moment the loop device is configured? That way systemd won't
>    start probing the new loop device until it is fully configured.

Do you mean

--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1024,7 +1024,6 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 		goto out_unlock;
 	}
 
-	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 	set_disk_ro(lo->lo_disk, (lo->lo_flags & LO_FLAGS_READ_ONLY) != 0);
 
 	INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
@@ -1066,6 +1065,7 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	wmb();
 
 	lo->lo_state = Lo_bound;
+	disk_force_media_change(lo->lo_disk, DISK_EVENT_MEDIA_CHANGE);
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;

? Maybe possible, but it looks strange. Such a change would defer only the
open()->read() sequence triggered via kobject_uevent(KOBJ_CHANGE). My
patch defers all open() calls from userspace, like we did until 5.11.

> Because the less hacks with task_work we have the better.
>
> Also I don't think the deadlock is really fixed when you wait for
> __loop_clr_fd() during open. It is maybe just hidden from lockdep.
Since lo_post_open() is called with no locks held, how can waiting for
__loop_clr_fd() after lo_open() cause a deadlock? The reason for choosing
a task_work context is to be able to synchronize without any other locks
held.

Until 5.11, lo_open() and loop_configure() were serialized using
loop_ctl_mutex. In 5.12, lo_open() and loop_configure() started using
lo->lo_mutex, and serialization among multiple loop devices was fixed in
5.13 using loop_validate_mutex. Therefore, we basically had been waiting
for loop_configure() and __loop_clr_fd() during open all along. My patch
just moves the wait for loop_configure() and __loop_clr_fd() in open
outside of system_transition_mutex/1 and disk->open_mutex.

> But the fundamental problem of the cycle syzbot constructed is that
> flushing of IO to a loop device may end up depending on open of that
> loop device to finish.

How can my patch make flushing of I/O to a loop device depend on open of
that loop device? It runs in a task_work context.

> So if pending __loop_clr_fd() is a problem for somebody (is it?), we
> still need to let open finish and deal with waiting for __loop_clr_fd()
> in other places where that actually causes problem (and make sure
> Lo_rundown is either treated as Lo_unbound or wait for Lo_rundown to
> transition to Lo_unbound).

lo_post_open() lets open finish and then deals with waiting for
__loop_clr_fd().
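
For reference, the anonymous inode wrapper I described boils down to
something like the sketch below. This is illustrative only, not the actual
patch; names such as loop_rundown_release() and loop_schedule_rundown()
are mine. The point is that fput() on the last reference queues the final
__fput() via task_work_add(), so the file's ->release() callback (and
therefore __loop_clr_fd()) runs from the task work context when the task
returns to user mode, with no locks held:

  #include <linux/anon_inodes.h>
  #include <linux/file.h>

  static int loop_rundown_release(struct inode *inode, struct file *file)
  {
  	struct loop_device *lo = file->private_data;

  	/* Runs from task_work on return to user mode; no locks held. */
  	__loop_clr_fd(lo);
  	return 0;
  }

  static const struct file_operations loop_rundown_fops = {
  	.release = loop_rundown_release,
  };

  static void loop_schedule_rundown(struct loop_device *lo)
  {
  	struct file *file;

  	file = anon_inode_getfile("loop-rundown", &loop_rundown_fops,
  				  lo, O_CLOEXEC);
  	if (!IS_ERR(file))
  		fput(file);	/* last ref: defers ->release() to task_work */
  }

That is the whole trick: no export of task_work_add() is needed, because
fput() already does the task_work_add() for us.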