On 10/31/24 7:54 AM, Jens Axboe wrote:
> On 10/31/24 5:20 AM, Peter Mann wrote:
>> Hello,
>>
>> it appears that there is a high probability of a deadlock occurring when performing fsfreeze on a filesystem which is currently performing multiple io_uring O_DIRECT writes.
>>
>> Steps to reproduce:
>> 1. Mount an xfs or ext4 filesystem on /mnt
>>
>> 2. Start writing to the filesystem. Must use io_uring, direct IO and iodepth > 1 to reproduce:
>> fio --ioengine=io_uring --direct=1 --bs=4k --size=100M --rw=randwrite --loops=100000 --iodepth=32 --name=test --filename=/mnt/fio_test
>>
>> 3. Run this in another shell. For me it deadlocks almost immediately:
>> while true; do fsfreeze -f /mnt/; echo froze; fsfreeze -u /mnt/; echo unfroze; done
>>
>> 4. fsfreeze and all tasks attempting to write to /mnt get stuck:
>> At this point none of the stuck processes can be killed by SIGKILL; they are stuck in uninterruptible sleep.
>> If you try 'touch /mnt/a', for example, the new process gets stuck in exactly the same way.
>>
>> This gets printed when running 6.11.4 with some debug options enabled:
>> [ 539.586122] Showing all locks held in the system:
>> [ 539.612972] 1 lock held by khungtaskd/35:
>> [ 539.626204]  #0: ffffffffb3b1c100 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x32/0x1e0
>> [ 539.640561] 1 lock held by dmesg/640:
>> [ 539.654282]  #0: ffff9fd541a8e0e0 (&user->lock){+.+.}-{3:3}, at: devkmsg_read+0x74/0x2d0
>> [ 539.669220] 2 locks held by fio/647:
>> [ 539.684253]  #0: ffff9fd54fe720b0 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x5c2/0x820
>> [ 539.699565]  #1: ffff9fd541a8d450 (sb_writers#15){++++}-{0:0}, at: io_issue_sqe+0x9c/0x780
>> [ 539.715587] 2 locks held by fio/648:
>> [ 539.732293]  #0: ffff9fd54fe710b0 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x5c2/0x820
>> [ 539.749121]  #1: ffff9fd541a8d450 (sb_writers#15){++++}-{0:0}, at: io_issue_sqe+0x9c/0x780
>> [ 539.765484] 2 locks held by fio/649:
>> [ 539.781483]  #0: ffff9fd541a8f0b0 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x5c2/0x820
>> [ 539.798785]  #1: ffff9fd541a8d450 (sb_writers#15){++++}-{0:0}, at: io_issue_sqe+0x9c/0x780
>> [ 539.815466] 2 locks held by fio/650:
>> [ 539.831966]  #0: ffff9fd54fe740b0 (&ctx->uring_lock){+.+.}-{3:3}, at: __do_sys_io_uring_enter+0x5c2/0x820
>> [ 539.849527]  #1: ffff9fd541a8d450 (sb_writers#15){++++}-{0:0}, at: io_issue_sqe+0x9c/0x780
>> [ 539.867469] 1 lock held by fsfreeze/696:
>> [ 539.884565]  #0: ffff9fd541a8d450 (sb_writers#15){++++}-{0:0}, at: freeze_super+0x20a/0x600
>>
>> I reproduced this bug on nvme, sata ssd, virtio disks and lvm logical volumes.
>> It deadlocks on all kernels that I tried (all on amd64):
>> 6.12-rc5 (compiled from kernel.org)
>> 6.11.4 (compiled from kernel.org)
>> 6.10.11-1~bpo12+1 (debian)
>> 6.1.0-23 (debian)
>> 5.14.0-427.40.1.el9_4.x86_64 (rocky linux)
>> 5.10.0-33-amd64 (debian)
>>
>> I tried to compile some older ones to check if it's a regression, but
>> those either didn't compile or didn't boot in my VM, sorry about that.
>> If you have anything specific for me to try, I'm happy to help.
>>
>> Found this issue as well, so it seems like it's not just me:
>> https://gitlab.com/qemu-project/qemu/-/issues/881
>> Note that mariadb 10.6 adds support for io_uring, and that proxmox backups perform fsfreeze in the guest VM.
>>
>> Originally I discovered this after a scheduled lvm snapshot of mariadb
>> got stuck.
>> It appears that lvm calls dm_suspend, which then calls
>> freeze_super, so it looks like the same bug to me. I discovered the
>> simpler fsfreeze/fio reproduction method when I tried to find a
>> workaround.
>
> Thanks for the report! I'm pretty sure this is due to the freezing not
> allowing task_work to run, which prevents completions from being run.
> Hence you run into a situation where freezing isn't running the very IO
> completions that will free up the rwsem, with IO issue being stuck on
> the freeze having started.
>
> I'll take a look...

Can you try the below? Probably easiest on 6.12-rc5, as you already tested
that, and it should apply directly.

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 30448f343c7f..ea057ec4365f 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -1013,6 +1013,18 @@ int io_read_mshot(struct io_kiocb *req, unsigned int issue_flags)
 	return IOU_OK;
 }
 
+static bool io_kiocb_start_write(struct io_kiocb *req, struct kiocb *kiocb)
+{
+	if (!(req->flags & REQ_F_ISREG))
+		return true;
+	if (!(kiocb->ki_flags & IOCB_NOWAIT)) {
+		kiocb_start_write(kiocb);
+		return true;
+	}
+
+	return sb_start_write_trylock(file_inode(kiocb->ki_filp)->i_sb);
+}
+
 int io_write(struct io_kiocb *req, unsigned int issue_flags)
 {
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
@@ -1050,8 +1062,8 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
 	if (unlikely(ret))
 		return ret;
 
-	if (req->flags & REQ_F_ISREG)
-		kiocb_start_write(kiocb);
+	if (unlikely(!io_kiocb_start_write(req, kiocb)))
+		return -EAGAIN;
 	kiocb->ki_flags |= IOCB_WRITE;
 
 	if (likely(req->file->f_op->write_iter))

-- 
Jens Axboe
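For anyone who wants to exercise the patch on a machine without fio, below is a rough liburing-based stand-in for the fio job from the report: it keeps 32 4k O_DIRECT writes in flight against /mnt/fio_test indefinitely while the fsfreeze loop from step 3 runs in another shell. This is only an illustrative sketch, not a verified reproducer; the file path, block size, and queue depth are taken from the report, and everything else (the random-offset resubmit loop in particular) is assumed. Build with something like: gcc -O2 repro.c -o repro -luring

/*
 * repro.c - keep QD O_DIRECT writes in flight via io_uring, forever.
 * Untested sketch approximating the fio job in the report above.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <liburing.h>

#define QD		32				/* matches --iodepth=32 */
#define BS		4096				/* matches --bs=4k */
#define NR_BLOCKS	(100 * 1024 * 1024 / BS)	/* matches --size=100M */

int main(void)
{
	struct io_uring ring;
	void *buf;
	int fd, i;

	fd = open("/mnt/fio_test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* O_DIRECT needs a block-aligned buffer. */
	if (posix_memalign(&buf, BS, BS))
		return 1;
	memset(buf, 0xaa, BS);

	if (io_uring_queue_init(QD, &ring, 0))
		return 1;

	/* Fill the queue once with random 4k writes. */
	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_write(sqe, fd, buf, BS,
				    (__u64)(rand() % NR_BLOCKS) * BS);
	}
	io_uring_submit(&ring);

	/* Reap one completion at a time and immediately resubmit another
	 * random write, so the queue depth stays at QD. */
	for (;;) {
		struct io_uring_cqe *cqe;
		struct io_uring_sqe *sqe;

		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		if (cqe->res < 0)
			fprintf(stderr, "write: %s\n", strerror(-cqe->res));
		io_uring_cqe_seen(&ring, cqe);

		sqe = io_uring_get_sqe(&ring);
		if (!sqe)
			break;
		io_uring_prep_write(sqe, fd, buf, BS,
				    (__u64)(rand() % NR_BLOCKS) * BS);
		io_uring_submit(&ring);
	}
	return 0;
}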