Hi Josef, Kuai,

Josef, thank you for attaching your patch. No worries about being on vacation; I hope you enjoyed your time off.

Josef, I built your patch on top of 5.18-rc6 with no other patches applied, and ran the testcase from my original message (a sketch of the loop is included at the end of this mail). After 3 loops, a hang occurred, and we see the usual -32 (EPIPE) error:

May 16 03:38:35 focal-nbd kernel: block nbd15: NBD_DISCONNECT
May 16 03:38:35 focal-nbd kernel: block nbd15: Send disconnect failed -32

The hang lasted 30 seconds, no doubt caused by the "30 * HZ" timeout in your patch, and then things started moving forward again:

May 16 03:39:05 focal-nbd kernel: block nbd15: Connection timed out, retrying (0/1 alive)
May 16 03:39:05 focal-nbd kernel: block nbd15: Connection timed out, retrying (0/1 alive)
May 16 03:39:05 focal-nbd kernel: blk_print_req_error: 128 callbacks suppressed
May 16 03:39:05 focal-nbd kernel: I/O error, dev nbd15, sector 1023488 op 0x0:(READ) flags 0x80700 phys_seg 14 prio class 0
May 16 03:39:05 focal-nbd kernel: I/O error, dev nbd15, sector 1023608 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 0
May 16 03:39:05 focal-nbd kernel: block nbd15: Device being setup by another task

Note the timestamp increment of 30s. There was a whole host of I/O errors, and after a few more loops the hang occurred again, again lasting 30 seconds, and then the testcase ran a few more loops before getting stuck once more.

Pastebin of journalctl: https://paste.ubuntu.com/p/Cx6MBC8Vgj/

Unfortunately, your patch doesn't quite solve the issue.

Kuai, I tested your suspicion by building Josef's patch on top of 5.18-rc6 with your patch below applied:

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

The behaviour this time was different from Josef's patch alone. On the very second iteration of the loop, I got a bunch of I/O errors, and the nbd subsystem hung and did not recover. I started getting stuck request messages and the usual hung task timeout oops messages.

Pastebin of journalctl here: https://paste.ubuntu.com/p/C9rjckrWtp/

I went back and did some more testing of Kuai's two commits:

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

nbd: fix io hung while disconnecting device
https://lists.debian.org/nbd/2022/04/msg00207.html

I left the testcase running for about 20 minutes, and it never hung. It did get a bit racy from time to time trying to take the write lock on the qcow2 image, where the disconnect completed after the call to mkfs.ext4 had started, but simply answering "y" let the loop run for another 5 minutes before the race occurred again:

Formatting 'foo.img', fmt=qcow2 size=524288000 cluster_size=65536 lazy_refcounts=off refcount_bits=16
qemu-img: foo.img: Failed to get "write" lock
Is another process using the image [foo.img]?
/dev/nbd15 disconnected
mke2fs 1.45.5 (07-Jan-2020)
/dev/nbd15 contains a ext4 file system labelled 'root'
	created on Mon May 16 05:23:01 2022
Proceed anyway? (y,N)

Through my whole time testing Kuai's fixes, I never saw a hang. The behaviour seen is the same as with the workaround of preventing systemd from watching nbd devices with inotify (also sketched below). I think we should go with Kuai's patches.
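For reference, the testcase boils down to roughly the following loop. This is a minimal sketch rather than the exact script from my original message; the device node, image name, and size match the tool output quoted above:

#!/bin/sh
# Minimal reproducer sketch: repeatedly create a qcow2 image, attach it
# to an nbd device with qemu-nbd, format it, then disconnect. The hang
# hits when the disconnect races with another opener of /dev/nbd15.
modprobe nbd

while true; do
	qemu-img create -f qcow2 foo.img 500M
	qemu-nbd --connect=/dev/nbd15 foo.img
	mkfs.ext4 -L root /dev/nbd15
	qemu-nbd --disconnect /dev/nbd15
done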
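And the inotify workaround mentioned above is, in sketch form, a udev rule that drops the watch udev places on nbd devices, so that closing the device no longer triggers a re-open from systemd-udevd. The rule file name here is arbitrary, my own choice for illustration:

# Workaround sketch: disable the inotify watch on nbd devices.
# 'nowatch' clears the watch that the default block rules request.
cat > /etc/udev/rules.d/97-nbd-nowatch.rules <<'EOF'
ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
EOF
udevadm control --reload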
So for Kuai's two patches:

Tested-by: Matthew Ruffell <matthew.ruffell@xxxxxxxxxxxxx>

Thanks,
Matthew

On Sat, May 14, 2022 at 3:39 PM yukuai (C) <yukuai3@xxxxxxxxxx> wrote:
>
> On 2022/05/13 21:13, Josef Bacik wrote:
> > On Fri, May 13, 2022 at 02:56:18PM +1200, Matthew Ruffell wrote:
> >> Hi Josef,
> >>
> >> Just a friendly ping, I am more than happy to test a patch, if you send it
> >> inline in the email, since the pastebin you used expired after 1 day, and I
> >> couldn't access it.
> >>
> >> I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
> >> and they indeed fix the hang. Thank you Yu.
> >>
> >> [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> >> https://lists.debian.org/nbd/2022/04/msg00212.html
> >>
> >> [2] nbd: fix io hung while disconnecting device
> >> https://lists.debian.org/nbd/2022/04/msg00207.html
> >>
> >> I am also happy to test any patches to fix the I/O errors.
> >>
> >
> > Sorry, you caught me on vacation before and I forgot to reply. Here's part one
> > of the patch I wanted you to try, which fixes the io hang part. Thanks,
> >
> > Josef
> >
> >
> > From 0a6123520380cb84de8ccefcccc5f112bce5efb6 Mon Sep 17 00:00:00 2001
> > Message-Id: <0a6123520380cb84de8ccefcccc5f112bce5efb6.1652447517.git.josef@xxxxxxxxxxxxxx>
> > From: Josef Bacik <josef@xxxxxxxxxxxxxx>
> > Date: Sat, 23 Apr 2022 23:51:23 -0400
> > Subject: [PATCH] timeout thing
> >
> > ---
> >  drivers/block/nbd.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > index 526389351784..ab365c0e9c04 100644
> > --- a/drivers/block/nbd.c
> > +++ b/drivers/block/nbd.c
> > @@ -1314,7 +1314,10 @@ static void nbd_config_put(struct nbd_device *nbd)
> >  		kfree(nbd->config);
> >  		nbd->config = NULL;
> >
> > -		nbd->tag_set.timeout = 0;
> > +		/* Reset our timeout to something sane. */
> > +		nbd->tag_set.timeout = 30 * HZ;
> > +		blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
> > +
> >  		nbd->disk->queue->limits.discard_granularity = 0;
> >  		nbd->disk->queue->limits.discard_alignment = 0;
> >  		blk_queue_max_discard_sectors(nbd->disk->queue, 0);
>
> Hi, Josef
>
> This seems to try to fix the same problem that I described here:
>
> nbd: fix io hung while disconnecting device
> https://lists.debian.org/nbd/2022/04/msg00207.html
>
> There is still some io that is stuck, which means the device is
> probably still opened. Thus nbd_config_put() can't reach here.
> I'm afraid this patch can't fix the io hang.
>
> Matthew, can you try a test with this patch together with my patch below
> to confirm my thought?
>
> nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> https://lists.debian.org/nbd/2022/04/msg00212.html
>
> Thanks,
> Kuai