Hi Josef, Kuai,

Josef, thank you for attaching your patch. No worries about being on vacation; I hope you enjoyed your time off.

Josef, I built your patch on top of 5.18-rc6 with no other patches applied, and ran the testcase from my original message (a sketch of the loop is included at the end of this mail). After 3 loops, a hang occurred, and we see the usual -32 (EPIPE) error:

May 16 03:38:35 focal-nbd kernel: block nbd15: NBD_DISCONNECT
May 16 03:38:35 focal-nbd kernel: block nbd15: Send disconnect failed -32

The hang lasted 30 seconds, no doubt caused by the "30 * HZ" timeout in your patch, and then things started moving forward again:

May 16 03:39:05 focal-nbd kernel: block nbd15: Connection timed out, retrying (0/1 alive)
May 16 03:39:05 focal-nbd kernel: block nbd15: Connection timed out, retrying (0/1 alive)
May 16 03:39:05 focal-nbd kernel: blk_print_req_error: 128 callbacks suppressed
May 16 03:39:05 focal-nbd kernel: I/O error, dev nbd15, sector 1023488 op 0x0:(READ) flags 0x80700 phys_seg 14 prio class 0
May 16 03:39:05 focal-nbd kernel: I/O error, dev nbd15, sector 1023608 op 0x0:(READ) flags 0x80700 phys_seg 16 prio class 0
May 16 03:39:05 focal-nbd kernel: block nbd15: Device being setup by another task

Note the timestamp increment of 30s. There was a whole host of I/O errors, and after a few more loops the hang occurred again, again lasting 30 seconds, and then the testcase ran a few more loops before getting stuck once more.

Pastebin of journalctl: https://paste.ubuntu.com/p/Cx6MBC8Vgj/

Unfortunately, your patch doesn't quite solve the issue.

Kuai, I tested your suspicion by building Josef's patch on top of 5.18-rc6 with your patch below applied:

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

The behaviour this time was different from Josef's patch alone. On the very second iteration of the loop, I got a bunch of I/O errors, and the nbd subsystem hung and did not recover. I started getting stuck request messages and the usual hung task timeout oops messages.

Pastebin of journalctl here: https://paste.ubuntu.com/p/C9rjckrWtp/

I went back and did some more testing of Kuai's two commits:

nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
https://lists.debian.org/nbd/2022/04/msg00212.html

nbd: fix io hung while disconnecting device
https://lists.debian.org/nbd/2022/04/msg00207.html

I left the testcase running for about 20 minutes, and it never hung. It did get a bit racy from time to time trying to take the write lock on the qcow2 image, where the disconnect completed after the call to mkfs.ext4 had started, but simply answering "y" let the loop run for another 5 minutes before the race occurred again:

Formatting 'foo.img', fmt=qcow2 size=524288000 cluster_size=65536 lazy_refcounts=off refcount_bits=16
qemu-img: foo.img: Failed to get "write" lock
Is another process using the image [foo.img]?
/dev/nbd15 disconnected
mke2fs 1.45.5 (07-Jan-2020)
/dev/nbd15 contains a ext4 file system labelled 'root'
	created on Mon May 16 05:23:01 2022
Proceed anyway? (y,N)

Through my whole time testing Kuai's fixes, I never saw a hang. The behaviour seen is the same as with the workaround of preventing systemd from watching nbd devices with inotify (also sketched below). I think we should go with Kuai's patches.
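For reference, the testcase boils down to roughly the following loop. This is a minimal sketch rather than the exact script from my original message; the device node, image name, and size match the tool output quoted above:

#!/bin/sh
# Minimal reproducer sketch: repeatedly create a qcow2 image, attach it
# to an nbd device with qemu-nbd, format it, then disconnect. The hang
# hits when the disconnect races with another opener of /dev/nbd15.
modprobe nbd

while true; do
	qemu-img create -f qcow2 foo.img 500M
	qemu-nbd --connect=/dev/nbd15 foo.img
	mkfs.ext4 -L root /dev/nbd15
	qemu-nbd --disconnect /dev/nbd15
done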
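And the inotify workaround mentioned above is, in sketch form, a udev rule that drops the watch udev places on nbd devices, so that closing the device no longer triggers a re-open from systemd-udevd. The rule file name here is arbitrary, my own choice for illustration:

# Workaround sketch: disable the inotify watch on nbd devices.
# 'nowatch' clears the watch that the default block rules request.
cat > /etc/udev/rules.d/97-nbd-nowatch.rules <<'EOF'
ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
EOF
udevadm control --reload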
So for Kuai's two patches:

Tested-by: Matthew Ruffell <matthew.ruffell@xxxxxxxxxxxxx>

Thanks,
Matthew

On Sat, May 14, 2022 at 3:39 PM yukuai (C) <yukuai3@xxxxxxxxxx> wrote:
>
> On 2022/05/13 21:13, Josef Bacik wrote:
> > On Fri, May 13, 2022 at 02:56:18PM +1200, Matthew Ruffell wrote:
> >> Hi Josef,
> >>
> >> Just a friendly ping, I am more than happy to test a patch, if you send it
> >> inline in the email, since the pastebin you used expired after 1 day, and I
> >> couldn't access it.
> >>
> >> I came across and tested Yu Kuai's patches [1][2] which are for the same issue,
> >> and they indeed fix the hang. Thank you Yu.
> >>
> >> [1] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> >> https://lists.debian.org/nbd/2022/04/msg00212.html
> >>
> >> [2] nbd: fix io hung while disconnecting device
> >> https://lists.debian.org/nbd/2022/04/msg00207.html
> >>
> >> I am also happy to test any patches to fix the I/O errors.
> >>
> >
> > Sorry, you caught me on vacation before and I forgot to reply. Here's part one
> > of the patch I wanted you to try, which fixes the io hang part. Thanks,
> >
> > Josef
> >
> >
> > From 0a6123520380cb84de8ccefcccc5f112bce5efb6 Mon Sep 17 00:00:00 2001
> > Message-Id: <0a6123520380cb84de8ccefcccc5f112bce5efb6.1652447517.git.josef@xxxxxxxxxxxxxx>
> > From: Josef Bacik <josef@xxxxxxxxxxxxxx>
> > Date: Sat, 23 Apr 2022 23:51:23 -0400
> > Subject: [PATCH] timeout thing
> >
> > ---
> >  drivers/block/nbd.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> > index 526389351784..ab365c0e9c04 100644
> > --- a/drivers/block/nbd.c
> > +++ b/drivers/block/nbd.c
> > @@ -1314,7 +1314,10 @@ static void nbd_config_put(struct nbd_device *nbd)
> >  		kfree(nbd->config);
> >  		nbd->config = NULL;
> >
> > -		nbd->tag_set.timeout = 0;
> > +		/* Reset our timeout to something sane. */
> > +		nbd->tag_set.timeout = 30 * HZ;
> > +		blk_queue_rq_timeout(nbd->disk->queue, 30 * HZ);
> > +
> >  		nbd->disk->queue->limits.discard_granularity = 0;
> >  		nbd->disk->queue->limits.discard_alignment = 0;
> >  		blk_queue_max_discard_sectors(nbd->disk->queue, 0);
>
> Hi, Josef
>
> This seems to try to fix the same problem that I described here:
>
> nbd: fix io hung while disconnecting device
> https://lists.debian.org/nbd/2022/04/msg00207.html
>
> There is still some io that is stuck, which means the device is
> probably still opened. Thus nbd_config_put() can't reach here.
> I'm afraid this patch can't fix the io hang.
>
> Matthew, can you try a test with this patch together with my patch below
> to confirm my thought?
>
> nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> https://lists.debian.org/nbd/2022/04/msg00212.html
>
> Thanks,
> Kuai