[PROBLEM] nbd requests become stuck when devices watched by inotify emit udev uevent changes

Matthew Ruffell <matthew.ruffell@xxxxxxxxxxxxx> · Fri, 22 Apr 2022 17:42:24 +1200

Dear maintainers of the nbd subsystem,

A user has come across an issue which causes the nbd module to hang after a
disconnect where a write has been made to a qemu qcow image file, with qemu-nbd
being the server.

The issue is easily reproducible with the following:

Ubuntu 20.04, 22.04 or Fedora 36
Linux 5.18-rc2 or earlier (have tested 5.18-rc2, 5.15, 5.4, 4.15)
QEMU 6.2 or earlier

Instructions to reproduce:
==========================

$ sudo apt install qemu-nbd

$ cat << EOF >> reproducer.sh
#!/bin/bash

sudo modprobe nbd

while :
do
        qemu-img create -f qcow2 foo.img 500M
        sudo qemu-nbd --disconnect /dev/nbd15 || true
        sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
        sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
        sudo qemu-nbd --disconnect /dev/nbd15
done
EOF

$ chmod +x reproducer.sh
$ ./reproducer.sh

On Ubuntu, the terminal will pause within a minute or two, and dmesg will report
a lot of I/O errors, followed by hung task timeouts. On Fedora, it takes a 
little longer, but it will break in the same way within 10 minutes.

An example kernel log is below:

https://paste.ubuntu.com/p/5ZjC5b8MR7/

Debugging done:
===============

Looking at syslog, it seems systemd-udevd gets stuck, and enters D state.

systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time

$ ps aux
...
419    1194 root     D     0.1 systemd-udevd   -

I rebooted, and disabled systemd-udevd and its sockets, with:

$ sudo systemctl stop systemd-udevd.service
$ sudo systemctl stop systemd-udevd-control.socket
$ sudo systemctl stop systemd-udevd-kernel.socket

When running the reproducer again, everything works fine, and nbd subsystem does
not hang.

Looking at udev rules, I looked at the differences between those in Ubuntu 18.04
where the issue does not occur, and 20.04, where it does, and came across:

/usr/lib/udev/rules.d/60-block.rules

In 18.04:
# watch metadata changes, caused by tools closing the device node which was opened for writing
ACTION!="remove", SUBSYSTEM=="block", KERNEL=="loop*|nvme*|sd*|vd*|xvd*|pmem*|mmcblk*", OPTIONS+="watch"

In 20.04:
# watch metadata changes, caused by tools closing the device node which was opened for writing
ACTION!="remove", SUBSYSTEM=="block", \
  KERNEL=="loop*|mmcblk*[0-9]|msblk*[0-9]|mspblk*[0-9]|nvme*|sd*|vd*|xvd*|bcache*|cciss*|dasd*|ubd*|ubi*|scm*|pmem*|nbd*|zd*", \
  OPTIONS+="watch"

The difference is the OPTIONS+="watch" being added to nbd* any event not remove.

When I deleted nbd* and retried the reproducer again, everything works smoothly.

Looking at the manpage for udev:

> watch
>   Watch the device node with inotify; when the node is closed after being 
>   opened for writing, a change uevent is synthesized.
>
> nowatch
>   Disable the watching of a device node with inotify.

It appears that there is some sort of problem where systemd-udevd uses inotify
to watch for updates to the device metadata, and it blocks a subsequent
disconnect request, causing it to fail with:

block nbd15: Send disconnect failed -32

After which we start seeing stuck requests:

block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds

All userspace calls to the nbd module hang, and the system has to be rebooted.

Workaround:
===========

We can workaround the issue by adding a higher priority udev rule to not watch
nbd* devices.

$ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
# Disable inotify watching of change events for NBD devices
ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
EOF

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger

Help on debugging the problem:
==============================

I need some help debugging the problem, as I am not quite sure how to trace
the interactions between inotify and nbd.

I am happy to help debug the issue, or try any patches that gather debugging
data or any potential fixes.

Thanks,
Matthew