I'm trying to develop a Kubernetes CSI driver that, among other things, will be creating and tearing down NBD connections to other hosts in the cluster, and I'm looking for idempotency design ideas.

Right now, when you call `nbd-client $host 10809`, nbd-client uses the netlink interface to allocate an unused /dev/nbd$N device and outputs the name of the device it created, then the userspace process exits (unless TLS is in use, in which case the userspace process sticks around to translate the TLS TCP traffic into plaintext over a Unix socketpair to the kernel). That means any later `nbd-client -c /dev/nbd$N` can output the pid of a process that no longer exists (or, less likely, one that has been recycled by an unrelated process), making it very difficult to write a race-free implementation that can look up which server(s) are currently in use by which NBD device(s), since I can't use /proc/$pid/cmdline to see which server I originally connected to.

In one direction, if I create an NBD device first and then record the device name in a k8s CR, I run the risk that the CR update fails. A second attempt at `nbd-client $host 10809` will NOT report that the server is already in use, but will happily allocate yet another device, so the only safe thing to do when the attempt to record the device name in a CR fails is to immediately call `nbd-client -d /dev/nbd$N` rather than reuse the first device, to avoid leaking it.

In the other direction, if I do successfully record which /dev/nbd$N is tied to a server after the device is created (and tear down the client device if my recording fails), then I have a race in the opposite direction: when I know it is time to clean up the device because the server is going (or has already gone) away, calling `nbd-client -d /dev/nbd$N` more than once risks closing an unrelated device on the second call, if someone else has reused the same id for an unrelated server in the meantime. I have to be careful not to clean up the device more than once, while still balancing the competing cleanups (tearing down both my mapping and the client): I shouldn't delete my mapping until I know the device is gone (so I don't leak the device), but if cleaning up my mapping fails on the first attempt and I have to retry it later, the second attempt must not retry deleting the device. (A rough sketch of this ordering appears below.)

It is possible to use `nbd-client -L $host $port /dev/nbd$N`, where _I_ manage the device numbers instead of letting netlink do auto-allocation, but then I'm risking a race in the opposite direction: if any other process on the system is also trying to allocate NBD devices, the name I thought was unused when I called nbd-client could get tied to a different server in parallel by that other process, at which point I'm no longer guaranteed that /dev/nbd$N is connected to the server I want. So I really _do_ want to use the netlink interface.

Is there an existing set of ioctls where the creation of an NBD device could associate a user-space tag with the device, so that I can later query the device to get the tag back?
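For concreteness, the careful ordering I'm describing looks roughly like this (just a sketch: `record_mapping`, `already_disconnected`, `mark_disconnected`, and `remove_mapping` are placeholders for whatever does the CR bookkeeping, and extracting the device name from nbd-client's output is hand-waved):

    # Attach: let netlink pick the device, then record it; if recording
    # fails, tear the device down immediately so it is not leaked.
    dev=$(nbd-client $host 10809)   # device name parsed from the output
    record_mapping "$dev" "nbd://$host:10809/$export" ||
        { nbd-client -d "$dev"; exit 1; }

    # Detach: the disconnect must run at most once, because a retry could
    # hit an unrelated connection that has since been given the same id.
    if ! already_disconnected "$dev"; then
        nbd-client -d "$dev" || exit 1
        mark_disconnected "$dev"
    fi
    # Only once the device is known to be gone is it safe to drop the
    # mapping; a later retry of a failed removal skips straight to here.
    remove_mapping "$dev"

Of course, the marker updated by `mark_disconnected` is just more state that can fail to persist at the wrong moment, which is part of why I'd prefer the kernel to carry the association.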
For the tag itself, a finite-length string would be awesome (I could store "nbd://$ip:$port/$export" as the tag on creation, to know precisely which server the device is talking to), but even an integer tag (32- or 64-bit) might be enough: it is easier to choose an integer in the full 2^64 namespace that is unlikely to collide with other processes on the system than it is to avoid collisions in the limited first few $N of the /dev/nbd$N device names, which are handed out lowest-unused-integer first.

If not, would it be worth adding such ioctls to the NBD driver? Usage-wise, I'm envisioning something like `nbd-client --tag $mytag $host $port`, which creates the kernel device, associates the tag with it, and outputs /dev/nbd$N on success; then later `nbd-client --tag -c /dev/nbd$N` to output the tag in addition to the originating pid if the NBD device is still connected to the server. Maybe even have `nbd-client --tag $tag -d /dev/nbd$N`, which either atomically succeeds (if the device indeed has that tag) or fails (if the tag does not match what was already associated with the device).

But if there are no such ioctls (and no desire to accept a patch to add them), then it looks like I _have_ to use /dev/nbd$N itself as the tag that I map back to server details, and just be extremely careful in my bookkeeping that I'm not racing in a way that leaks devices or closes unintended ones, regardless of whether there are secondary failures in the k8s bookkeeping that tracks the mappings. Ideas on how I can make this more robust would be appreciated (for example, maybe it is more reliable to use symlinks in the filesystem as my data store of mapped tags, sketched below, than to try to rely directly on k8s CR updates to synchronize).
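To expand on that last parenthetical, the symlink idea would look something like this (again just a sketch, with a made-up state directory; it narrows the windows because symlink(2) creation is atomic, but it cannot close them, since the kernel still knows nothing about the association):

    statedir=/run/my-csi-driver/nbd   # hypothetical location
    dev=$(nbd-client $host 10809)     # device name parsed from the output

    # Record: atomic, and fails instead of silently overwriting if a stale
    # entry for this device is still lying around.
    ln -s "nbd://$host:10809/$export" "$statedir/${dev##*/}" ||
        { nbd-client -d "$dev"; exit 1; }

    # Query: which server was /dev/nbd3 attached to?
    readlink "$statedir/nbd3"

    # Teardown: disconnect, then drop the link; a retry that finds the link
    # already gone knows there is nothing left to do.
    nbd-client -d /dev/nbd3 && rm "$statedir/nbd3"

But the window between a successful `-d` and the `rm` is still exactly the double-cleanup problem described above, which is why a tag the kernel itself stores and reports back would be so much nicer.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org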