I'm trying to develop a Kubernetes CSI driver that, among other things, will be creating and tearing down NBD connections to other hosts in the cluster, and I'm looking for idempotency design ideas.

Right now, when you call `nbd-client $host 10809`, nbd-client uses the netlink interface to allocate an unused /dev/nbd$N device and outputs the name of the device it created, then the userspace process exits (unless TLS is in use, in which case the userspace process sticks around to translate the TLS TCP traffic into plaintext over a Unix socketpair to the kernel). That means any later `nbd-client -c /dev/nbd$N` can output the pid of a process that no longer exists (or, less likely, one that has been recycled by an unrelated process), making it very difficult to write a race-free implementation that can look up which server(s) are currently in use by which NBD device(s), since I can't use /proc/$pid/cmdline to see which server I originally connected to.

In one direction, if I create an NBD device first and then record the device name in a k8s CR, I run the risk that the CR update fails. A second attempt at `nbd-client $host 10809` will NOT report that the server is already in use, but will happily allocate yet another device, so the only safe thing to do when the attempt to record the device name in a CR fails is to immediately call `nbd-client -d /dev/nbd$N` rather than reuse the first device, to avoid leaking it.

In the other direction, if I do successfully record which /dev/nbd$N is tied to a server after the device is created (and tear down the client device if my recording fails), then I have a race in the opposite direction: when I know it is time to clean up the device because the server is going (or has already gone) away, calling `nbd-client -d /dev/nbd$N` more than once risks closing an unrelated device on the second call, if someone else has reused the same id for an unrelated server in the meantime. I have to be careful not to clean up the device more than once, while still balancing the competing cleanups (tearing down both my mapping and the client): I shouldn't delete my mapping until I know the device is gone (so I don't leak the device), but if cleaning up my mapping fails on the first attempt and I have to retry it later, the second attempt must not retry deleting the device. (A rough sketch of this ordering appears below.)

It is possible to use `nbd-client -L $host $port /dev/nbd$N`, where _I_ manage the device numbers instead of letting netlink do auto-allocation, but then I'm risking a race in the opposite direction: if any other process on the system is also trying to allocate NBD devices, the name I thought was unused when I called nbd-client could get tied to a different server in parallel by that other process, at which point I'm no longer guaranteed that /dev/nbd$N is connected to the server I want. So I really _do_ want to use the netlink interface.

Is there an existing set of ioctls where the creation of an NBD device could associate a user-space tag with the device, so that I can later query the device to get the tag back?
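For concreteness, the careful ordering I'm describing looks roughly like this (just a sketch: `record_mapping`, `already_disconnected`, `mark_disconnected`, and `remove_mapping` are placeholders for whatever does the CR bookkeeping, and extracting the device name from nbd-client's output is hand-waved):

    # Attach: let netlink pick the device, then record it; if recording
    # fails, tear the device down immediately so it is not leaked.
    dev=$(nbd-client $host 10809)   # device name parsed from the output
    record_mapping "$dev" "nbd://$host:10809/$export" ||
        { nbd-client -d "$dev"; exit 1; }

    # Detach: the disconnect must run at most once, because a retry could
    # hit an unrelated connection that has since been given the same id.
    if ! already_disconnected "$dev"; then
        nbd-client -d "$dev" || exit 1
        mark_disconnected "$dev"
    fi
    # Only once the device is known to be gone is it safe to drop the
    # mapping; a later retry of a failed removal skips straight to here.
    remove_mapping "$dev"

Of course, the marker updated by `mark_disconnected` is just more state that can fail to persist at the wrong moment, which is part of why I'd prefer the kernel to carry the association.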
For the tag itself, a finite-length string would be awesome (I could store "nbd://$ip:$port/$export" as the tag on creation, to know precisely which server the device is talking to), but even an integer tag (32- or 64-bit) might be enough: it is easier to choose an integer in the full 2^64 namespace that is unlikely to collide with other processes on the system than it is to avoid collisions in the limited first few $N of the /dev/nbd$N device names, which are handed out lowest-unused-integer first.

If not, would it be worth adding such ioctls to the NBD driver? Usage-wise, I'm envisioning something like `nbd-client --tag $mytag $host $port`, which creates the kernel device, associates the tag with it, and outputs /dev/nbd$N on success; then later `nbd-client --tag -c /dev/nbd$N` to output the tag in addition to the originating pid if the NBD device is still connected to the server. Maybe even have `nbd-client --tag $tag -d /dev/nbd$N`, which either atomically succeeds (if the device indeed has that tag) or fails (if the tag does not match what was already associated with the device).

But if there are no such ioctls (and no desire to accept a patch to add them), then it looks like I _have_ to use /dev/nbd$N itself as the tag that I map back to server details, and just be extremely careful in my bookkeeping that I'm not racing in a way that leaks devices or closes unintended ones, regardless of whether there are secondary failures in the k8s bookkeeping that tracks the mappings. Ideas on how I can make this more robust would be appreciated (for example, maybe it is more reliable to use symlinks in the filesystem as my data store of mapped tags, sketched below, than to try to rely directly on k8s CR updates to synchronize).
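To expand on that last parenthetical, the symlink idea would look something like this (again just a sketch, with a made-up state directory; it narrows the windows because symlink(2) creation is atomic, but it cannot close them, since the kernel still knows nothing about the association):

    statedir=/run/my-csi-driver/nbd   # hypothetical location
    dev=$(nbd-client $host 10809)     # device name parsed from the output

    # Record: atomic, and fails instead of silently overwriting if a stale
    # entry for this device is still lying around.
    ln -s "nbd://$host:10809/$export" "$statedir/${dev##*/}" ||
        { nbd-client -d "$dev"; exit 1; }

    # Query: which server was /dev/nbd3 attached to?
    readlink "$statedir/nbd3"

    # Teardown: disconnect, then drop the link; a retry that finds the link
    # already gone knows there is nothing left to do.
    nbd-client -d /dev/nbd3 && rm "$statedir/nbd3"

But the window between a successful `-d` and the `rm` is still exactly the double-cleanup problem described above, which is why a tag the kernel itself stores and reports back would be so much nicer.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization: qemu.org | libguestfs.org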