On Wed, Apr 27, 2022 at 09:29:12AM -0400, James Bottomley wrote:
> On Tue, 2022-04-26 at 14:12 -0400, Demi Marie Obenour wrote:
> > Right now, opening block devices in a race-free way is incredibly
> > hard.
>
> Could you be more specific about what the race you're having problems
> with is? What is racing.

If I open /dev/mapper/qubes_dom0-vm--sys--net--private, it is possible
that something has destroyed the corresponding device and created a new
one with the same kernel name, *before* udev has managed to unlink the
device node. As a result, I wind up opening the wrong device.

> > The only reasonable approach I know of is sd_device_new_from_path() +
> > sd_device_open(), and is only available in systemd git main. It also
> > requires waiting on systemd-udev to have processed udev rules, which
> > can be a bottleneck.
>
> This doesn't actually seem to be in my copy of systemd.

That’s because it is not in any release yet.

> > There are better approaches in various special cases, such as using
> > device-mapper ioctls to check that the device one has opened still
> > has the name and/or UUID one expects. However, none of them works
> > for a plain call to open(2).
>
> Just so we're clear: if you call open on, say /dev/sdb1 and something
> happens to hot unplug and then replug a different device under that
> node, the file descriptor you got at open does *not* point to the new
> node. It points to a dead device responder that errors everything.
>
> The point being once you open() something, the file descriptor is
> guaranteed to point to the same device (or error).

That doesn’t help if the unplug and replug happens between passing the
path and udev having purged the now-stale symlink.

> > A much better approach would be for udev to point its symlinks at
> > "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> > "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.
> > A filesystem would then be mounted at "/dev/disk/by-diskseq" that
> > provides for race-free opening of these paths. This could be
> > implemented in userspace using FUSE, either with difficulty using the
> > current kernel API, or easily and efficiently using a new kernel API
> > for opening a block device by diskseq + partition. However, I think
> > this should be handled by the Linux kernel itself.
> >
> > What would be necessary to get this into the kernel? I would like to
> > implement this, but I don’t have the time to do so anytime soon. Is
> > anyone else interested in taking this on? I suspect the kernel code
> > needed to implement this would be quite a bit smaller than the FUSE
> > implementation.
>
> So it sounds like the problem is you want to be sure that the device
> doesn't change after you've called libblkid to identify it but before
> you call open? If that's so, the way you do this in userspace is to
> call libblkid again after the open. If the before and after id match,
> you're as sure as you can be the open was of the right device.

The devices I am working with are raw-format VM disks that contain
untrusted data. They are identified not by their content, which the VM
has complete control over, but by various sysfs attributes such as
dm/name and dm/uuid. And they need to be passed to interfaces, such as
libvirt and cryptsetup, that only accept device paths.

I can work around this in the case of cryptsetup by using the
libcryptsetup library and/or holding a file descriptor open, but neither
of those will work for libvirt since libvirtd is a separate process and
I cannot pass a file descriptor to it. Furthermore, there is no way to
make libvirtd do any post-open() checking on the file descriptor it has
obtained. While I plan to add a workaround in libxl and blkback for
loop and device-mapper devices, it is not reasonable to expect every
userspace tool to do the same.
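
As a rough illustration of what such a post-open() check involves for a
device-mapper device, here is a minimal Python sketch. It is not code
from libxl, blkback, or any other project. The function names
sysfs_block_dir() and open_checked() are hypothetical, and the sketch
assumes that holding the file descriptor open keeps the device's
major:minor number from being reused while dm/name is read from sysfs.

```python
import os
import stat


def sysfs_block_dir(rdev, sysfs_root="/sys"):
    """Return the sysfs directory for a block device number (hypothetical helper)."""
    return os.path.join(
        sysfs_root, "dev", "block", f"{os.major(rdev)}:{os.minor(rdev)}"
    )


def open_checked(path, expected_dm_name, sysfs_root="/sys"):
    """Open `path`, then check dm/name of the device that was actually opened.

    Returns an open file descriptor that the caller must close.
    Raises ValueError if the node is not a block device or the name differs.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # fstat() describes the opened device, not whatever the path
        # points at by the time we look.
        st = os.fstat(fd)
        if not stat.S_ISBLK(st.st_mode):
            raise ValueError(f"{path} is not a block device")
        name_file = os.path.join(
            sysfs_block_dir(st.st_rdev, sysfs_root), "dm", "name"
        )
        with open(name_file) as f:
            actual = f.read().rstrip("\n")
        if actual != expected_dm_name:
            raise ValueError(
                f"opened {actual!r}, expected {expected_dm_name!r}"
            )
        return fd
    except BaseException:
        # Do not leak the descriptor when the check fails.
        os.close(fd)
        raise
```

A caller would use it as
fd = open_checked("/dev/mapper/qubes_dom0-vm--sys--net--private",
"qubes_dom0-vm--sys--net--private"). The check only helps a process
that does the open itself, so it does nothing for a tool that is handed
a bare path.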

The approach I am suggesting avoids this problem entirely, because
/dev/mapper/qubes_dom0-vm--sys--net--private is now a symlink to a
device node under /dev/disk/by-diskseq/$DISKSEQ. Those are never, ever
reused. When the device goes away, the device node goes away too, and
so any attempt to open the symlink (without O_PATH|O_NOFOLLOW) gets
-ENOENT as it should.

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab