Re: Race-free block device opening

On Wed, Apr 27, 2022 at 09:29:12AM -0400, James Bottomley wrote:
> On Tue, 2022-04-26 at 14:12 -0400, Demi Marie Obenour wrote:
> > Right now, opening block devices in a race-free way is incredibly
> > hard.
> 
> Could you be more specific about what the race you're having problems
> with is?  What is racing.

If I open /dev/mapper/qubes_dom0-vm--sys--net--private, it is possible
that something has destroyed the corresponding device and created a new
one with the same kernel name, *before* udev has managed to unlink the
device node.  As a result, I wind up opening the wrong device.

> > The only reasonable approach I know of is sd_device_new_from_path() +
> > sd_device_open(), and is only available in systemd git main.  It also
> > requires waiting on systemd-udev to have processed udev rules, which
> > can be a bottleneck.
> 
> This doesn't actually seem to be in my copy of systemd.

That’s because it is not in any release yet.

> >   There are better approaches in various special cases, such as using
> > device-mapper ioctls to check that the device one has opened still
> > has the name and/or UUID one expects.  However, none of them works
> > for a plain call to open(2).
> 
> Just so we're clear: if you call open on, say /dev/sdb1 and something
> happens to hot unplug and then replug a different device under that
> node, the file descriptor you got at open does *not* point to the new
> node.  It points to a dead device responder that errors everything.
> 
> The point being once you open() something, the file descriptor is
> guaranteed to point to the same device (or error).

That doesn’t help if the unplug and replug happen between the path
being passed and udev purging the now-stale symlink.

> > A much better approach would be for udev to point its symlinks at
> > "/dev/disk/by-diskseq/$DISKSEQ" for non-partition disk devices, or at
> > "/dev/disk/by-diskseq/${DISKSEQ}p${PARTITION}" for partitions.  A
> > filesystem would then be mounted at "/dev/disk/by-diskseq" that
> > provides for race-free opening of these paths.  This could be
> > implemented in userspace using FUSE, either with difficulty using the
> > current kernel API, or easily and efficiently using a new kernel API
> > for opening a block device by diskseq + partition.  However, I think
> > this should be handled by the Linux kernel itself.
> > 
> > What would be necessary to get this into the kernel?  I would like to
> > implement this, but I don’t have the time to do so anytime soon.  Is
> > anyone else interested in taking this on?  I suspect the kernel code
> > needed to implement this would be quite a bit smaller than the FUSE
> > implementation.
> 
> So it sounds like the problem is you want to be sure that the device
> doesn't change after you've called libblkid to identify it but before
> you call open?  If that's so, the way you do this in userspace is to
> call libblkid again after the open.  If the before and after id match,
> you're as sure as you can be the open was of the right device.

The devices I am working with are raw-format VM disks that contain
untrusted data.  They are identified not by their content, which the VM
has complete control over, but by various sysfs attributes such as
dm/name and dm/uuid.  And they need to be passed to interfaces, such as
libvirt and cryptsetup, that only accept device paths.
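For a process that performs the open() itself, a post-open check against
those sysfs attributes could look roughly like the sketch below.  The
function names and the sysfs_root parameter are made up for illustration
(the parameter only exists so the lookup can be pointed at a test tree).
The sketch assumes that /sys/dev/block/MAJOR:MINOR/dm/name and dm/uuid
are present for device-mapper devices, and that holding the file
descriptor open keeps the device number from being reused while the
check runs.  It does not protect against a rename of the device.

```python
import os
import stat


def dm_identity(major, minor, sysfs_root="/sys"):
    """Return (dm/name, dm/uuid) for block device major:minor, read from sysfs."""
    base = os.path.join(sysfs_root, "dev", "block", f"{major}:{minor}", "dm")
    values = []
    for attr in ("name", "uuid"):
        with open(os.path.join(base, attr)) as f:
            values.append(f.read().rstrip("\n"))
    return tuple(values)


def open_checked(path, expected_name, sysfs_root="/sys"):
    """Open path, then verify the opened device has the expected dm/name.

    Returns an open file descriptor; raises OSError if the path is not a
    block device or the name does not match.
    """
    fd = os.open(path, os.O_RDONLY | os.O_CLOEXEC)
    try:
        st = os.fstat(fd)  # describes what was actually opened, not the path
        if not stat.S_ISBLK(st.st_mode):
            raise OSError(f"{path} is not a block device")
        name, _uuid = dm_identity(
            os.major(st.st_rdev), os.minor(st.st_rdev), sysfs_root
        )
        if name != expected_name:
            raise OSError(
                f"{path}: expected dm name {expected_name!r}, got {name!r}"
            )
    except BaseException:
        os.close(fd)
        raise
    return fd
```

The check only helps the process that owns the file descriptor.  A tool
that receives a path still has to open it again on its own.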

I can work around this in the case of cryptsetup by using the
libcryptsetup library and/or holding a file descriptor open, but neither
of those will work for libvirt since libvirtd is a separate process and
I cannot pass a file descriptor to it.  Furthermore, there is no way to
make libvirtd do any post-open() checking on the file descriptor it has
obtained.  While I plan to add a workaround in libxl and blkback for
loop and device-mapper devices, it is not reasonable to expect every
userspace tool to do the same.  

The approach I am suggesting avoids this problem entirely, because
/dev/mapper/qubes_dom0-vm--sys--net--private is now a symlink to a
device node under /dev/disk/by-diskseq/$DISKSEQ.  Those are never, ever
reused.  When the device goes away, the device node goes away too, and
so any attempt to open the symlink (without O_PATH|O_NOFOLLOW) gets
-ENOENT as it should.
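
To make the naming scheme and the failure mode concrete, here is a small
sketch.  The /dev/disk/by-diskseq filesystem is the proposal above, not
an existing kernel interface, and both function names are hypothetical.
The second function only demonstrates ordinary open(2) behaviour on a
symlink whose target has gone away.

```python
import os


def diskseq_path(diskseq, partition=None, root="/dev/disk/by-diskseq"):
    """Build the proposed node path: $DISKSEQ or ${DISKSEQ}p${PARTITION}."""
    name = str(diskseq) if partition is None else f"{diskseq}p{partition}"
    return os.path.join(root, name)


def open_via_symlink(link):
    """Open a symlink to a by-diskseq node.

    Returns a file descriptor, or None if the target no longer exists
    (open(2) fails with ENOENT on a dangling symlink).
    """
    try:
        return os.open(link, os.O_RDONLY | os.O_CLOEXEC)
    except FileNotFoundError:
        return None
```

Because a diskseq is never reused, a stale symlink under this scheme can
only dangle.  It cannot resolve to a different device.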
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
Invisible Things Lab
