On 1/18/23 01:58, Jim Fehlig wrote: > On 12/16/22 08:04, Michal Privoznik wrote: >> Our code relies on mount events propagating into the namespace we >> create for a domain. However, there's one caveat. In v8.8.0-rc1~8 >> I've tried to make us detect differences in mount tables between >> the namespace in which libvirtd runs and the domain namespace. >> This is crucial for any mounts that happen after the domain was >> started (for instance new hugetlbfs can be mounted on say >> /dev/hugepages1G). >> >> Therefore, we take a look into /proc/$(pgrep qemu)/mounts to see >> what filesystems are mounted under /dev. Now, since we don't >> umount the original /dev, just mount a tmpfs over it, we get all >> the events (e.g. aforementioned hugetlbfs mount on >> /dev/hugepages1G), but we are not really able to access it >> because of the tmpfs that's placed on top. This then confuses our >> algorithm for detecting which filesystems are mounted (the >> algorithm is implemented in qemuDomainGetPreservedMounts()). >> >> To break the link between host's and guest's /dev we just need to >> umount() the original /dev in the namespace. Just before our >> artificially created tmpfs is moved into its place. >> >> Fixes: 46b03819ae8d833b11c2aaccb2c2a0361727f51b >> Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2151869#c6 >> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx> >> --- >> src/qemu/qemu_namespace.c | 5 +++++ >> 1 file changed, 5 insertions(+) >> >> diff --git a/src/qemu/qemu_namespace.c b/src/qemu/qemu_namespace.c >> index 90c0b90024..a6b9af1307 100644 >> --- a/src/qemu/qemu_namespace.c >> +++ b/src/qemu/qemu_namespace.c >> @@ -775,6 +775,11 @@ qemuDomainUnshareNamespace(virQEMUDriverConfig *cfg, >> goto cleanup; >> } >> + if (umount("/dev") < 0) { >> + virReportSystemError(errno, "%s", _("failed to umount devfs >> on /dev")); >> + return -1; >> + } > > Hi Michal, > > While doing some downstream testing of 9.0.0 with apparmor confinement > of libvirtd enabled, I encountered the above error when attempting to > start a VM. Ah, I keep forgetting about apparmor, sorry. > Additionally from the audit subsystem > > kernel: audit: type=1400 audit(1674002774.461:13): apparmor="DENIED" > operation="umount" profile="libvirtd" name="/dev/" pid=4778 > comm="rpc-libvirtd" > > Hmm, libvirtd needs to umount /dev, is that right? I added > > umount /dev/, > > to the usr.sbin.libvirtd profile, reloaded it, and was then able to > start the VM. That seems like a big hammer. I was expecting the umount > in /run/libvirt/qemu/*.dev, but the profile already contains a rule for > that path. Yeah, we really need to umount /dev and not /run/libvirt/.../*.dev. The reason is twofold: 1) security - we used to mount a tmpfs over an existing /dev. Therefore, the original /dev with all devices was still available. If a malicious QEMU did just 'umount /dev' it would unmount the top most /dev (i.e. our tmpfs) exposing the original /dev. I agree that this is more theoretical threat than anything, because QEMU wouldn't probably have the capability to play with mount table. 2) as the commit message explains, it's needed because of qemuDomainGetPreservedMounts(). Long story short, when a QEMU is running and we are deciding whether to make a new node under /dev we look at the mount table of the QEMU process. This is because it's only devtmpfs we are replacing. All other filesystems mounted under /dev are preserved (e.g. /dev/pts, /dev/mqueue, /dev/shm, ...). Since these are preserved (i.e. they are NOT a copy of their image in the parent namespace, they ARE the parent namespace mounts), we don't need to mknod() anything. For instance, when hotplugging a serial console with /dev/pts/123 backend, we must not create it in the QEMU's namespace as it already exists. But we need to mknod() say /dev/sda1. Now, up until this it worked just fine. Problem is, when a new filesystem is mounted after QEMU was started. What I've seen (and what the associated bug is about), is when another hugetlbfs is mounted after a guest was started. For instance: 1) virsh start guest 2) mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G 3) systemctl restart libvirtd 4) hotplug a <memory/> into the guest that is backed by 1GB hugepages At the last step, we want to take a look into the mount table of running QEMU and see whether /dev/hugepages1G is mounted. Well, if the original devtmpfs is kept mounted, then the mount event is propagated into QEMU's mount table (even though it's inaccessible). Maybe the following shell commands explain it better: Domain 'fedora' started # nsenter -a -t $(pgrep qemu) mount | grep huge hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) hugetlb on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb) # mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G # nsenter -a -t $(pgrep qemu) mount | grep huge hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) hugetlb on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb) hugetlbfs on /dev/hugepages1G type hugetlbfs (rw,relatime,pagesize=1024M) # nsenter -a -t $(pgrep qemu) ls -l /dev/hugepages1G ls: cannot access '/dev/hugepages1G': No such file or directory The reason why this new mount point exists in the QEMU's mount table is the original devtmpfs. Therefore, to break this 'link', we need to umount the original devtmpfs after which things work as expected: # virsh start fedora Domain 'fedora' started # nsenter -a -t $(pgrep qemu) mount | grep huge hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) hugetlb on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb) # mount -t hugetlbfs hugetlbfs /dev/hugepages1G -o pagesize=1G # nsenter -a -t $(pgrep qemu) mount | grep huge hugetlbfs on /dev/hugepages type hugetlbfs (rw,relatime,pagesize=2M) hugetlb on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb) I hope this explains the reason why we need to unmount the original /dev. Michal