Hi Craig,

On 5/9/22 03:20, Craig Small wrote:
> (resending as plain text as the first got bounced)
>
> Hi,
> I'm the maintainer of the psmisc package that provides system tools
> for things like fuser and killall. I am trying to establish if
> something I have found with the proc filesystem is as intended
> (knowing why would be nice) or if it's a strange corner-case bug.
>
> Apologies to the non-procfs maintainers but these two lists are what
> MAINTAINER said to go to. If you could CC me on replies that would be
> great.
>
> The proc file descriptor for a block device mounted in a different
> namespace will show the device id of that different namespace and not
> the device id of the process stat()ing the file.
>
> The issue came up in fuser not finding certain processes that were
> directly accessing a block device, see
> https://gitlab.com/psmisc/psmisc/-/issues/39 Programs such as lsof are
> caught by this too.
>
> My question is: When I am in the bash mount namespace (4026531840 below)
> then shouldn't all the device IDs be from that namespace? In other
> words, the device id of the dereferenced symlink and what it points to
> are the same (device id 5) and not symlink has 44 and /dev/dm-8 has 5.

I'm no expert here, but I think this is working as intended. It's
definitely confusing!

Consider a process in a separate mount namespace from the init
namespace, e.g. a container. Say I were to open python in that container
and then do `os.open("/etc/passwd", os.O_RDONLY)`. If I were to then
look at that process's file descriptors (from the host's perspective),
I'd see the following (pid 220854 is the python process in the
container):

$ ls -lah /proc/220854/fd/
total 0
dr-x------ 2 stepbren stepbren  0 May  9 11:06 .
dr-xr-xr-x 9 stepbren stepbren  0 May  9 11:06 ..
lrwx------ 1 stepbren stepbren 64 May  9 11:06 0 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May  9 11:06 1 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May  9 11:06 2 -> /dev/pts/0
lr-x------ 1 stepbren stepbren 64 May  9 11:06 3 -> /etc/passwd

$ cat /proc/220854/fd/3
<contents of container /etc/passwd>

$ cat /etc/passwd
<contents of host /etc/passwd>

$ stat -L /proc/220854/fd/3
  File: /proc/220854/fd/3
  Size: 900        Blocks: 8        IO Block: 4096   regular file
Device: 4eh/78d    Inode: 5508982   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-27 10:24:28.000000000 -0700
Modify: 2020-10-27 10:24:28.000000000 -0700
Change: 2020-10-27 10:24:30.255374190 -0700
 Birth: 2020-10-27 10:24:30.255374190 -0700

$ stat /etc/passwd
  File: /etc/passwd
  Size: 3216       Blocks: 8        IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 24917416   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-05-08 15:06:18.837117765 -0700
Modify: 2021-11-30 09:08:45.163873193 -0800
Change: 2021-11-30 09:08:45.167873237 -0800
 Birth: 2021-11-30 09:08:45.163873193 -0800

## INSIDE CONTAINER'S MOUNT NAMESPACE

$ stat /etc/passwd
  File: /etc/passwd
  Size: 900        Blocks: 8        IO Block: 4096   regular file
Device: 4eh/78d    Inode: 5508982   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-27 17:24:28.000000000 +0000
Modify: 2020-10-27 17:24:28.000000000 +0000
Change: 2020-10-27 17:24:30.255374190 +0000
 Birth: -

As you can see, it's the same behavior: the path /etc/passwd resolves to
a different inode in the init mount namespace compared to the
container's mount namespace.

The secret sauce of the /proc/$pid/fd/$fd files is that they don't
behave like a normal symlink: instead of using the file path to look up
the target inode, they directly look up the open file (and its inode) in
the target process's file descriptor table.
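
To make that concrete, here is a rough Python sketch of the same effect
without any namespaces involved. It is a simplification and not what
fuser or lsof do: it opens a file, replaces the path with a different
file, and then compares identities. The helper names (fd_identity, demo)
and the temporary "passwd" file are made up for illustration, and it
assumes Linux with /proc mounted.

```python
import os
import tempfile


def fd_identity(pid, fd):
    """Return (st_dev, st_ino) of the file behind /proc/<pid>/fd/<fd>.

    os.stat() follows the link, so this is what `stat -L` would report.
    """
    st = os.stat(f"/proc/{pid}/fd/{fd}")
    return st.st_dev, st.st_ino


def demo():
    """Show that /proc/<pid>/fd/<fd> tracks the open file, not its path."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "passwd")  # hypothetical stand-in file
        with open(path, "w") as f:
            f.write("original\n")

        fd = os.open(path, os.O_RDONLY)
        try:
            opened = os.fstat(fd)
            opened_id = (opened.st_dev, opened.st_ino)

            # Put a different file at the same path while the old one
            # is still held open by our descriptor.
            os.unlink(path)
            with open(path, "w") as f:
                f.write("replacement\n")

            by_path = os.stat(path)
            return {
                # Dereferencing the proc link reaches the file we opened.
                "proc_matches_fd": fd_identity(os.getpid(), fd) == opened_id,
                # Walking the path string reaches the replacement instead.
                "path_matches_fd": (by_path.st_dev, by_path.st_ino) == opened_id,
                # The text readlink() reports is only a description.
                "link_text": os.readlink(f"/proc/{os.getpid()}/fd/{fd}"),
            }
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

The namespace case is analogous: the path text and the object the
descriptor refers to are two separate things, and only the latter
decides what stat -L reports.
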
When you do a readlink(), the kernel has to create a path string, and it
has to do it from the perspective of the mount namespace of $pid, not
your monitoring command. The reason is that there may not even be a
corresponding path outside the mount namespace of $pid. Imagine I
created and opened "/etc/foobar" inside the container: that file may not
exist outside the container, so how could readlink() make a path
specific to your mount namespace?

Hopefully this helps, but maybe I'm off base and missing the thrust of
your question, let me know either way.

Stephen

>
> I get that if I could look at the device IDs in qemu or use nsenter to
> switch to its namespace, then the device should be 44 for the symlink
> and device (which it is and seems correct to me).
>
> How to replicate
> =============
> # uname -a
> Linux elmo 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15)
> x86_64 GNU/Linux
>
> The easiest way to replicate this is to make a qemu virtual machine and
> have it mount a block device. I suspect there are other ways, but I
> don't have many things that mount a device and switch namespaces. The
> qemu process (here it is 136775) will have a different mount namespace.
>
> # ps -o pid,mntns,comm $$ 136775
>     PID      MNTNS COMMAND
>  136775 4026532762 qemu-system-x86
>  142359 4026531840 bash
>
> File descriptor 23 is what qemu is using to mount the block device
> # ls -l /proc/136775/fd/23
> lrwx------ 1 libvirt-qemu libvirt-qemu 64 Apr 12 16:34
> /proc/136775/fd/23 -> /dev/dm-8
>
> However, the dereferenced symlink and where the symlink points to show
> different data.
>
> # stat -L /proc/136775/fd/23
>   File: /proc/136775/fd/23
>   Size: 0          Blocks: 0        IO Block: 4096   block special file
> Device: 2ch/44d    Inode: 9         Links: 1     Device type: fd,8
> Access: (0660/brw-rw----)  Uid: (64055/libvirt-qemu)   Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
>  Birth: -
>
> # stat /dev/dm-8
>   File: /dev/dm-8
>   Size: 0          Blocks: 0        IO Block: 4096   block special file
> Device: 5h/5d      Inode: 348       Links: 1     Device type: fd,8
> Access: (0660/brw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-04-12 16:15:12.684434884 +1000
> Modify: 2022-04-12 16:15:12.684434884 +1000
> Change: 2022-04-12 16:15:12.684434884 +1000
>  Birth: -
>
> If we change to the qemu process' mount namespace then we do see that
> /dev/dm-8 has the same device/inode as the symlink.
>
> # nsenter -m -t 136775 stat /dev/dm-8
>   File: /dev/dm-8
>   Size: 0          Blocks: 0        IO Block: 4096   block special file
> Device: 2ch/44d    Inode: 9         Links: 1     Device type: fd,8
> Access: (0660/brw-rw----)  Uid: (64055/libvirt-qemu)   Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
>  Birth: -
>
> Thanks for your time.
>
> - Craig