Re: procfs: mnt namespace behaviour with block devices (resend)

Hi Craig,

On 5/9/22 03:20, Craig Small wrote:
> (resending as plain text as the first got bounced)
> 
> Hi,
>   I'm the maintainer of the psmisc package that provides system tools
> for things like fuser and killall. I am trying to establish if
> something I have found with the proc filesystem is as intended
> (knowing why would be nice) or if it's a strange corner-case bug.
> 
> Apologies to the non-procfs maintainers but these two lists are what
> MAINTAINER said to go to. If you could CC me on replies that would be
> great.
> 
> The proc file descriptor for a block device mounted in a different
> namespace will show the device id of that different namespace and not
> the device id of the process stat()ing the file.
> 
> The issue came up in fuser not finding certain processes that were
> directly accessing a block device, see
> https://gitlab.com/psmisc/psmisc/-/issues/39 Programs such as lsof are
> caught by this too.
> 
> My question is: When I am in the bash mount namespace (4026531840 below)
> then shouldn't all the device IDs be from that namespace? In other
> words, the device id of the dereferenced symlink and what it points to
> are the same (device id 5) and not symlink has 44 and /dev/dm-8 has 5.
I'm no expert here, but I think this is working as intended.
It's definitely confusing!

Consider a process in a mount namespace separate from the init mount
namespace, e.g. a container. Say I start Python in that container and
run `os.open("/etc/passwd", os.O_RDONLY)`. If I then look at that
process's file descriptors from the host's perspective, I see the
following (pid 220854 is the Python process in the container):

$ ls -lah /proc/220854/fd/
total 0
dr-x------ 2 stepbren stepbren  0 May  9 11:06 .
dr-xr-xr-x 9 stepbren stepbren  0 May  9 11:06 ..
lrwx------ 1 stepbren stepbren 64 May  9 11:06 0 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May  9 11:06 1 -> /dev/pts/0
lrwx------ 1 stepbren stepbren 64 May  9 11:06 2 -> /dev/pts/0
lr-x------ 1 stepbren stepbren 64 May  9 11:06 3 -> /etc/passwd

$ cat /proc/220854/fd/3
<contents of container /etc/passwd>

$ cat /etc/passwd
<contents of host /etc/passwd>

$ stat -L /proc/220854/fd/3
  File: /proc/220854/fd/3
  Size: 900             Blocks: 8          IO Block: 4096   regular file
Device: 4eh/78d Inode: 5508982     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-27 10:24:28.000000000 -0700
Modify: 2020-10-27 10:24:28.000000000 -0700
Change: 2020-10-27 10:24:30.255374190 -0700
 Birth: 2020-10-27 10:24:30.255374190 -0700

$ stat /etc/passwd
  File: /etc/passwd
  Size: 3216            Blocks: 8          IO Block: 4096   regular file
Device: fd01h/64769d    Inode: 24917416    Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-05-08 15:06:18.837117765 -0700
Modify: 2021-11-30 09:08:45.163873193 -0800
Change: 2021-11-30 09:08:45.167873237 -0800
 Birth: 2021-11-30 09:08:45.163873193 -0800

## INSIDE CONTAINER'S MOUNT NAMESPACE
$ stat /etc/passwd
  File: /etc/passwd
  Size: 900             Blocks: 8          IO Block: 4096   regular file
Device: 4eh/78d Inode: 5508982     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-27 17:24:28.000000000 +0000
Modify: 2020-10-27 17:24:28.000000000 +0000
Change: 2020-10-27 17:24:30.255374190 +0000
 Birth: -

As you can see, it's the same behavior: the path /etc/passwd resolves to
a different inode in the init mount namespace than in the container's
mount namespace. The secret sauce of the /proc/$pid/fd/$fd files is that
they don't behave like normal symlinks: instead of using the path string
to look up the target inode, the kernel looks up the open file (and thus
its inode) directly in the target process's file descriptor table.
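
To make that concrete without needing root or a second mount namespace,
here is a minimal Python sketch (my own illustration, not taken from
psmisc or the kernel). It assumes Linux with /proc mounted. It stands in
for the namespace case by renaming an open file and creating a new file
at the old path, so the path and the open descriptor refer to different
inodes. The function names are hypothetical.

```python
import os
import tempfile


def identity(st):
    """Return the (device, inode) pair that identifies a file."""
    return (st.st_dev, st.st_ino)


def demo():
    """Return (link_matches_fd, path_matches_fd) for a replaced file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "passwd")
        with open(path, "w") as f:
            f.write("original\n")

        fd = os.open(path, os.O_RDONLY)
        try:
            # Move the open file aside and put a new file at the same
            # path. The old inode stays alive, so the new one differs.
            os.rename(path, path + ".old")
            with open(path, "w") as f:
                f.write("replacement\n")

            held = identity(os.fstat(fd))
            # stat() follows the /proc link to the open file itself.
            via_link = identity(os.stat(f"/proc/self/fd/{fd}"))
            # stat() on the original path finds the replacement file.
            via_path = identity(os.stat(path))
            return via_link == held, via_path == held
        finally:
            os.close(fd)


if __name__ == "__main__":
    print(demo())
```

The first value should be true and the second false: stat() through the
/proc link follows the descriptor, not the name the file was opened
under.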

When you do a readlink(), the kernel has to create a path string, and it
has to do it from the perspective of the mount namespace of $pid, not
your monitoring command. The reason is that there may not even be a
corresponding path outside the mount namespace of $pid. Imagine I
created and opened "/etc/foobar" inside the container: that file may not
exist outside the container, so how could readlink() make a path
specific to your mount namespace?
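
A related case that is easy to try as an unprivileged user is an open
file that has been unlinked: there is no valid path for it in any
namespace, yet the /proc link still dereferences. The sketch below is
only an illustration under that assumption (Linux, /proc mounted); the
function name is hypothetical. On the kernels I have looked at, the
readlink() text carries a " (deleted)" suffix in this situation, but
the code does not depend on that.

```python
import os
import tempfile


def readlink_vs_stat():
    """Compare the readlink() text of an fd link with what stat() sees."""
    fd, path = tempfile.mkstemp()
    try:
        # Remove the only name of the file; the descriptor keeps it alive.
        os.unlink(path)

        link = f"/proc/self/fd/{fd}"
        # The text readlink() reports for the descriptor.
        target = os.readlink(link)
        # Does that text resolve as an ordinary path?
        target_resolves = os.path.exists(target)
        # Does dereferencing the link itself reach the open file?
        link_resolves = os.stat(link).st_ino == os.fstat(fd).st_ino
        return target, target_resolves, link_resolves
    finally:
        os.close(fd)


if __name__ == "__main__":
    print(readlink_vs_stat())
```

So a tool that re-resolves the readlink() text can come up empty (or
hit a different file) even though the link itself still leads to the
right one.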

Hopefully this helps, but maybe I'm off base and missing the thrust of
your question; let me know either way.

Stephen

> 
> I get that if I could look at the device IDs in qemu or use nsenter to
> switch to its namespace, then the device should be 44 for the symlink
> and device (which it is and seems correct to me).
> 
> How to replicate
> =============
> # uname -a
> Linux elmo 5.16.0-5-amd64 #1 SMP PREEMPT Debian 5.16.14-1 (2022-03-15)
> x86_64 GNU/Linux
> 
> The easiest way to replicate this is to make a qemu virtual machine and
> have it mount a block device. I suspect there are other ways, but I
> don't have many things that mount a device and switch namespaces. The
> qemu process (here it is 136775) will have a different mount namespace.
> 
> # ps -o pid,mntns,comm $$ 136775
>     PID      MNTNS COMMAND
>  136775 4026532762 qemu-system-x86
>  142359 4026531840 bash
> 
> File descriptor 23 is what qemu is using to mount the block device
> # ls -l /proc/136775/fd/23
> lrwx------ 1 libvirt-qemu libvirt-qemu 64 Apr 12 16:34
> /proc/136775/fd/23 -> /dev/dm-8
> 
> However, the dereferenced symlink and where the symlink points to show
> different data.
> 
> # stat -L /proc/136775/fd/23
>   File: /proc/136775/fd/23
>   Size: 0         Blocks: 0          IO Block: 4096   block special file
> Device: 2ch/44d Inode: 9           Links: 1     Device type: fd,8
> Access: (0660/brw-rw----)  Uid: (64055/libvirt-qemu)   Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
>  Birth: -
> 
> # stat /dev/dm-8
>   File: /dev/dm-8
>   Size: 0         Blocks: 0          IO Block: 4096   block special file
> Device: 5h/5d Inode: 348         Links: 1     Device type: fd,8
> Access: (0660/brw-rw----)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2022-04-12 16:15:12.684434884 +1000
> Modify: 2022-04-12 16:15:12.684434884 +1000
> Change: 2022-04-12 16:15:12.684434884 +1000
>  Birth: -
> 
> If we change to the qemu process' mount namespace then we do see that
> /dev/dm-8 has the same device/inode as the symlink.
> 
> # nsenter -m -t 136775 stat /dev/dm-8
>   File: /dev/dm-8
>   Size: 0         Blocks: 0          IO Block: 4096   block special file
> Device: 2ch/44d Inode: 9           Links: 1     Device type: fd,8
> Access: (0660/brw-rw----)  Uid: (64055/libvirt-qemu)   Gid: (64055/libvirt-qemu)
> Access: 2022-04-12 16:34:25.687147886 +1000
> Modify: 2022-04-12 16:34:25.519151533 +1000
> Change: 2022-04-12 16:34:25.595149882 +1000
>  Birth: -
> 
> Thanks for your time.
> 
>  - Craig



