Carlos, Discussed with Eric today for what is the best way to get the device number for a namespace. The following patch seems a reasonable start although Eric would like to see how the helper is used in order to decide whether the interface looks right. commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) Author: Yonghong Song <yhs@xxxxxx> Date: Mon Sep 9 21:50:51 2019 -0700 nsfs: add an interface function ns_get_inum_dev() This patch added an interface function ns_get_inum_dev(). Given a ns_common structure, the function returns the inode and device numbers. The function will be used later by a newly added bpf helper. Signed-off-by: Yonghong Song <yhs@xxxxxx> diff --git a/fs/nsfs.c b/fs/nsfs.c index a0431642c6b5..a603c6fc3f54 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) return ERR_PTR(-EINVAL); } +/* Get the device number for the current task pidns. + */ +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) +{ + *inum = ns->inum; + *dev = nsfs_mnt->mnt_sb->s_dev; +} + static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) { struct inode *inode = d_inode(dentry); diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index d31cb6215905..b8fc680cdf1a 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct task_struct *task, typedef struct ns_common *ns_get_path_helper_t(void *); extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, void *private_data); +extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev); extern int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops); Could you put the above change and patch #1 and then have all your other patches? In your kernel change, please use interface function ns_get_inum_dev() to get pidns inode number and dev number. On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote: > Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and > provide technical insights needed on this one. > But how do we move this forward? > Al Viro's review is clear that this will not work and we should strip the name > resolution code (thanks for your detailed analysis). > As there is currently only one instance of the nsfs device on the system, > I think we could leave out the retrieval of the pidns device number and address it > when the situation changes. > What do you think? > > > On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote: >> >> >> On 9/6/19 5:10 PM, Al Viro wrote: >>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: >>> >>>> -bash-4.4$ readlink /proc/self/ns/pid >>>> pid:[4026531836] >>>> -bash-4.4$ stat /proc/self/ns/pid >>>> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ >>>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link >>>> Device: 4h/4d Inode: 344795989 Links: 1 >>>> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) >>>> Context: user_u:base_r:base_t >>>> Access: 2019-09-06 16:06:09.431616380 -0700 >>>> Modify: 2019-09-06 16:06:09.431616380 -0700 >>>> Change: 2019-09-06 16:06:09.431616380 -0700 >>>> Birth: - >>>> -bash-4.4$ >>>> >>>> Based on a discussion with Eric Biederman back in 2019 Linux >>>> Plumbers, Eric suggested that to uniquely identify a >>>> namespace, device id (major/minor) number should also >>>> be included. Although today's kernel implementation >>>> has the same device for all namespace pseudo files, >>>> but from uapi perspective, device id should be included. >>>> >>>> That is the reason why we try to get device id which holds >>>> pid namespace pseudo file. >>>> >>>> Do you have a better suggestion on how to get >>>> the device id for 'current' pid namespace? Or from design, we >>>> really should not care about device id at all? >>> >>> What the hell is "device id for pid namespace"? This is the >>> first time I've heard about that mystery object, so it's >>> hard to tell where it could be found. >>> >>> I can tell you what device numbers are involved in the areas >>> you seem to be looking in. >>> >>> 1) there's whatever device number that gets assigned to >>> (this) procfs instance. That, ironically, _is_ per-pidns, but >>> that of the procfs instance, not that of your process (and >>> those can be different). That's what you get in ->st_dev >>> when doing lstat() of anything in /proc (assuming that >>> procfs is mounted there, in the first place). NOTE: >>> that's lstat(2), not stat(2). stat(1) uses lstat(2), >>> unless given -L (in which case it's stat(2) time). The >>> difference: >>> >>> root@kvm1:~# stat /proc/self/ns/pid >>> File: /proc/self/ns/pid -> pid:[4026531836] >>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link >>> Device: 4h/4d Inode: 17396 Links: 1 >>> Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) >>> Access: 2019-09-06 19:43:11.871312319 -0400 >>> Modify: 2019-09-06 19:43:11.871312319 -0400 >>> Change: 2019-09-06 19:43:11.871312319 -0400 >>> Birth: - >>> root@kvm1:~# stat -L /proc/self/ns/pid >>> File: /proc/self/ns/pid >>> Size: 0 Blocks: 0 IO Block: 4096 regular empty file >>> Device: 3h/3d Inode: 4026531836 Links: 1 >>> Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) >>> Access: 2019-09-06 19:43:15.955313293 -0400 >>> Modify: 2019-09-06 19:43:15.955313293 -0400 >>> Change: 2019-09-06 19:43:15.955313293 -0400 >>> Birth: - >>> >>> The former is lstat, the latter - stat. >>> >>> 2) device number of the filesystem where the symlink target lives. >>> In this case, it's nsfs and there's only one instance on the entire >>> system. _That_ would be obtained by looking at st_dev in stat(2) on >>> /proc/self/ns/pid (0:3 above). >>> >>> 3) device number *OF* the symlink. That would be st_rdev in lstat(2). >>> There's none - it's a symlink, not a character or block device. It's >>> always zero and always will be zero. >>> >>> 4) the same for the target; st_rdev in stat(2) results and again, >>> there's no such beast - it's neither character nor block device. >>> >>> Your code is looking at (3). Please, reread any textbook on Unix >>> in the section that would cover stat(2) and discussion of the >>> difference between st_dev and st_rdev. >>> >>> I have no idea what Eric had been talking about - it's hard to >>> reconstruct by what you said so far. Making nsfs per-userns, >>> perhaps? But that makes no sense whatsoever, not that userns >>> ever had... Cheap shots aside, I really can't guess what that's >>> about. Sorry. >> >> Thanks for the detailed information. The device number we want >> is nsfs. Indeed, currently, there is only one instance >> on the entire system. But not exactly sure what is the possibility >> to have more than one nsfs device in the future. Maybe per-userns >> or any other criteria? >> >>> >>> In any case, pathname resolution is *NOT* for the situations where >>> you can't block. Even if it's procfs (and from the same pidns as >>> the process) mounted there, there is no promise that the target >>> of /proc/self has already been looked up and not evicted from >>> memory since then. And in case of cache miss pathwalk will >>> have to call ->lookup(), which requires locking the directory >>> (rw_sem, shared). You can't do that in such context. >>> >>> And that doesn't even go into the possibility that process has >>> something very different mounted on /proc. >>> >>> Again, I don't know what it is that you want to get to, but >>> I would strongly recommend finding a way to get to that data >>> that would not involve going anywhere near pathname resolution. >>> >>> How would you expect the userland to work with that value, >>> whatever it might be? If it's just a 32bit field that will >>> never be read, you might as well store there the same value >>> you store now (0, that is) in much cheaper and safer way ;-) >> >> Suppose inside pid namespace, user can pass the device number, >> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map >> or JIT). At runtime, bpf program will try to get device number, >> say n2, for the 'current' process. If n1 is not the same as >> n2, that means they are not in the same namespace. 'current' >> is in the same pid namespace as the user iff >> n1 == n2 and also pidns id is the same for 'current' and >> the one with `lsns -t pid`. >> >> Are you aware of any way to get the pidns device number >> for 'current' without going through the pathname >> lookup? >>