On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: Thanks a lot Yonghong. I'll include this patch when submitting changes for version 11 of this patch. > > Carlos, > > Discussed with Eric today for what is the best way to get > the device number for a namespace. The following patch seems > a reasonable start although Eric would like to see > how the helper is used in order to decide whether the > interface looks right. > > commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) > Author: Yonghong Song <yhs@xxxxxx> > Date: Mon Sep 9 21:50:51 2019 -0700 > > nsfs: add an interface function ns_get_inum_dev() > > This patch added an interface function > ns_get_inum_dev(). Given a ns_common structure, > the function returns the inode and device > numbers. The function will be used later > by a newly added bpf helper. > > Signed-off-by: Yonghong Song <yhs@xxxxxx> > > diff --git a/fs/nsfs.c b/fs/nsfs.c > index a0431642c6b5..a603c6fc3f54 100644 > --- a/fs/nsfs.c > +++ b/fs/nsfs.c > @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) > return ERR_PTR(-EINVAL); > } > > +/* Get the device number for the current task pidns. > + */ > +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) > +{ > + *inum = ns->inum; > + *dev = nsfs_mnt->mnt_sb->s_dev; > +} > + > static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) > { > struct inode *inode = d_inode(dentry); > diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h > index d31cb6215905..b8fc680cdf1a 100644 > --- a/include/linux/proc_ns.h > +++ b/include/linux/proc_ns.h > @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct > task_struct *task, > typedef struct ns_common *ns_get_path_helper_t(void *); > extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t > ns_get_cb, > void *private_data); > +extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev); > > extern int ns_get_name(char *buf, size_t size, struct task_struct *task, > const struct proc_ns_operations *ns_ops); > > Could you put the above change and patch #1 and then have > all your other patches? In your kernel change, please use > interface function ns_get_inum_dev() to get pidns inode number > and dev number. > > On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote: > > Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and > > provide technical insights needed on this one. > > But how do we move this forward? > > Al Viro's review is clear that this will not work and we should strip the name > > resolution code (thanks for your detailed analysis). > > As there is currently only one instance of the nsfs device on the system, > > I think we could leave out the retrieval of the pidns device number and address it > > when the situation changes. > > What do you think? > > > > > > On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote: > >> > >> > >> On 9/6/19 5:10 PM, Al Viro wrote: > >>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote: > >>> > >>>> -bash-4.4$ readlink /proc/self/ns/pid > >>>> pid:[4026531836] > >>>> -bash-4.4$ stat /proc/self/ns/pid > >>>> File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’ > >>>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link > >>>> Device: 4h/4d Inode: 344795989 Links: 1 > >>>> Access: (0777/lrwxrwxrwx) Uid: (128203/ yhs) Gid: ( 100/ users) > >>>> Context: user_u:base_r:base_t > >>>> Access: 2019-09-06 16:06:09.431616380 -0700 > >>>> Modify: 2019-09-06 16:06:09.431616380 -0700 > >>>> Change: 2019-09-06 16:06:09.431616380 -0700 > >>>> Birth: - > >>>> -bash-4.4$ > >>>> > >>>> Based on a discussion with Eric Biederman back in 2019 Linux > >>>> Plumbers, Eric suggested that to uniquely identify a > >>>> namespace, device id (major/minor) number should also > >>>> be included. Although today's kernel implementation > >>>> has the same device for all namespace pseudo files, > >>>> but from uapi perspective, device id should be included. > >>>> > >>>> That is the reason why we try to get device id which holds > >>>> pid namespace pseudo file. > >>>> > >>>> Do you have a better suggestion on how to get > >>>> the device id for 'current' pid namespace? Or from design, we > >>>> really should not care about device id at all? > >>> > >>> What the hell is "device id for pid namespace"? This is the > >>> first time I've heard about that mystery object, so it's > >>> hard to tell where it could be found. > >>> > >>> I can tell you what device numbers are involved in the areas > >>> you seem to be looking in. > >>> > >>> 1) there's whatever device number that gets assigned to > >>> (this) procfs instance. That, ironically, _is_ per-pidns, but > >>> that of the procfs instance, not that of your process (and > >>> those can be different). That's what you get in ->st_dev > >>> when doing lstat() of anything in /proc (assuming that > >>> procfs is mounted there, in the first place). NOTE: > >>> that's lstat(2), not stat(2). stat(1) uses lstat(2), > >>> unless given -L (in which case it's stat(2) time). The > >>> difference: > >>> > >>> root@kvm1:~# stat /proc/self/ns/pid > >>> File: /proc/self/ns/pid -> pid:[4026531836] > >>> Size: 0 Blocks: 0 IO Block: 1024 symbolic link > >>> Device: 4h/4d Inode: 17396 Links: 1 > >>> Access: (0777/lrwxrwxrwx) Uid: ( 0/ root) Gid: ( 0/ root) > >>> Access: 2019-09-06 19:43:11.871312319 -0400 > >>> Modify: 2019-09-06 19:43:11.871312319 -0400 > >>> Change: 2019-09-06 19:43:11.871312319 -0400 > >>> Birth: - > >>> root@kvm1:~# stat -L /proc/self/ns/pid > >>> File: /proc/self/ns/pid > >>> Size: 0 Blocks: 0 IO Block: 4096 regular empty file > >>> Device: 3h/3d Inode: 4026531836 Links: 1 > >>> Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root) > >>> Access: 2019-09-06 19:43:15.955313293 -0400 > >>> Modify: 2019-09-06 19:43:15.955313293 -0400 > >>> Change: 2019-09-06 19:43:15.955313293 -0400 > >>> Birth: - > >>> > >>> The former is lstat, the latter - stat. > >>> > >>> 2) device number of the filesystem where the symlink target lives. > >>> In this case, it's nsfs and there's only one instance on the entire > >>> system. _That_ would be obtained by looking at st_dev in stat(2) on > >>> /proc/self/ns/pid (0:3 above). > >>> > >>> 3) device number *OF* the symlink. That would be st_rdev in lstat(2). > >>> There's none - it's a symlink, not a character or block device. It's > >>> always zero and always will be zero. > >>> > >>> 4) the same for the target; st_rdev in stat(2) results and again, > >>> there's no such beast - it's neither character nor block device. > >>> > >>> Your code is looking at (3). Please, reread any textbook on Unix > >>> in the section that would cover stat(2) and discussion of the > >>> difference between st_dev and st_rdev. > >>> > >>> I have no idea what Eric had been talking about - it's hard to > >>> reconstruct by what you said so far. Making nsfs per-userns, > >>> perhaps? But that makes no sense whatsoever, not that userns > >>> ever had... Cheap shots aside, I really can't guess what that's > >>> about. Sorry. > >> > >> Thanks for the detailed information. The device number we want > >> is nsfs. Indeed, currently, there is only one instance > >> on the entire system. But not exactly sure what is the possibility > >> to have more than one nsfs device in the future. Maybe per-userns > >> or any other criteria? > >> > >>> > >>> In any case, pathname resolution is *NOT* for the situations where > >>> you can't block. Even if it's procfs (and from the same pidns as > >>> the process) mounted there, there is no promise that the target > >>> of /proc/self has already been looked up and not evicted from > >>> memory since then. And in case of cache miss pathwalk will > >>> have to call ->lookup(), which requires locking the directory > >>> (rw_sem, shared). You can't do that in such context. > >>> > >>> And that doesn't even go into the possibility that process has > >>> something very different mounted on /proc. > >>> > >>> Again, I don't know what it is that you want to get to, but > >>> I would strongly recommend finding a way to get to that data > >>> that would not involve going anywhere near pathname resolution. > >>> > >>> How would you expect the userland to work with that value, > >>> whatever it might be? If it's just a 32bit field that will > >>> never be read, you might as well store there the same value > >>> you store now (0, that is) in much cheaper and safer way ;-) > >> > >> Suppose inside pid namespace, user can pass the device number, > >> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map > >> or JIT). At runtime, bpf program will try to get device number, > >> say n2, for the 'current' process. If n1 is not the same as > >> n2, that means they are not in the same namespace. 'current' > >> is in the same pid namespace as the user iff > >> n1 == n2 and also pidns id is the same for 'current' and > >> the one with `lsns -t pid`. > >> > >> Are you aware of any way to get the pidns device number > >> for 'current' without going through the pathname > >> lookup? > >>