On 9/11/19 9:16 AM, Eric W. Biederman wrote: > Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > >> On Tue, Sep 10, 2019 at 10:35:09PM +0000, Yonghong Song wrote: >>> >>> Carlos, >>> >>> Discussed with Eric today for what is the best way to get >>> the device number for a namespace. The following patch seems >>> a reasonable start although Eric would like to see >>> how the helper is used in order to decide whether the >>> interface looks right. >>> >>> commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2) >>> Author: Yonghong Song <yhs@xxxxxx> >>> Date: Mon Sep 9 21:50:51 2019 -0700 >>> >>> nsfs: add an interface function ns_get_inum_dev() >>> >>> This patch added an interface function >>> ns_get_inum_dev(). Given a ns_common structure, >>> the function returns the inode and device >>> numbers. The function will be used later >>> by a newly added bpf helper. >>> >>> Signed-off-by: Yonghong Song <yhs@xxxxxx> >>> >>> diff --git a/fs/nsfs.c b/fs/nsfs.c >>> index a0431642c6b5..a603c6fc3f54 100644 >>> --- a/fs/nsfs.c >>> +++ b/fs/nsfs.c >>> @@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd) >>> return ERR_PTR(-EINVAL); >>> } >>> >>> +/* Get the device number for the current task pidns. >>> + */ >>> +void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev) >>> +{ >>> + *inum = ns->inum; >>> + *dev = nsfs_mnt->mnt_sb->s_dev; >>> +} >> >> Umm... Where would it get the device number once we get (hell knows >> what for) multiple nsfs instances? I still don't understand what >> would that be about, TBH... Is it really per-userns? Or something >> else entirely? Eric, could you give some context? > > My goal is not to paint things into a corner, with future changes. > Right now it is possible to stat a namespace file descriptor and > get a device and inode number. Then compare that. > > I don't want people using the inode number in nsfd as some magic > namespace id. > > We have had times in the past where there was more than one superblock > and thus more than one device number. Further if userspace ever uses > this heavily there may be times in the future where for > checkpoint/restart purposes we will want multiple nsfd's so we can > preserve the inode number accross a migration. > > Realistically there will probably just some kind of hotplug notification > to userspace to say we have hotplugged your operatining system as > a migration notification. > > Now the halway discussion did not quite capture everything I was trying > to say but it at least got to the right ballpark. > > The helper in fs/nsfs.c should be: > > bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) > { > return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); > } > > That way if/when there are multiple inodes identifying the same > namespace the bpf programs don't need to change. Thanks, Eric. This is indeed better. The bpf helper should focus on comparing dev/ino, instead of return the dev/ino to bpf program. So overall, nsfs related change will look like: diff --git a/fs/nsfs.c b/fs/nsfs.c index a0431642c6b5..7e78d89c2172 100644 --- a/fs/nsfs.c +++ b/fs/nsfs.c @@ -245,6 +245,11 @@ struct file *proc_ns_fget(int fd) return ERR_PTR(-EINVAL); } +bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino) +{ + return ((ns->inum == ino) && (nsfs_mnt->mnt_sb->s_dev == dev)); +} + static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry) { struct inode *inode = d_inode(dentry); diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h index d31cb6215905..79639807e960 100644 --- a/include/linux/proc_ns.h +++ b/include/linux/proc_ns.h @@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct task_struct *task, typedef struct ns_common *ns_get_path_helper_t(void *); extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t ns_get_cb, void *private_data); +extern bool ns_match(const struct ns_common *ns, dev_t dev, ino_t ino); extern int ns_get_name(char *buf, size_t size, struct task_struct *task, const struct proc_ns_operations *ns_ops); > > Up farther in the stack it should be something like: > >> BPF_CALL_2(bpf_current_pidns_match, dev_t *dev, ino_t *ino) >> { >> return ns_match(&task_active_pid_ns(current)->ns, *dev, *ino); >> } >> >> const struct bpf_func_proto bpf_current_pidns_match_proto = { >> .func = bpf_current_pins_match, >> .gpl_only = true, >> .ret_type = RET_INTEGER >> .arg1_type = ARG_PTR_TO_DEVICE_NUMBER, >> .arg2_type = ARG_PTR_TO_INODE_NUMBER, >> }; > > That allows comparing what the bpf came up with with whatever value > userspace generated by stating the file descriptor. > > > That is the least bad suggestion I currently have for that > functionality. It really would be better to not have that filter in the > bpf program itself but in the infrastructure that binds a program to a > set of tasks. > > The problem with this approach is whatever device/inode you have when > the namespace they refer to exits there is the possibility that the > inode will be reused. So your filter will eventually start matching on > the wrong thing. I come up with a differeent helper definition, which is much more similar to existing bpf_get_current_pid_tgid() and helper definition much more conforms to bpf convention. diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index 5e28718928ca..bc26903c80c7 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -11,6 +11,8 @@ #include <linux/uidgid.h> #include <linux/filter.h> #include <linux/ctype.h> +#include <linux/pid_namespace.h> +#include <linux/proc_ns.h> #include "../../lib/kstrtox.h" @@ -487,3 +489,33 @@ const struct bpf_func_proto bpf_strtoul_proto = { .arg4_type = ARG_PTR_TO_LONG, }; #endif + +BPF_CALL_2(bpf_get_ns_current_pid_tgid, u32, dev, u32, inum) +{ + struct task_struct *task = current; + struct pid_namespace *pidns; + pid_t pid, tgid; + + if (unlikely(!task)) + return -EINVAL; + + + pidns = task_active_pid_ns(task); + if (unlikely(!pidns)) + return -ENOENT; + + if (!ns_match(&pidns->ns, (dev_t)dev, inum)) + return -EINVAL; + + pid = task_pid_nr_ns(task, pidns); + tgid = task_tgid_nr_ns(task, pidns); + + return (u64) tgid << 32 | pid; +} + +const struct bpf_func_proto bpf_get_ns_current_pid_tgid_proto = { + .func = bpf_get_ns_current_pid_tgid, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_ANYTHING, + .arg2_type = ARG_ANYTHING, +}; Existing usage of bpf_get_current_pid_tgid() can be converted to bpf_get_ns_current_pid_tgid() if ns dev/inode number is supplied. For bpf_get_ns_current_pid_tgid(), checking return value ( < 0 or not) is needed. > > Eric >