Re: [PATCH bpf-next v10 2/4] bpf: new helper to obtain namespace data from current task New bpf helper bpf_get_current_pidns_info.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Carlos,

Discussed with Eric today for what is the best way to get
the device number for a namespace. The following patch seems
a reasonable start although Eric would like to see
how the helper is used in order to decide whether the
interface looks right.

commit bb00fc36d5d263047a8bceb3e51e969d7fbce7db (HEAD -> fs2)
Author: Yonghong Song <yhs@xxxxxx>
Date:   Mon Sep 9 21:50:51 2019 -0700

     nsfs: add an interface function ns_get_inum_dev()

     This patch added an interface function
     ns_get_inum_dev(). Given a ns_common structure,
     the function returns the inode and device
     numbers. The function will be used later
     by a newly added bpf helper.

     Signed-off-by: Yonghong Song <yhs@xxxxxx>

diff --git a/fs/nsfs.c b/fs/nsfs.c
index a0431642c6b5..a603c6fc3f54 100644
--- a/fs/nsfs.c
+++ b/fs/nsfs.c
@@ -245,6 +245,14 @@ struct file *proc_ns_fget(int fd)
         return ERR_PTR(-EINVAL);
  }

+/* Get the device number for the current task pidns.
+ */
+void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev)
+{
+       *inum = ns->inum;
+       *dev = nsfs_mnt->mnt_sb->s_dev;
+}
+
  static int nsfs_show_path(struct seq_file *seq, struct dentry *dentry)
  {
         struct inode *inode = d_inode(dentry);
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..b8fc680cdf1a 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -81,6 +81,7 @@ extern void *ns_get_path(struct path *path, struct 
task_struct *task,
  typedef struct ns_common *ns_get_path_helper_t(void *);
  extern void *ns_get_path_cb(struct path *path, ns_get_path_helper_t 
ns_get_cb,
                             void *private_data);
+extern void ns_get_inum_dev(struct ns_common *ns, u32 *inum, dev_t *dev);

  extern int ns_get_name(char *buf, size_t size, struct task_struct *task,
                         const struct proc_ns_operations *ns_ops);

Could you put the above change and patch #1 and then have
all your other patches? In your kernel change, please use
interface function ns_get_inum_dev() to get pidns inode number
and dev number.

On 9/9/19 6:45 PM, Carlos Antonio Neira Bustos wrote:
> Thanks a lot, Al Viro and Yonghong for taking the time to review this patch and
> provide technical insights needed on this one.
> But how do we move this forward?
> Al Viro's review is clear that this will not work and we should strip the name
> resolution code (thanks for your detailed analysis).
> As there is currently only one instance of the nsfs device on the system,
> I think we could leave out the retrieval of the pidns device number and address it
> when the situation changes.
> What do you think?
> 
> 
> On Sat, Sep 07, 2019 at 06:34:39AM +0000, Yonghong Song wrote:
>>
>>
>> On 9/6/19 5:10 PM, Al Viro wrote:
>>> On Fri, Sep 06, 2019 at 11:21:14PM +0000, Yonghong Song wrote:
>>>
>>>> -bash-4.4$ readlink /proc/self/ns/pid
>>>> pid:[4026531836]
>>>> -bash-4.4$ stat /proc/self/ns/pid
>>>>      File: ‘/proc/self/ns/pid’ -> ‘pid:[4026531836]’
>>>>      Size: 0               Blocks: 0          IO Block: 1024   symbolic link
>>>> Device: 4h/4d   Inode: 344795989   Links: 1
>>>> Access: (0777/lrwxrwxrwx)  Uid: (128203/     yhs)   Gid: (  100/   users)
>>>> Context: user_u:base_r:base_t
>>>> Access: 2019-09-06 16:06:09.431616380 -0700
>>>> Modify: 2019-09-06 16:06:09.431616380 -0700
>>>> Change: 2019-09-06 16:06:09.431616380 -0700
>>>>     Birth: -
>>>> -bash-4.4$
>>>>
>>>> Based on a discussion with Eric Biederman back in 2019 Linux
>>>> Plumbers, Eric suggested that to uniquely identify a
>>>> namespace, device id (major/minor) number should also
>>>> be included. Although today's kernel implementation
>>>> has the same device for all namespace pseudo files,
>>>> but from uapi perspective, device id should be included.
>>>>
>>>> That is the reason why we try to get device id which holds
>>>> pid namespace pseudo file.
>>>>
>>>> Do you have a better suggestion on how to get
>>>> the device id for 'current' pid namespace? Or from design, we
>>>> really should not care about device id at all?
>>>
>>> What the hell is "device id for pid namespace"?  This is the
>>> first time I've heard about that mystery object, so it's
>>> hard to tell where it could be found.
>>>
>>> I can tell you what device numbers are involved in the areas
>>> you seem to be looking in.
>>>
>>> 1) there's whatever device number that gets assigned to
>>> (this) procfs instance.  That, ironically, _is_ per-pidns, but
>>> that of the procfs instance, not that of your process (and
>>> those can be different).  That's what you get in ->st_dev
>>> when doing lstat() of anything in /proc (assuming that
>>> procfs is mounted there, in the first place).  NOTE:
>>> that's lstat(2), not stat(2).  stat(1) uses lstat(2),
>>> unless given -L (in which case it's stat(2) time).  The
>>> difference:
>>>
>>> root@kvm1:~# stat /proc/self/ns/pid
>>>     File: /proc/self/ns/pid -> pid:[4026531836]
>>>     Size: 0               Blocks: 0          IO Block: 1024   symbolic link
>>> Device: 4h/4d   Inode: 17396       Links: 1
>>> Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Access: 2019-09-06 19:43:11.871312319 -0400
>>> Modify: 2019-09-06 19:43:11.871312319 -0400
>>> Change: 2019-09-06 19:43:11.871312319 -0400
>>>    Birth: -
>>> root@kvm1:~# stat -L /proc/self/ns/pid
>>>     File: /proc/self/ns/pid
>>>     Size: 0               Blocks: 0          IO Block: 4096   regular empty file
>>> Device: 3h/3d   Inode: 4026531836  Links: 1
>>> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
>>> Access: 2019-09-06 19:43:15.955313293 -0400
>>> Modify: 2019-09-06 19:43:15.955313293 -0400
>>> Change: 2019-09-06 19:43:15.955313293 -0400
>>>    Birth: -
>>>
>>> The former is lstat, the latter - stat.
>>>
>>> 2) device number of the filesystem where the symlink target lives.
>>> In this case, it's nsfs and there's only one instance on the entire
>>> system.  _That_ would be obtained by looking at st_dev in stat(2) on
>>> /proc/self/ns/pid (0:3 above).
>>>
>>> 3) device number *OF* the symlink.  That would be st_rdev in lstat(2).
>>> There's none - it's a symlink, not a character or block device.  It's
>>> always zero and always will be zero.
>>>
>>> 4) the same for the target; st_rdev in stat(2) results and again,
>>> there's no such beast - it's neither character nor block device.
>>>
>>> Your code is looking at (3).  Please, reread any textbook on Unix
>>> in the section that would cover stat(2) and discussion of the
>>> difference between st_dev and st_rdev.
>>>
>>> I have no idea what Eric had been talking about - it's hard to
>>> reconstruct by what you said so far.  Making nsfs per-userns,
>>> perhaps?  But that makes no sense whatsoever, not that userns
>>> ever had...  Cheap shots aside, I really can't guess what that's
>>> about.  Sorry.
>>
>> Thanks for the detailed information. The device number we want
>> is nsfs. Indeed, currently, there is only one instance
>> on the entire system. But not exactly sure what is the possibility
>> to have more than one nsfs device in the future. Maybe per-userns
>> or any other criteria?
>>
>>>
>>> In any case, pathname resolution is *NOT* for the situations where
>>> you can't block.  Even if it's procfs (and from the same pidns as
>>> the process) mounted there, there is no promise that the target
>>> of /proc/self has already been looked up and not evicted from
>>> memory since then.  And in case of cache miss pathwalk will
>>> have to call ->lookup(), which requires locking the directory
>>> (rw_sem, shared).  You can't do that in such context.
>>>
>>> And that doesn't even go into the possibility that process has
>>> something very different mounted on /proc.
>>>
>>> Again, I don't know what it is that you want to get to, but
>>> I would strongly recommend finding a way to get to that data
>>> that would not involve going anywhere near pathname resolution.
>>>
>>> How would you expect the userland to work with that value,
>>> whatever it might be?  If it's just a 32bit field that will
>>> never be read, you might as well store there the same value
>>> you store now (0, that is) in much cheaper and safer way ;-)
>>
>> Suppose inside pid namespace, user can pass the device number,
>> say n1, (`stat -L /proc/self/ns/pid`) to bpf program (through map
>> or JIT). At runtime, bpf program will try to get device number,
>> say n2, for the 'current' process. If n1 is not the same as
>> n2, that means they are not in the same namespace. 'current'
>> is in the same pid namespace as the user iff
>> n1 == n2 and also pidns id is the same for 'current' and
>> the one with `lsns -t pid`.
>>
>> Are you aware of any way to get the pidns device number
>> for 'current' without going through the pathname
>> lookup?
>>




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux