Re: [PATCH bpf-next v1 1/9] bpf: Add mkdir, rmdir, unlink syscalls for prog_bpf_syscall

Hao Luo <haoluo@xxxxxxxxxx> · Thu, 3 Mar 2022 10:50:48 -0800

On Wed, Mar 2, 2022 at 11:34 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Mon, Feb 28, 2022 at 02:10:39PM -0800, Hao Luo wrote:
> > Hi Kumar,
> >
> > On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi
> > <memxor@xxxxxxxxx> wrote:
> > >
> > > On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote:
> > > > This patch allows bpf_syscall prog to perform some basic filesystem
> > > > operations: create, remove directories and unlink files. Three bpf
> > > > helpers are added for this purpose. When combined with the following
> > > > patches that allow pinning and getting bpf objects from bpf prog,
> > > > this feature can be used to create directory hierarchy in bpffs that
> > > > help manage bpf objects purely using bpf progs.
> > > >
> > > > The added helpers subject to the same permission checks as their syscall
> > > > version. For example, one can not write to a read-only file system;
> > > > The identity of the current process is checked to see whether it has
> > > > sufficient permission to perform the operations.
> > > >
> > > > Only directories and files in bpffs can be created or removed by these
> > > > helpers. But it won't be too hard to allow these helpers to operate
> > > > on files in other filesystems, if we want.
> > > >
> > > > Signed-off-by: Hao Luo <haoluo@xxxxxxxxxx>
> > > > ---
> > > > + *
> > > > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode)
> > > > + *   Description
> > > > + *           Attempts to create a directory name *pathname*. The argument
> > > > + *           *pathname_sz* specifies the length of the string *pathname*.
> > > > + *           The argument *mode* specifies the mode for the new directory. It
> > > > + *           is modified by the process's umask. It has the same semantic as
> > > > + *           the syscall mkdir(2).
> > > > + *   Return
> > > > + *           0 on success, or a negative error in case of failure.
> > > > + *
> > > > + * long bpf_rmdir(const char *pathname, int pathname_sz)
> > > > + *   Description
> > > > + *           Deletes a directory, which must be empty.
> > > > + *   Return
> > > > + *           0 on sucess, or a negative error in case of failure.
> > > > + *
> > > > + * long bpf_unlink(const char *pathname, int pathname_sz)
> > > > + *   Description
> > > > + *           Deletes a name and possibly the file it refers to. It has the
> > > > + *           same semantic as the syscall unlink(2).
> > > > + *   Return
> > > > + *           0 on success, or a negative error in case of failure.
> > > >   */
> > > >
> > >
> > > How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be
> > > more useful for other cases in future, and when AT_FDCWD is passed, has the same
> > > functionality as these, but when openat/fget is supported, it would work
> > > relative to other dirfds as well. It can also allow using dirfd of the process
> > > calling read for a iterator (e.g. if it sets the fd number using skel->bss).
> > > unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir.
> > >
> > > WDYT?
> > >
> >
> > The idea sounds good to me, more flexible. But I don't have a real use
> > case for using the added 'dirfd' at this moment. For all the use cases
> > I can think of, absolute paths will suffice, I think. Unless other
> > reviewers have opposition, I will try switching to mkdirat and
> > unlinkat in v2.
>
> I'm surprised you don't need "at" variants.
> I thought your production setup has a top level cgroup controller and
> then inner tasks inside containers manage cgroups on their own.
> Since containers are involved they likely run inside their own mountns.
> cgroupfs mount is single. So you probably don't even need to bind mount it
> inside containers, but bpffs is not a single mount. You need
> to bind mount top bpffs inside containers for tasks to access it.
> Now for cgroupfs the abs path is not an issue, but for bpffs
> the AT_FDCWD becomes a problem. AT_FDCWD is using current mount ns.
> Inside container that will be different. Unless you bind mount into exact
> same path the full path has different meanings inside and outside of the container.
> It seems to me the bpf progs attached to cgroup sleepable events should
> be using FD of bpffs. Then when these tracepoints are triggered from
> different containers in different mountns they will get the right dir prefix.
> What am I missing?
>

Alexei, you are perfectly right. To be honest, I failed to see the
fact that the sleepable tp prog is in the container's mount ns. I
think we can bind mount the top bpffs into exactly the same path
inside container, right? But I haven't tested this idea. Passing FDs
should be better.

> I think non-AT variants are not needed. The prog can always pass AT_FDCWD
> if it's really the intent, but passing actual FD seems more error-proof.

Let's have the AT version. Passing FD seems the right approach, when
we use it in the context of container.