On Wed, Mar 2, 2022 at 11:34 AM Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote: > > On Mon, Feb 28, 2022 at 02:10:39PM -0800, Hao Luo wrote: > > Hi Kumar, > > > > On Sat, Feb 26, 2022 at 9:18 PM Kumar Kartikeya Dwivedi > > <memxor@xxxxxxxxx> wrote: > > > > > > On Sat, Feb 26, 2022 at 05:13:31AM IST, Hao Luo wrote: > > > > This patch allows bpf_syscall prog to perform some basic filesystem > > > > operations: create, remove directories and unlink files. Three bpf > > > > helpers are added for this purpose. When combined with the following > > > > patches that allow pinning and getting bpf objects from bpf prog, > > > > this feature can be used to create directory hierarchy in bpffs that > > > > help manage bpf objects purely using bpf progs. > > > > > > > > The added helpers subject to the same permission checks as their syscall > > > > version. For example, one can not write to a read-only file system; > > > > The identity of the current process is checked to see whether it has > > > > sufficient permission to perform the operations. > > > > > > > > Only directories and files in bpffs can be created or removed by these > > > > helpers. But it won't be too hard to allow these helpers to operate > > > > on files in other filesystems, if we want. > > > > > > > > Signed-off-by: Hao Luo <haoluo@xxxxxxxxxx> > > > > --- > > > > + * > > > > + * long bpf_mkdir(const char *pathname, int pathname_sz, u32 mode) > > > > + * Description > > > > + * Attempts to create a directory name *pathname*. The argument > > > > + * *pathname_sz* specifies the length of the string *pathname*. > > > > + * The argument *mode* specifies the mode for the new directory. It > > > > + * is modified by the process's umask. It has the same semantic as > > > > + * the syscall mkdir(2). > > > > + * Return > > > > + * 0 on success, or a negative error in case of failure. > > > > + * > > > > + * long bpf_rmdir(const char *pathname, int pathname_sz) > > > > + * Description > > > > + * Deletes a directory, which must be empty. > > > > + * Return > > > > + * 0 on sucess, or a negative error in case of failure. > > > > + * > > > > + * long bpf_unlink(const char *pathname, int pathname_sz) > > > > + * Description > > > > + * Deletes a name and possibly the file it refers to. It has the > > > > + * same semantic as the syscall unlink(2). > > > > + * Return > > > > + * 0 on success, or a negative error in case of failure. > > > > */ > > > > > > > > > > How about only introducing bpf_sys_mkdirat and bpf_sys_unlinkat? That would be > > > more useful for other cases in future, and when AT_FDCWD is passed, has the same > > > functionality as these, but when openat/fget is supported, it would work > > > relative to other dirfds as well. It can also allow using dirfd of the process > > > calling read for a iterator (e.g. if it sets the fd number using skel->bss). > > > unlinkat's AT_REMOVEDIR flag also removes the need for a bpf_rmdir. > > > > > > WDYT? > > > > > > > The idea sounds good to me, more flexible. But I don't have a real use > > case for using the added 'dirfd' at this moment. For all the use cases > > I can think of, absolute paths will suffice, I think. Unless other > > reviewers have opposition, I will try switching to mkdirat and > > unlinkat in v2. > > I'm surprised you don't need "at" variants. > I thought your production setup has a top level cgroup controller and > then inner tasks inside containers manage cgroups on their own. > Since containers are involved they likely run inside their own mountns. > cgroupfs mount is single. So you probably don't even need to bind mount it > inside containers, but bpffs is not a single mount. You need > to bind mount top bpffs inside containers for tasks to access it. > Now for cgroupfs the abs path is not an issue, but for bpffs > the AT_FDCWD becomes a problem. AT_FDCWD is using current mount ns. > Inside container that will be different. Unless you bind mount into exact > same path the full path has different meanings inside and outside of the container. > It seems to me the bpf progs attached to cgroup sleepable events should > be using FD of bpffs. Then when these tracepoints are triggered from > different containers in different mountns they will get the right dir prefix. > What am I missing? > Alexei, you are perfectly right. To be honest, I failed to see the fact that the sleepable tp prog is in the container's mount ns. I think we can bind mount the top bpffs into exactly the same path inside container, right? But I haven't tested this idea. Passing FDs should be better. > I think non-AT variants are not needed. The prog can always pass AT_FDCWD > if it's really the intent, but passing actual FD seems more error-proof. Let's have the AT version. Passing FD seems the right approach, when we use it in the context of container.