On Sat, Mar 5, 2022 at 3:47 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Mar 4, 2022 at 10:37 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
> >
> > I gave this question more thought. We don't need to bind mount the top
> > bpffs into the container; instead, we may be able to overlay a bpffs
> > directory into the container. Here is the workflow I have in mind:
>
> I don't quite follow what you mean by 'overlay' here.
> Another bpffs mount or future overlayfs that supports bpffs?
>
> > For each job, let's say A, the container runtime can create a
> > directory in bpffs, for example
> >
> >   /sys/fs/bpf/jobs/A
> >
> > and then create the cgroup for A. The sleepable tracing prog will
> > create the file:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > 100 is the created cgroup's id. Then the container runtime overlays
> > the bpffs directory into container A at the same path:
>
> Why cgroup id? Wouldn't it be easier to use the same cgroup name
> as in cgroupfs?
>

Cgroup names aren't unique. We don't need the hierarchy information of
cgroups, and we can use a library function to translate a cgroup path to
a cgroup id. See get_cgroup_id() in patch 9/9; it works fine in the
selftest.

> > [A's container path]/sys/fs/bpf/jobs/A.
> >
> > A can see the stats at the path within its mount ns:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > When A creates a cgroup, it is able to write to the top layer of the
> > overlayed directory. So it is
> >
> >   /sys/fs/bpf/jobs/A/101/stats
> >
> > Some of my thoughts:
> >
> > 1. Compared to bind mounting the top bpffs into the container,
> > overlaying a directory avoids exposing other jobs' stats. This gives
> > better isolation. I already have a patch for supporting layering
> > bpffs over other filesystems; it's not too hard.
>
> So it's an overlayfs combination of bpffs and something like ext4, right?
> I thought you found out that overlayfs has to be the upper fs
> and the lower fs shouldn't be modified underneath.
> So if bpffs is the lower fs, the writes into it should go
> through the upper overlayfs, right?
>

It's overlayfs combining bpffs and ext4, but bpffs is the upper layer;
the lower layer is an empty ext4 directory. The merged directory is a
directory in the container. The upper layer contains the bpf objects we
want to expose to the container, for example the sleepable tracing progs
and the iter link for reading stats. Only the merged directory is
visible to the container, and all updates go through it.

Here is an example of the workflow I'm thinking of:

Step 1: Set up the directories and bpf objects needed by the container.

  [# ~] ls /sys/fs/bpf/container/upper
  tracing_prog iter_link
  [# ~] ls /sys/fs/bpf/container/work
  [# ~] ls /container
  root lower
  [# ~] ls /container/root
  bpf
  [# ~] ls /container/root/bpf

Step 2: Use overlayfs to mount a directory from bpffs into the
container's home.

  [# ~] mkdir /container/lower
  [# ~] mkdir /sys/fs/bpf/container/work
  [# ~] mount -t overlay overlay -o \
        lowerdir=/container/lower,\
        upperdir=/sys/fs/bpf/container/upper,\
        workdir=/sys/fs/bpf/container/work \
        /container/root/bpf
  [# ~] ls /container/root/bpf
  tracing_prog iter_link

Step 3: pivot root for the container; we expect the bpf objects to be
mapped into the container:

  [# ~] chroot /container/root
  [# ~] ls /
  bpf
  [# ~] ls /bpf
  tracing_prog iter_link

Notes:

- I haven't tested step 3, but steps 1 and 2 seem to work as expected.
  I am testing the behavior of the bpf objects after we enter the
  container.

- Only a directory in bpffs is mapped into the container, not the top
  bpffs. The path is uniform across containers, that is, /bpf. The
  container should be able to mkdir in /bpf, etc.

> > 2. Once the container runtime has overlayed the directory into the
> > container, it has no need to create more cgroups for this job. It
> > doesn't need to track the stats of job-created cgroups, which are
> > mainly for inspection by the job itself.
> > Even if it needs to collect the stats from those cgroups, it can
> > read from the path in the container.
> >
> > 3. The overlay path in the container doesn't have to be exactly the
> > same as the path in the root mount ns. In the sleepable tracing
> > prog, we may select paths based on the current process's ns. If we
> > choose to do this, we can further avoid exposing the cgroup id and
> > job name to the container.
>
> The benefits make sense.