On Sat, Mar 5, 2022 at 3:47 PM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Mar 4, 2022 at 10:37 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
> >
> > I gave this question more thought. We don't need to bind mount the top
> > bpffs into the container; instead, we may be able to overlay a bpffs
> > directory into the container. Here is the workflow I have in mind:
>
> I don't quite follow what you mean by 'overlay' here.
> Another bpffs mount or future overlayfs that supports bpffs?
>
> > For each job, let's say A, the container runtime can create a
> > directory in bpffs, for example
> >
> >   /sys/fs/bpf/jobs/A
> >
> > and then create the cgroup for A. The sleepable tracing prog will
> > create the file:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > 100 is the created cgroup's id. Then the container runtime overlays
> > the bpffs directory into container A at the same path:
>
> Why cgroup id? Wouldn't it be easier to use the same cgroup name
> as in cgroupfs?
>

Cgroup names aren't unique. We don't need the hierarchy information of
cgroups, and we can use a library function to translate a cgroup path to
a cgroup id. See get_cgroup_id() in patch 9/9; it works fine in the
selftest.

> > [A's container path]/sys/fs/bpf/jobs/A.
> >
> > A can see the stats at the path within its mount ns:
> >
> >   /sys/fs/bpf/jobs/A/100/stats
> >
> > When A creates a cgroup, it is able to write to the top layer of the
> > overlayed directory. So it is
> >
> >   /sys/fs/bpf/jobs/A/101/stats
> >
> > Some of my thoughts:
> >
> > 1. Compared to bind mounting the top bpffs into the container,
> > overlaying a directory avoids exposing other jobs' stats. This gives
> > better isolation. I already have a patch for supporting layering
> > bpffs over other filesystems; it's not too hard.
>
> So it's an overlayfs combination of bpffs and something like ext4, right?
> I thought you found out that overlayfs has to be the upper fs
> and the lower fs shouldn't be modified underneath.
> So if bpffs is the lower fs, the writes into it should go
> through the upper overlayfs, right?
>

It's overlayfs combining bpffs and ext4, but bpffs is the upper layer;
the lower layer is an empty ext4 directory. The merged directory is a
directory in the container. The upper layer contains the bpf objects we
want to expose to the container, for example the sleepable tracing progs
and the iter link for reading stats. Only the merged directory is
visible to the container, and all updates go through it.

Here is an example of the workflow I'm thinking of:

Step 1: Set up the directories and bpf objects needed by the container.

  [# ~] ls /sys/fs/bpf/container/upper
  tracing_prog iter_link
  [# ~] ls /sys/fs/bpf/container/work
  [# ~] ls /container
  root lower
  [# ~] ls /container/root
  bpf
  [# ~] ls /container/root/bpf

Step 2: Use overlayfs to mount a directory from bpffs into the
container's home.

  [# ~] mkdir /container/lower
  [# ~] mkdir /sys/fs/bpf/container/work
  [# ~] mount -t overlay overlay -o \
        lowerdir=/container/lower,\
        upperdir=/sys/fs/bpf/container/upper,\
        workdir=/sys/fs/bpf/container/work \
        /container/root/bpf
  [# ~] ls /container/root/bpf
  tracing_prog iter_link

Step 3: pivot root for the container; we expect the bpf objects to be
mapped into the container:

  [# ~] chroot /container/root
  [# ~] ls /
  bpf
  [# ~] ls /bpf
  tracing_prog iter_link

Notes:

- I haven't tested step 3, but steps 1 and 2 seem to work as expected.
  I am testing the behavior of the bpf objects after we enter the
  container.

- Only a directory in bpffs is mapped into the container, not the top
  bpffs. The path is uniform across containers, that is, /bpf. The
  container should be able to mkdir in /bpf, etc.

> > 2. Once the container runtime has overlayed the directory into the
> > container, it has no need to create more cgroups for this job. It
> > doesn't need to track the stats of job-created cgroups, which are
> > mainly for inspection by the job itself.
> > Even if it needs to collect the stats from those cgroups, it can
> > read from the path in the container.
> >
> > 3. The overlay path in the container doesn't have to be exactly the
> > same as the path in the root mount ns. In the sleepable tracing
> > prog, we may select paths based on the current process's ns. If we
> > choose to do this, we can further avoid exposing the cgroup id and
> > job name to the container.
>
> The benefits make sense.