On Fri, Mar 4, 2022 at 10:37 AM Hao Luo <haoluo@xxxxxxxxxx> wrote:
>
> I gave this question more thought. We don't need to bind mount the top
> bpffs into the container, instead, we may be able to overlay a bpffs
> directory into the container. Here is the workflow in my mind:

I don't quite follow what you mean by 'overlay' here.
Another bpffs mount, or a future overlayfs that supports bpffs?

> For each job, let's say A, the container runtime can create a
> directory in bpffs, for example
>
>   /sys/fs/bpf/jobs/A
>
> and then create the cgroup for A. The sleepable tracing prog will
> create the file:
>
>   /sys/fs/bpf/jobs/A/100/stats
>
> 100 is the created cgroup's id. Then the container runtime overlays
> the bpffs directory into container A in the same path:

Why cgroup id? Wouldn't it be easier to use the same cgroup name
as in cgroupfs?

>   [A's container path]/sys/fs/bpf/jobs/A.
>
> A can see the stats at the path within its mount ns:
>
>   /sys/fs/bpf/jobs/A/100/stats
>
> When A creates cgroup, it is able to write to the top layer of the
> overlayed directory. So it is
>
>   /sys/fs/bpf/jobs/A/101/stats
>
> Some of my thoughts:
> 1. Compared to bind mount top bpffs into container, overlaying a
> directory avoids exposing other jobs' stats. This gives better
> isolation. I already have a patch for supporting laying bpffs over
> other fs, it's not too hard.

So it's an overlayfs combination of bpffs and something like ext4, right?
I thought you found out that overlayfs has to be the upper fs
and the lower fs shouldn't be modified underneath.
So if bpffs is the lower fs, the writes into it should go
through the upper overlayfs, right?

> 2. Once the container runtime has overlayed directory into the
> container, it has no need to create more cgroups for this job. It
> doesn't need to track the stats of job-created cgroups, which are
> mainly for inspection by the job itself. Even if it needs to collect
> the stats from those cgroups, it can read from the path in the
> container.
> 3. The overlay path in container doesn't have to be exactly the same
> as the path in root mount ns. In the sleepable tracing prog, we may
> select paths based on current process's ns. If we choose to do this,
> we can further avoid exposing cgroup id and job name to the container.

The benefits make sense.
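
To make the id-vs-name question concrete, here is a rough userspace sketch
of the two naming schemes for the per-cgroup stats file. It is only an
illustration, not part of any patch: the helper names are hypothetical, the
/sys/fs/bpf/jobs layout is the one proposed in the quoted mail, and the
cgroup root passed to the name-based helper is just an example. It also
assumes that on cgroup v2 (64-bit) the inode number of a cgroup directory
matches its cgroup id.

```python
import os
from pathlib import Path

# Layout proposed in the quoted mail; the helpers below are hypothetical.
BPFFS_JOBS_ROOT = Path("/sys/fs/bpf/jobs")


def cgroup_id_from_path(cgroup_dir):
    """Return the inode number of a cgroup directory.

    Assumption: on cgroup v2 with a 64-bit kernel this inode number is the
    same value a BPF program sees as the cgroup id.
    """
    return os.stat(cgroup_dir).st_ino


def stats_path_by_id(job, cgroup_id, root=BPFFS_JOBS_ROOT):
    """Id-based scheme: <root>/<job>/<cgroup id>/stats."""
    return Path(root) / job / str(cgroup_id) / "stats"


def stats_path_by_name(job, cgroup_dir, cgroup_root, root=BPFFS_JOBS_ROOT):
    """Name-based scheme: mirror the cgroupfs path relative to the job's cgroup."""
    rel = Path(cgroup_dir).relative_to(cgroup_root)
    return Path(root) / job / rel / "stats"
```

With the id-based scheme a reader has to translate a cgroup directory to
its id first (cgroup_id_from_path) before it can find the stats file; with
the name-based scheme the bpffs tree simply mirrors cgroupfs, e.g.
stats_path_by_name("A", "/sys/fs/cgroup/jobs/A/child", "/sys/fs/cgroup/jobs/A")
gives /sys/fs/bpf/jobs/A/child/stats.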