On Thu, Jul 14, 2022 at 12:24 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Wed, Jul 13, 2022 at 10:24:05PM +0800, Yafang Shao wrote:
> > I have told you that it is not reasonable to refuse a containerized
> > process to pin bpf programs, but if you are not familiar with k8s, it
> > is not easy to explain clearly why it is a trouble for deployment.
> > But I can try to explain to you from a *systemd user's* perspective.
>
> The way systemd currently sets up the cgroup hierarchy doesn't work for
> persistent per-service resource tracking. It needs to introduce an extra
> layer for that, which would be a significant change for systemd too.
>
> > I assume the above hierarchy is what you expect.
> > But you know, in the k8s environment everything is pod-based. That
> > means if we use the above hierarchy in the k8s environment, the k8s
> > limiting, monitoring, and debugging must change accordingly. That
> > could be a full-stack change in k8s, a great refactor.
> >
> > So the hierarchy below is a reasonable solution,
> >
> >                  bpf-memcg
> >                      |
> >   bpf-foo pod    bpf-foo-memcg (limited)
> >     /      \           /
> > (charged) (not charged) (charged)
> >  proc-foo    bpf-foo
> >
> > And then keep the bpf-memcgs persistent.
>
> It looks like you drew the diagram with a variable-width font and it's
> difficult to tell what you're trying to say.

Maybe the diagram below is clearer?

                 bpf-memcg
                     |
  bpf-foo pod    bpf-foo-memcg (limited)
    /      \           /
(charged) (not charged) (charged)
    |        \        /
    |         \      /
 proc-foo     bpf-foo

bpf-foo is loaded by proc-foo, but it is not charged to the bpf-foo
pod; instead it is remotely charged to bpf-foo-memcg.

> That said, I don't think the
> argument you're making is a good one in general. The topic at hand is the
> future architectural direction for handling shared resources, which were
> never well supported before. IOW, we're not talking about breaking
> existing behaviors.
>
> We don't want to architect kernel features to suit the expectations of one
> particular application.
> It has to be longer term than that, and it can't be
> a one-way road. Sometimes the kernel adapts to existing applications
> because the expectations make sense. At other times, the kernel takes a
> direction which may require some work from applications to use new
> capabilities, because that makes more sense in the long term.
>

Shared resources and remote charging are not a new issue; see also
task->active_memcg. The case we are handling now (map->memcg and
map->objcg) is similar to task->active_memcg. If we want to make it
generic, I think we can start with task->active_memcg.

To make it generic, I have some superficial thoughts on the cgroup side:

1) Can we extend the cgroup tree to a cgroup graph?
2) Can we extend cgroups from process-based (cgroup.procs) to
   resource-based (cgroup.resources)?

Regarding question 1). Originally the charge direction is vertical, so
it looks like a tree, as below,

    parent
      ^
      |
    cgroup

But with task->active_memcg there is now also a horizontal charge, as
below,

    parent
      ^
      |
    cgroup ----> friend

The two cgroups share a common ancestor, so the result finally looks
like a graph,

       ancestor
        /    \
      ...    ...
      /        \
    cgroup ---- friend

Regarding question 2). The lifecycle of a leaf cgroup is the same as
that of the processes inside it. But once remote charging is
introduced, the lifecycle of a leaf cgroup may instead follow
processes in other cgroups. That said, it is no longer sufficient to
treat cgroups as process-based, because what we really care about is
the resources, so maybe we should extend the model to be
resource-based.

> Let's keep the discussion more focused on technical merits.
>
> Thanks.
>
> --
> tejun

--
Regards
Yafang