On Fri, Sep 10, 2021 at 9:29 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, Sep 10, 2021 at 05:36:09PM +0200, Michal Koutný wrote:
> > If there's a limit on certain level with otherwise unconstrained cgroup
> > structure below (a valid config too), the 'fail' counter would help
> > determining what the affected cgroup is. Does that make sense to you?
>
> While the desire to make the interface complete is understandable, I don't
> think we need to go too far in that direction given that debugging these
> configuration issues requires human intervention anyway and providing
> overall information is often enough of aid especially for simple controllers
> like misc/pid. So, let's stick to something consistent and simple even if
> not complete and definitely not name them "fail" even if we add them.
>

I understand what Michal is proposing regarding fail vs max and local vs
hierarchical. I think this would provide complete information, but it
would be too many interfaces for a simple controller like misc and might
not even get used by anyone.

Chunguang's case was to avoid printing so many messages; I agree we
should remove the log message and add a file event.

For now, I think we can just have one file, events.local
(non-hierarchical), which has %s.max type entries. This will tell us
which cgroup is under pressure, and I believe this is helpful.

The original cgroup which started the charge should be easier to
identify, because those processes will not be able to proceed or will
use some alternate logic, and the job owner should be able to notice it.

If in the future there is a need to find the originating cgroup, we can
resume this discussion at that time.
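
To make the events.local idea concrete, here is a rough userspace sketch
(not part of any patch) of how a job owner might find the cgroups under
pressure. It assumes each line of the file reads "<resource>.max <count>",
following the %s.max entries mentioned above; the resource name res_a and
the cgroup paths in the usage note are hypothetical.

```python
def parse_max_events(text):
    """Parse '<resource>.max <count>' lines into {resource: count}.

    The line format is an assumption based on the '%s.max' entries
    proposed in this thread; lines of any other shape are skipped.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        key, value = fields
        if not key.endswith(".max"):
            continue
        counts[key[: -len(".max")]] = int(value)
    return counts


def cgroups_under_pressure(events_by_cgroup):
    """Return the sorted cgroup paths with a non-zero local .max counter.

    events_by_cgroup maps a cgroup path to the text of its events.local
    file (both hypothetical inputs supplied by the caller).
    """
    return sorted(
        path
        for path, text in events_by_cgroup.items()
        if any(count > 0 for count in parse_max_events(text).values())
    )
```

For example, given {"/a": "res_a.max 0\n", "/a/b": "res_a.max 2\n"},
only "/a/b" is reported, since the counters are local and not propagated
up the hierarchy.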