Re: [RFC PATCH 3/3] misc_cgroup: remove error log to avoid log flood

Vipin Sharma <vipinsh@xxxxxxxxxx> · Fri, 10 Sep 2021 10:16:46 -0700

On Fri, Sep 10, 2021 at 9:29 AM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Fri, Sep 10, 2021 at 05:36:09PM +0200, Michal Koutný wrote:
> > If there's a limit on certain level with otherwise unconstrained cgroup
> > structure below (a valid config too), the 'fail' counter would help
> > determining what the affected cgroup is. Does that make sense to you?
>
> While the desire to make the interface complete is understandable, I don't
> think we need to go too far in that direction given that debugging these
> configuration issues requires human intervention anyway and providing
> overall information is often enough of aid especially for simple controllers
> like misc/pid. So, let's stick to something consistent and simple even if
> not complete and definitely not name them "fail" even if we add them.
>

I understand what Michal is proposing regarding fail vs max and local
vs hierarchical. I think this will provide complete information but it
will be too many interfaces for a simple controller like misc and
might not even get used by anyone.

Chunguang's goal was to avoid flooding the log with messages. I agree
we should remove the log message and add a file event instead.

For now, I think we can just have one file, events.local
(non-hierarchical), with %s.max type entries. This will tell us which
cgroup is under pressure, and I believe that is helpful.
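
To illustrate how a job owner could consume such a file, here is a
rough userspace sketch. It assumes each line has the form
"<resource>.max <count>", which is my reading of "%s.max type entries"
and not a settled format. The helper names, the cgroup paths, and the
resource names in the usage note are made up for illustration.

```python
def parse_events(text):
    """Parse '<resource>.max <count>' lines into {resource: count}.

    The line format is an assumption; the interface is only proposed.
    """
    counts = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts
        if key.endswith(".max"):
            counts[key[: -len(".max")]] = int(value)
    return counts


def cgroups_under_pressure(events_by_cgroup):
    """Return the cgroup paths with a non-zero local max count.

    events_by_cgroup maps a (hypothetical) cgroup path to the contents
    of its events.local file, read by the caller.
    """
    return sorted(
        path
        for path, text in events_by_cgroup.items()
        if any(parse_events(text).values())
    )
```

Given {"job": "sev.max 0\n", "job/vm1": "sev.max 2\n"}, only "job/vm1"
is reported, because the file is non-hierarchical and counts only
events that hit that cgroup's own limit.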

The original cgroup that started the charge should be easier to
identify, because its processes will either be unable to proceed or
will fall back to some alternate logic, and the job owner should be
able to notice that.

If there is a future need to find the originating cgroup, we can
resume this discussion then.