On Tue 26-09-23 22:39:11, Haifeng Xu wrote:
> 
> 
> On 2023/9/25 20:37, Michal Hocko wrote:
> > On Mon 25-09-23 20:28:02, Haifeng Xu wrote:
> >> 
> >> 
> >> On 2023/9/25 19:38, Michal Hocko wrote:
> >>> On Mon 25-09-23 17:03:05, Haifeng Xu wrote:
> >>>> 
> >>>> 
> >>>> On 2023/9/25 15:57, Michal Hocko wrote:
> >>>>> On Fri 22-09-23 07:05:28, Haifeng Xu wrote:
> >>>>>> When an application in userland receives an oom notification from
> >>>>>> the kernel and reads the oom_control file, it's confusing that
> >>>>>> under_oom is 0 even though the oom killer hasn't finished. The
> >>>>>> reason is that under_oom is cleared before invoking
> >>>>>> mem_cgroup_out_of_memory(), so move the unmarking of under_oom to
> >>>>>> after oom handling completes. That way the value of under_oom
> >>>>>> won't mislead users.
> >>>>>
> >>>>> I do not really remember why we are doing it this way but trying to
> >>>>> track this down shows that we have been doing that since
> >>>>> fb2a6fc56be6 ("mm: memcg: rework and document OOM waiting and
> >>>>> wakeup"). So this is an established behavior for 10 years now. Do
> >>>>> we really need to change it now? The interface is legacy and
> >>>>> hopefully no new workloads are emerging.
> >>>>>
> >>>>> I agree that the placement is surprising but I would rather not
> >>>>> change that unless there is a very good reason for it. Do you have
> >>>>> any actual workload which depends on the ordering? And if yes, how
> >>>>> do you deal with timing when the consumer of the notification only
> >>>>> gets woken up after mem_cgroup_out_of_memory completes?
> >>>>
> >>>> yes, when the oom event is triggered, we check under_oom every 10
> >>>> seconds. If it is cleared, then we create a new process with less
> >>>> memory allocation to avoid oom again.
> >>>
> >>> OK, I do understand what you mean and I could have made myself
> >>> clearer previously. Even if the state is cleared _after_
> >>> mem_cgroup_out_of_memory then you won't get what you need I am
> >>> afraid. The memcg stays under OOM until memory is freed (uncharged)
> >>> from that memcg. mem_cgroup_out_of_memory itself doesn't really free
> >>> any memory on its own. It relies on the task to wake up and die or
> >>> on the oom_reaper to do the work on its behalf. All of that is time
> >>> dependent. under_oom would have to be reimplemented to be cleared
> >>> when memory is uncharged to meet your demands. Something that has
> >>> never really been the semantic.
> >>>
> >> 
> >> yes, but at least before we create the new process, it has more
> >> chance to get some memory freed.
> >
> > The time window we are talking about is the call to
> > mem_cgroup_out_of_memory which, depending on the number of evaluated
> > processes, could be very short. So what kind of practical difference
> > does this make for your workload? Is this measurable in any way?
> 
> The oom events in this group seem less frequent than before.

Let me see if I follow. You are launching new workloads after oom
happens, as soon as under_oom becomes 0. With the patch applied you see
fewer oom invocations, which implies that fewer re-launchings hit
still-under-oom situations? I would also expect that those are compared
over the same time period. Do you have any actual numbers to present?
Are they statistically representative?

I really have to say that I am skeptical about the presented use case.
Optimizing over oom events seems like a very wrong way to scale a
workload. The timing of oom handling is subject to change at any time
and what you are optimizing for might change with it.
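Just so we are talking about the same thing, the v1 flow being
described looks roughly like the sketch below. This is only an
illustration: the /sys/fs/cgroup/memory mount point and the "job"
group name are made up, and most error handling is omitted.

/*
 * Register an eventfd for oom notifications on a v1 memcg, wait for
 * an event and then poll under_oom, as in the workflow described
 * above. Illustrative sketch only.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	const char *dir = "/sys/fs/cgroup/memory/job";
	char path[256], buf[128];
	int efd, ocfd, ecfd;
	uint64_t cnt;
	ssize_t n;

	efd = eventfd(0, 0);
	snprintf(path, sizeof(path), "%s/memory.oom_control", dir);
	ocfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
	ecfd = open(path, O_WRONLY);
	if (efd < 0 || ocfd < 0 || ecfd < 0)
		exit(1);

	/* Arm the oom event: "<eventfd> <memory.oom_control fd>". */
	snprintf(buf, sizeof(buf), "%d %d", efd, ocfd);
	write(ecfd, buf, strlen(buf));

	/* Block until the kernel signals an oom event in the group. */
	read(efd, &cnt, sizeof(cnt));

	/* Poll under_oom every 10 seconds, as described above. */
	for (;;) {
		lseek(ocfd, 0, SEEK_SET);
		n = read(ocfd, buf, sizeof(buf) - 1);
		if (n <= 0)
			exit(1);
		buf[n] = '\0';
		if (strstr(buf, "under_oom 0"))
			break;	/* relaunch with a smaller footprint here */
		sleep(10);
	}
	return 0;
}

Note that nothing in this flow guarantees that memory has actually been
uncharged by the time under_oom reads 0, which is the crux of the
discussion above.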
That being said, I do not see any obvious problem with the patch. IMO
we should rather not apply it because it slightly changes a long-term
behavior for something that is in legacy mode now. But I will not Nack
it either as it is just a trivial thing. I just do not like the idea of
changing the timing of under_oom clearing merely to fine-tune some
workloads.

> >>> Btw. is this something new that you are developing on top of v1?
> >>> And if yes, why don't you use v2?
> >>>
> >> 
> >> yes, v2 doesn't have the "cgroup.event_control" file.
> >
> > Yes, it doesn't. But why is it necessary? Relying on v1 just for this
> > is far from ideal as v1 is deprecated and mostly frozen. Why do you
> > need to rely on the oom notifications (or oom behavior in general) in
> > the first place? Could you share more about your workload and your
> > requirements?
> 
> for example, we want to run processes in the group but the parameters
> related to memory allocation are hard to decide, so we use the
> notifications to tell us when we need to adjust the parameters
> automatically, instead of creating the new processes manually.

I do understand that, but OOM is just way too late to tune anything
upon. Cgroup v2 has the notion of a high limit which can throttle
memory allocations well before the hard limit is hit, and this,
together with the PSI metrics, could give you much better insight into
the memory pressure in a memcg.
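Concretely, on v2 the adjustment loop could key off memory.pressure
rather than oom events. A minimal sketch, with the caveat that the
/sys/fs/cgroup/job path, the 512M high limit and the 10% threshold are
all illustrative and memory.pressure requires CONFIG_PSI:

/*
 * Set memory.high below the hard limit and react to sustained PSI
 * memory stalls instead of waiting for oom. Illustrative sketch only.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];
	double avg10;
	FILE *f;

	/* Throttle allocations well before the hard limit is hit. */
	f = fopen("/sys/fs/cgroup/job/memory.high", "w");
	if (!f)
		return 1;
	fprintf(f, "%llu", 512ULL << 20);
	fclose(f);

	for (;;) {
		f = fopen("/sys/fs/cgroup/job/memory.pressure", "r");
		if (!f)
			return 1;
		/* "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
		avg10 = 0.0;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "some", 4))
				sscanf(line, "some avg10=%lf", &avg10);
		fclose(f);

		if (avg10 > 10.0) {
			/* Sustained stalls: shrink the workload's memory
			 * parameters here instead of reacting to oom. */
		}
		sleep(10);
	}
}

-- 
Michal Hocko
SUSE Labs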