Re: [PATCH memcg 3/3] memcg: handle memcg oom failures

Vasily Averin <vvs@xxxxxxxxxxxxx> · Thu, 21 Oct 2021 18:05:28 +0300

On 21.10.2021 14:49, Michal Hocko wrote:
> I do understand that handling a very specific case sounds easier but it
> would be better to have a robust fix even if that requires some more
> head scratching. So far we have collected several reasons why the it is
> bad to trigger oom killer from the #PF path. There is no single argument
> to keep it so it sounds like a viable path to pursue. Maybe there are
> some very well hidden reasons but those should be documented and this is
> a great opportunity to do either of the step.
> 
> Moreover if it turns out that there is a regression then this can be
> easily reverted and a different, maybe memcg specific, solution can be
> implemented.

Now I'm agree,
however I still have a few open questions.

1) VM_FAULT_OOM may be triggered w/o execution of out_of_memory()
for exampel it can be caused by incorrect vm fault operations, 
(a) which can return this error without calling allocator at all.
(b) or which can provide incorrect gfp flags and allocator can fail without execution of out_of_memory.
(c) This may happen on stable/LTS kernels when successful allocation was failed by hit into limit of legacy memcg-kmem contoller.
We'll drop it in upstream kernels, however how to handle it in old kenrels?

We can make sure that out_of_memory or alocator was called by set of some per-task flags.

Can pagefault_out_of_memory() send itself a SIGKILL in all these cases?

If not -- task will be looped. 
It is much better than execution of global OOM, however it would be even better to avoid it somehow.

You said: "We cannot really kill the task if we could we would have done it by the oom killer already".
However what to do if we even not tried to use oom-killer? (see (b) and (c)) 
or if we did not used the allocator at all (see (a))

2) in your patch we just exit from pagefault_out_of_memory(). and restart new #PF.
We can call schedule_timeout() and wait some time before a new #PF restart.
Additionally we can increase this delay in each new cycle. 
It helps to save CPU time for other tasks.
What do you think about?

Thank you,
	Vasily Averin