Re: [PATCH] memcg, oom: move out_of_memory back to the charge path

Greg Thelen <gthelen@xxxxxxxxxx> · Tue, 03 Jul 2018 16:29:00 -0700

Michal Hocko <mhocko@xxxxxxxxxx> wrote:

> On Tue 03-07-18 00:08:05, Greg Thelen wrote:
>> Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> 
>> > On Fri 29-06-18 11:59:04, Greg Thelen wrote:
>> >> Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> 
>> >> > On Thu 28-06-18 16:19:07, Greg Thelen wrote:
>> >> >> Michal Hocko <mhocko@xxxxxxxxxx> wrote:
>> >> > [...]
>> >> >> > +	if (mem_cgroup_out_of_memory(memcg, mask, order))
>> >> >> > +		return OOM_SUCCESS;
>> >> >> > +
>> >> >> > +	WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
>> >> >> > +		"This looks like a misconfiguration or a kernel bug.");
>> >> >> 
>> >> >> I'm not sure here if the warning should here or so strongly worded.  It
>> >> >> seems like the current task could be oom reaped with MMF_OOM_SKIP and
>> >> >> thus mem_cgroup_out_of_memory() will return false.  So there's nothing
>> >> >> alarming in that case.
>> >> >
>> >> > If the task is reaped then its charges should be released as well and
>> >> > that means that we should get below the limit. Sure there is some room
>> >> > for races but this should be still unlikely. Maybe I am just
>> >> > underestimating though.
>> >> >
>> >> > What would you suggest instead?
>> >> 
>> >> I suggest checking MMF_OOM_SKIP or deleting the warning.
>> >
>> > So what do you do when you have MMF_OOM_SKIP task? Do not warn? Checking
>> > for all the tasks would be quite expensive and remembering that from the
>> > task selection not nice either. Why do you think it would help much?
>> 
>> I assume we could just check current's MMF_OOM_SKIP - no need to check
>> all tasks.
>
> I still do not follow. If you are after a single task memcg then we
> should be ok. try_charge has a runaway for oom victims
> 	if (unlikely(tsk_is_oom_victim(current) ||
> 		     fatal_signal_pending(current) ||
> 		     current->flags & PF_EXITING))
> 		goto force;
>
> regardless of MMF_OOM_SKIP. So if there is a single process in the
> memcg, we kill it and the oom reaper kicks in and sets MMF_OOM_SKIP then
> we should bail out there. Or do I miss your intention?

For a single task memcg it seems that racing process cgroup migration
could trigger the new warning (I have attempted to reproduce this):

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]               Process B[M2]

  M1 is oom
  try_charge(M1)
                              Move A M1=>M2
  mem_cgroup_oom()
  mem_cgroup_out_of_memory()
    out_of_memory()
      select_bad_process()
        sees nothing in M1
      return 0
    return 0
  WARN()

Another variant might be possible, this time with global oom:

Processes A,B in memcg M1,M2.  M1 is oom.

  Process A[M1]               Process B[M2]

  try_charge()
                              trigger global oom
                              reaper sets A.MMF_OOM_SKIP
  mem_cgroup_oom()
  mem_cgroup_out_of_memory()
    out_of_memory()
      select_bad_process()
        sees nothing in M1
      return 0
    return 0
  WARN()

These seem unlikely, so I'm fine with taking a wait-and-see approach.