Re: [PATCH] oom_reaper: close race without using oom_lock

Michal Hocko <mhocko@xxxxxxxxxx> · Mon, 7 Aug 2017 08:02:43 +0200

On Sat 05-08-17 10:02:55, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 26-07-17 20:33:21, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > On Sun 23-07-17 09:41:50, Tetsuo Handa wrote:
> > > > > So, how can we verify the above race a real problem?
> > > > 
> > > > Try to simulate a _real_ workload and see whether we kill more tasks
> > > > than necessary. 
> > > 
> > > Whether it is a _real_ workload or not cannot become an answer.
> > > 
> > > If somebody is trying to allocate hundreds/thousands of pages after memory of
> > > an OOM victim was reaped, avoiding this race window makes no sense; next OOM
> > > victim will be selected anyway. But if somebody is trying to allocate only one
> > > page and then is planning to release a lot of memory, avoiding this race window
> > > can save somebody from being OOM-killed needlessly. This race window depends on
> > > what the threads are about to do, not whether the workload is natural or
> > > artificial.
> > 
> > And with a desparate lack of crystal ball we cannot do much about that
> > really.
> > 
> > > My question is, how can users know it if somebody was OOM-killed needlessly
> > > by allowing MMF_OOM_SKIP to race.
> > 
> > Is it really important to know that the race is due to MMF_OOM_SKIP?
> 
> Yes, it is really important. Needlessly selecting even one OOM victim is
> a pain which is difficult to explain to and persuade some of customers.

How is this any different from a race with a task exiting an releasing
some memory after we have crossed the point of no return and will kill
something?

> > Isn't it sufficient to see that we kill too many tasks and then debug it
> > further once something hits that?
> 
> It is not sufficient.
> 
> > 
> > [...]
> > > Is it guaranteed that __node_reclaim() never (even indirectly) waits for
> > > __GFP_DIRECT_RECLAIM && !__GFP_NORETRY memory allocation?
> > 
> > this is a direct reclaim which can go down to slab shrinkers with all
> > the usual fun...
> 
> Excuse me, but does that mean "Yes, it is" ?
> 
> As far as I checked, most shrinkers use non-scheduling operations other than
> cond_resched(). But some shrinkers use lock_page()/down_write() etc. I worry
> that such shrinkers might wait for __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
> memory allocation (i.e. "No, it isn't").

Yes that is possible. Once you are in the shrinker land then you have to
count with everything. And if you want to imply that
get_page_from_freelist inside __alloc_pages_may_oom may lockup while
holding the oom_lock then you might be right but I haven't checked that
too deeply. It might be very well possible that the node reclaim bails
out early when we are under OOM.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>