On Thu 03-08-17 10:37:01, Mel Gorman wrote: > On Wed, Aug 02, 2017 at 10:29:14AM +0200, Michal Hocko wrote: > > For ages we have been relying on TIF_MEMDIE thread flag to mark OOM > > victims and then, among other things, to give these threads full > > access to memory reserves. There are few shortcomings of this > > implementation, though. > > > > I believe the original intent was that allocations from the exit path > would be small and relatively short-lived. Yes. > > > First of all and the most serious one is that the full access to memory > > reserves is quite dangerous because we leave no safety room for the > > system to operate and potentially do last emergency steps to move on. > > > > Secondly this flag is per task_struct while the OOM killer operates > > on mm_struct granularity so all processes sharing the given mm are > > killed. Giving the full access to all these task_structs could lead to > > a quick memory reserves depletion. > > This is a valid concern. > > > We have tried to reduce this risk by > > giving TIF_MEMDIE only to the main thread and the currently allocating > > task but that doesn't really solve this problem while it surely opens up > > a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside > > the allocator without access to memory reserves because a particular > > thread was not the group leader. > > > > Now that we have the oom reaper and that all oom victims are reapable > > after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the > > kthreads") we can be more conservative and grant only partial access to > > memory reserves because there are reasonable chances of the parallel > > memory freeing. We still want some access to reserves because we do not > > want other consumers to eat up the victim's freed memory. oom victims > > will still contend with __GFP_HIGH users but those shouldn't be so > > aggressive to starve oom victims completely. > > > > Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to > > the half of the reserves. This makes the access to reserves independent > > on which task has passed through mark_oom_victim. Also drop any > > usage of TIF_MEMDIE from the page allocator proper and replace it by > > tsk_is_oom_victim as well which will make page_alloc.c completely > > TIF_MEMDIE free finally. > > > > CONFIG_MMU=n doesn't have oom reaper so let's stick to the original > > ALLOC_NO_WATERMARKS approach. > > > > There is a demand to make the oom killer memcg aware which will imply > > many tasks killed at once. This change will allow such a usecase without > > worrying about complete memory reserves depletion. > > > > Changes since v1 > > - do not play tricks with nommu and grant access to memory reserves as > > long as TIF_MEMDIE is set > > - break out from allocation properly for oom victims as per Tetsuo > > - distinguish oom victims from other consumers of memory reserves in > > __gfp_pfmemalloc_flags - per Tetsuo > > > > Signed-off-by: Michal Hocko <mhocko@xxxxxxxx> > > > > diff --git a/mm/internal.h b/mm/internal.h > > index 24d88f084705..1ebcb1ed01b5 100644 > > --- a/mm/internal.h > > +++ b/mm/internal.h > > @@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone, > > /* Mask to get the watermark bits */ > > #define ALLOC_WMARK_MASK (ALLOC_NO_WATERMARKS-1) > > > > +/* > > + * Only MMU archs have async oom victim reclaim - aka oom_reaper so we > > + * cannot assume a reduced access to memory reserves is sufficient for > > + * !MMU > > + */ > > +#ifdef CONFIG_MMU > > +#define ALLOC_OOM 0x08 > > +#else > > +#define ALLOC_OOM ALLOC_NO_WATERMARKS > > +#endif > > + > > #define ALLOC_HARDER 0x10 /* try to alloc harder */ > > #define ALLOC_HIGH 0x20 /* __GFP_HIGH set */ > > #define ALLOC_CPUSET 0x40 /* check for correct cpuset */ > > Ok, no collision with the wmark indexes so that should be fine. While I > didn't check, I suspect that !MMU users also have relatively few CPUs to > allow major contention. Well, I didn't try to improve the !MMU case because a) I do not know whether there is a real problem with oom depletion there and b) I have no way to test this. So I only focused on keeping the status quo for nommu. [...] > > @@ -2943,10 +2943,19 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, > > * the high-atomic reserves. This will over-estimate the size of the > > * atomic reserve but it avoids a search. > > */ > > - if (likely(!alloc_harder)) > > + if (likely(!alloc_harder)) { > > free_pages -= z->nr_reserved_highatomic; > > - else > > - min -= min / 4; > > + } else { > > + /* > > + * OOM victims can try even harder than normal ALLOC_HARDER > > + * users > > + */ > > + if (alloc_flags & ALLOC_OOM) > > + min -= min / 2; > > + else > > + min -= min / 4; > > + } > > + > > > > #ifdef CONFIG_CMA > > /* If allocation can't use CMA areas don't use free CMA pages */ > > I see no problem here although the comment could be slightly better. It > suggests at OOM victims can try harder but not why > > /* > * OOM victims can try even harder than normal ALLOC_HARDER users on the > * grounds that it's definitely going to be in the exit path shortly and > * free memory. Any allocation it makes during the free path will be > * small and short-lived. > */ OK, I have replaced the coment. > > @@ -3603,21 +3612,46 @@ gfp_to_alloc_flags(gfp_t gfp_mask) > > return alloc_flags; > > } > > > > -bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) > > +static bool oom_reserves_allowed(struct task_struct *tsk) > > { > > - if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) > > + if (!tsk_is_oom_victim(tsk)) > > + return false; > > + > > + /* > > + * !MMU doesn't have oom reaper so give access to memory reserves > > + * only to the thread with TIF_MEMDIE set > > + */ > > + if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE)) > > return false; > > > > + return true; > > +} > > + > > Ok, there is a chance that a task selected as an OOM kill victim may be > in the middle of a __GFP_NOMEMALLOC allocation but I can't actually see a > problem wiith that. __GFP_NOMEMALLOC users are not going to be in the exit > path (which we care about for an OOM killed task) and the caller should > always be able to handle a failure. Not sure I understand. If the oom victim is doing __GFP_NOMEMALLOC then we haven't been doing ALLOC_NO_WATERMARKS even before. So I preserve the behavior here. Even though I am not sure this is a deliberate behavior or something more of result of an evolution of the code. > > +/* > > + * Distinguish requests which really need access to full memory > > + * reserves from oom victims which can live with a portion of it > > + */ > > +static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask) > > +{ > > + if (unlikely(gfp_mask & __GFP_NOMEMALLOC)) > > + return 0; > > if (gfp_mask & __GFP_MEMALLOC) > > - return true; > > + return ALLOC_NO_WATERMARKS; > > if (in_serving_softirq() && (current->flags & PF_MEMALLOC)) > > - return true; > > - if (!in_interrupt() && > > - ((current->flags & PF_MEMALLOC) || > > - unlikely(test_thread_flag(TIF_MEMDIE)))) > > - return true; > > + return ALLOC_NO_WATERMARKS; > > + if (!in_interrupt()) { > > + if (current->flags & PF_MEMALLOC) > > + return ALLOC_NO_WATERMARKS; > > + else if (oom_reserves_allowed(current)) > > + return ALLOC_OOM; > > + } > > > > - return false; > > + return 0; > > +} > > + > > +bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) > > +{ > > + return __gfp_pfmemalloc_flags(gfp_mask) > 0; > > } > > Very subtle sign casing error here. If the flags ever use the high bit, > this wraps and fails. It "shouldn't be possible" but you could just remove > the "> 0" there to be on the safe side or have __gfp_pfmemalloc_flags > return unsigned. what about return !!__gfp_pfmemalloc_flags(gfp_mask); > > /* > > @@ -3770,6 +3804,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > unsigned long alloc_start = jiffies; > > unsigned int stall_timeout = 10 * HZ; > > unsigned int cpuset_mems_cookie; > > + int reserves; > > > > This should be explicitly named to indicate it's about flags and not the > number of reserve pages or something else wacky. s@reserves@reserve_flags@? > > /* > > * In the slowpath, we sanity check order to avoid ever trying to > > @@ -3875,15 +3910,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > if (gfp_mask & __GFP_KSWAPD_RECLAIM) > > wake_all_kswapds(order, ac); > > > > - if (gfp_pfmemalloc_allowed(gfp_mask)) > > - alloc_flags = ALLOC_NO_WATERMARKS; > > + reserves = __gfp_pfmemalloc_flags(gfp_mask); > > + if (reserves) > > + alloc_flags = reserves; > > > > And if it's reserve_flags you can save a branch with > > reserve_flags = __gfp_pfmemalloc_flags(gfp_mask); > alloc_pags |= reserve_flags; > > It won't make much difference considering how branch-intensive the allocator > is anyway. I was actually considering that but rather didn't want to do it because I wanted to reset alloc_flags rather than create strange ALLOC_$FOO combinations which would be harder to evaluate. > > /* > > * Reset the zonelist iterators if memory policies can be ignored. > > * These allocations are high priority and system rather than user > > * orientated. > > */ > > - if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) { > > + if (!(alloc_flags & ALLOC_CPUSET) || reserves) { > > ac->zonelist = node_zonelist(numa_node_id(), gfp_mask); > > ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, > > ac->high_zoneidx, ac->nodemask); > > @@ -3960,8 +3996,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > goto got_pg; > > > > /* Avoid allocations with no watermarks from looping endlessly */ > > - if (test_thread_flag(TIF_MEMDIE) && > > - (alloc_flags == ALLOC_NO_WATERMARKS || > > + if (tsk_is_oom_victim(current) && > > + (alloc_flags == ALLOC_OOM || > > (gfp_mask & __GFP_NOMEMALLOC))) > > goto nopage; > > > > Mostly I only found nit-picks so whether you address them or not > > Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> Thanks a lot for your review. Here is an incremental diff on top --- commit 8483306e108a25441d796a8da3df5b85d633d859 Author: Michal Hocko <mhocko@xxxxxxxx> Date: Thu Aug 3 12:58:49 2017 +0200 fold me "mm, oom: do not rely on TIF_MEMDIE for memory reserves access" - clarify access to memory reserves in __zone_watermark_ok - per Mel - make int->bool conversion in gfp_pfmemalloc_allowed more robust - per Mel - s@reserves@reserve_flags@ in __alloc_pages_slowpath - per Mel diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 7ae0f6d45614..90e331e4c077 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2948,7 +2948,9 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark, } else { /* * OOM victims can try even harder than normal ALLOC_HARDER - * users + * users on the grounds that it's definitely going to be in + * the exit path shortly and free memory. Any allocation it + * makes during the free path will be small and short-lived. */ if (alloc_flags & ALLOC_OOM) min -= min / 2; @@ -3651,7 +3653,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask) bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) { - return __gfp_pfmemalloc_flags(gfp_mask) > 0; + return !!__gfp_pfmemalloc_flags(gfp_mask); } /* @@ -3804,7 +3806,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, unsigned long alloc_start = jiffies; unsigned int stall_timeout = 10 * HZ; unsigned int cpuset_mems_cookie; - int reserves; + int reserve_flags; /* * In the slowpath, we sanity check order to avoid ever trying to @@ -3910,16 +3912,16 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, if (gfp_mask & __GFP_KSWAPD_RECLAIM) wake_all_kswapds(order, ac); - reserves = __gfp_pfmemalloc_flags(gfp_mask); - if (reserves) - alloc_flags = reserves; + reserve_flags = __gfp_pfmemalloc_flags(gfp_mask); + if (reserve_flags) + alloc_flags = reserve_flags; /* * Reset the zonelist iterators if memory policies can be ignored. * These allocations are high priority and system rather than user * orientated. */ - if (!(alloc_flags & ALLOC_CPUSET) || reserves) { + if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) { ac->zonelist = node_zonelist(numa_node_id(), gfp_mask); ac->preferred_zoneref = first_zones_zonelist(ac->zonelist, ac->high_zoneidx, ac->nodemask); -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>