On Sat 28-10-17 17:07:09, Tetsuo Handa wrote:
> This patch splits the last second allocation attempt into two locations,
> once before selecting an OOM victim and again after selecting an OOM
> victim, and uses the normal watermark for last second allocation
> attempts.

Why do we need both?

> As of linux-2.6.11, nothing prevented out_of_memory() from being called
> concurrently. The TIF_MEMDIE test in select_bad_process() tried to avoid
> needless OOM killing. Thus, it was safe to do a __GFP_DIRECT_RECLAIM
> allocation (apart from which watermark should be used) just before
> calling out_of_memory().
>
> As of linux-2.6.24, try_set_zone_oom() was added to
> __alloc_pages_may_oom() by commit ff0ceb9deb6eb017 ("oom: serialize out
> of memory calls"), which effectively started acting as a kind of today's
> mutex_trylock(&oom_lock).
>
> As of linux-4.2, try_set_zone_oom() was replaced with oom_lock by
> commit dc56401fc9f25e8f ("mm: oom_kill: simplify OOM killer locking").
> At least by this time, it was no longer safe to do a
> __GFP_DIRECT_RECLAIM allocation with oom_lock held.
>
> And as of linux-4.13, the last second allocation attempt stopped using
> __GFP_DIRECT_RECLAIM by commit e746bf730a76fe53 ("mm,page_alloc: don't
> call __node_reclaim() with oom_lock held.").
>
> Therefore, there is no longer a valid reason to use ALLOC_WMARK_HIGH for
> the last second allocation attempt [1].

Another reason to use the high watermark, as explained by Andrea, was:
"
: Elaborating the comment: the reason for the high wmark is to reduce
: the likelihood of livelocks and be sure to invoke the OOM killer, if
: we're still under pressure and reclaim just failed. The high wmark is
: used to be sure the failure of reclaim isn't going to be ignored. If
: using the min wmark like you propose there's risk of livelock or
: anyway of delayed OOM killer invocation.
"
How is that affected by the locking changes you describe above?

> This patch therefore changes to a normal allocation attempt, with
> ALLOC_OOM handling added in order to mitigate the extra OOM victim
> selection problem reported by Manish Jaggi [2].
>
> Doing the really last second allocation attempt after selecting an OOM
> victim will also help the OOM reaper to start reclaiming memory without
> waiting for oom_lock to be released.

The changelog is much more obscure than it really needs to be. You fail
to explain _why_ we need this and _what_ the actual problem is. You are
simply drowning in details here (btw. this is not the first time your
changelogs have had this issue). Try to focus on _what_ the problem is,
_why_ we care, and _how_ you are addressing it.

[...]

> +struct page *alloc_pages_before_oomkill(struct oom_control *oc)
> +{
> +	/*
> +	 * Make sure that this allocation attempt does not depend on a
> +	 * __GFP_DIRECT_RECLAIM && !__GFP_NORETRY allocation, for the
> +	 * caller is already holding oom_lock.
> +	 */
> +	const gfp_t gfp_mask = oc->gfp_mask & ~__GFP_DIRECT_RECLAIM;
> +	struct alloc_context *ac = oc->ac;
> +	unsigned int alloc_flags = gfp_to_alloc_flags(gfp_mask);
> +	const int reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> +
> +	/* Need to update zonelist if selected as OOM victim. */
> +	if (reserve_flags) {
> +		alloc_flags = reserve_flags;
> +		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
> +		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> +				ac->high_zoneidx, ac->nodemask);
> +	}

Why do we need this zone list rebuilding?
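For reference, reserve_flags here can only come from
__gfp_pfmemalloc_flags(). As I read mm/page_alloc.c after "mm, oom: do
not rely on TIF_MEMDIE for memory reserves access", it looks roughly
like this (a sketch from my reading, so double check against your tree):

static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
{
	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
		return 0;			/* reserves are forbidden */
	if (gfp_mask & __GFP_MEMALLOC)
		return ALLOC_NO_WATERMARKS;
	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
		return ALLOC_NO_WATERMARKS;
	if (!in_interrupt()) {
		if (current->flags & PF_MEMALLOC)
			return ALLOC_NO_WATERMARKS;
		else if (oom_reserves_allowed(current))
			return ALLOC_OOM;	/* victims get a part of reserves */
	}

	return 0;
}

IOW, ALLOC_OOM resp. ALLOC_NO_WATERMARKS only relax the watermark check
in get_page_from_freelist(); neither says anything about which zonelist
to use, so the rebuild above needs a justification at the very least.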
> +	return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, ac);
> +}
> +
>  static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  		int preferred_nid, nodemask_t *nodemask,
>  		struct alloc_context *ac, gfp_t *alloc_mask,
> --
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs