On Wed, 25 Sep 2019, Michal Hocko wrote:

> I am especially interested about this part. The more I think about
> this the more I am convinced that the underlying problem really is in
> the premature fallback in the fast path.

I appreciate you taking the time to continue to look at this, but I'm
confused about the underlying problem you're referring to: we had no
underlying problem until 5.3 was released, so we need to carry patches
that will revert this behavior (we simply can't tolerate double-digit
memory access latency regressions).

If you're referring to post-5.3 behavior, this appears to override
alloc_hugepage_direct_gfpmask(), but directly in the page allocator:

static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
{
        ...
        /*
         * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
         * specified, to express a general desire to stay on the current

Your patch is setting __GFP_THISNODE for __GFP_DIRECT_RECLAIM: this
allocation will fail in the fastpath for both my case (fragmented local
node) and Andrea's case (out of memory local node).  The first
get_page_from_freelist() will then succeed in the slowpath for both
cases; compaction is not tried for either.

In my case, that results in a perpetual remote access latency that we
can't tolerate.  If Andrea's remote nodes are fragmented or low on
memory, his case encounters swap storms over both the local node and
the remote nodes.

So I'm not really sure what is solved by your patch?

We're not on 5.3, but I can try it and collect data on exactly how
poorly it performs on fragmented *hosts* (not single nodes, but really
the whole system), because swap storms were never fixed here, only
papered over.

> Does the almost-patch below help your workload? It effectively reduces
> the fast path for higher order allocations to the local/requested
> node. The justification is that the watermark check might be too
> strict for those requests as it is primarily order-0 oriented. The low
> watermark target simply has no meaning for higher order requests
> AFAIU. The min-low gap is giving kswapd a chance to balance and be
> more local node friendly while we do not have anything like that in
> compaction.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index ff5484fdbdf9..09036cf55fca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4685,7 +4685,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  {
>  	struct page *page;
>  	unsigned int alloc_flags = ALLOC_WMARK_LOW;
> -	gfp_t alloc_mask; /* The gfp_t that was actually used for allocation */
> +	gfp_t fastpath_mask, alloc_mask; /* The gfp_t that was actually used for allocation */
>  	struct alloc_context ac = { };
>  
>  	/*
> @@ -4698,7 +4698,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  	}
>  
>  	gfp_mask &= gfp_allowed_mask;
> -	alloc_mask = gfp_mask;
> +	fastpath_mask = alloc_mask = gfp_mask;
>  	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
>  		return NULL;
>  
> @@ -4710,8 +4710,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
>  	 */
>  	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp_mask);
>  
> -	/* First allocation attempt */
> -	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
> +	/*
> +	 * First allocation attempt. If we have a high order allocation then do not fall
> +	 * back to a remote node just based on the watermark check on the requested node
> +	 * because compaction might easily free up a requested order and then it would be
> +	 * better to simply go to the slow path.
> +	 * TODO: kcompactd should help here but nobody has woken it up unless we hit the
> +	 * slow path so we might need some tuning there as well.
> +	 */
> +	if (order && (gfp_mask & __GFP_DIRECT_RECLAIM))
> +		fastpath_mask |= __GFP_THISNODE;
> +	page = get_page_from_freelist(fastpath_mask, order, alloc_flags, &ac);
>  	if (likely(page))
>  		goto out;
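
To make that flow concrete, here's a toy user-space model of what I'm
describing; it is not kernel code, and the node states, helper names,
and two-node setup are all made up for illustration.  With the patch
applied, a high-order direct-reclaim request is restricted to the local
node in the fastpath, fails there because the local node is fragmented,
and the slowpath's first freelist retry then hands back a remote page
before compaction has run:

/* Toy model only: hypothetical names and node states, not kernel code. */
#include <stdbool.h>
#include <stdio.h>

enum node_state { NODE_HAS_HIGH_ORDER_PAGE, NODE_FRAGMENTED, NODE_OOM };

struct node {
	int id;
	enum node_state state;
};

/* A node can satisfy a high-order request only if it isn't fragmented/OOM. */
static bool freelist_has_high_order_page(const struct node *n)
{
	return n->state == NODE_HAS_HIGH_ORDER_PAGE;
}

/*
 * Fastpath with the quoted patch applied: a high-order __GFP_DIRECT_RECLAIM
 * request effectively gets __GFP_THISNODE, so only the local node is tried.
 */
static const struct node *fastpath(const struct node *local)
{
	if (freelist_has_high_order_page(local))
		return local;
	return NULL;		/* no remote fallback in the fastpath */
}

/*
 * Slowpath as described above: the first freelist retry drops the node
 * restriction, so a remote page is returned before local compaction runs.
 */
static const struct node *slowpath(const struct node *local,
				   const struct node *remote)
{
	if (freelist_has_high_order_page(remote))
		return remote;	/* perpetual remote access latency */
	if (freelist_has_high_order_page(local))
		return local;	/* only reachable after reclaim/compaction */
	return NULL;
}

int main(void)
{
	struct node local = { 0, NODE_FRAGMENTED };	/* my case */
	struct node remote = { 1, NODE_HAS_HIGH_ORDER_PAGE };
	const struct node *got = fastpath(&local);

	if (!got)
		got = slowpath(&local, &remote);

	printf("hugepage came from node %d (local is node 0)\n",
	       got ? got->id : -1);
	return 0;
}

Built and run as-is, this reports the hugepage coming from node 1,
which is the remote access latency I'm describing; Andrea's case is the
same flow with the local node out of memory instead of fragmented.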