On Wed, Nov 21, 2018 at 03:18:28PM +0100, Vlastimil Babka wrote:
> > 1-socket Skylake machine
> > config-global-dhp__workload_thpfioscale XFS (no special madvise)
> > 4 fio threads, 1 THP allocating thread
> > --------------------------------------
> > 
> > 4.20-rc1 extfrag events < order 9:  1023463
> > 4.20-rc1+patch:                      358574 (65% reduction)
> 
> It would also be nice to have a breakdown of what kind of extfrag events
> these are, mainly to distinguish the number of unmovable/reclaimable
> allocations fragmenting movable pageblocks, as those are the most
> critical ones.
> 

I can include that, but bear in mind that the volume of data is already
extremely high. FWIW, the bulk of the fallbacks in this particular case
happen to be movable.

> > @@ -3253,6 +3268,36 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> >  }
> >  #endif	/* CONFIG_NUMA */
> >  
> > +#ifdef CONFIG_ZONE_DMA32
> > +/*
> > + * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
> > + * fragmentation is subtle. If the preferred zone was HIGHMEM then
> > + * premature use of a lower zone may cause lowmem pressure problems that
> > + * are worse than fragmentation. If the next zone is ZONE_DMA then it is
> > + * probably too small. It only makes sense to spread allocations to avoid
> > + * fragmentation between the Normal and DMA32 zones.
> > + */
> > +static inline unsigned int alloc_flags_nofragment(struct zone *zone)
> > +{
> > +	if (zone_idx(zone) != ZONE_NORMAL)
> > +		return 0;
> > +
> > +	/*
> > +	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
> > +	 * the pointer is within zone->zone_pgdat->node_zones[].
> > +	 */
> > +	if (!populated_zone(--zone))
> > +		return 0;
> 
> How about something along:
> 
> BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
> 

Good plan.
> Also, is this perhaps going against your earlier efforts of speeding up
> the fast path, and maybe it would be faster to just stick a bool into
> struct zone, which would be set true once during zonelist build, only
> for a ZONE_NORMAL with ZONE_DMA32 in the same node?
> 

It does somewhat go against the previous work on the fast path, but we
really did hit the limits of the micro-optimisations there, and the
longer-term consequences of fragmentation are potentially worse than a
few cycles in each fast path. The speedup needed for extremely
high-speed network devices is much larger than a few cycles, so I think
we can take the hit -- at least until a better idea comes along.

> > +
> > +	return ALLOC_NOFRAGMENT;
> > +}
> > +#else
> > +static inline unsigned int alloc_flags_nofragment(struct zone *zone)
> > +{
> > +	return 0;
> > +}
> > +#endif
> > +
> >  /*
> >   * get_page_from_freelist goes through the zonelist trying to allocate
> >   * a page.
> > @@ -3264,11 +3309,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> >  	struct zoneref *z = ac->preferred_zoneref;
> >  	struct zone *zone;
> >  	struct pglist_data *last_pgdat_dirty_limit = NULL;
> > +	bool no_fallback;
> >  
> > +retry:
> 
> Ugh, I think 'z = ac->preferred_zoneref' should be moved here under
> retry. AFAICS without that, the preference of local node over
> fragmentation avoidance doesn't work?
> 

Yup, you're right! In the event of fragmentation of both the Normal and
DMA32 zones, it doesn't restart on the local node and instead falls over
to the remote node prematurely. This is obviously not desirable. I'll
fix it, and thanks for spotting it.

-- 
Mel Gorman
SUSE Labs