On Thu, Nov 22, 2018 at 06:02:10PM +0100, Vlastimil Babka wrote: > On 11/21/18 11:14 AM, Mel Gorman wrote: > > An event that potentially causes external fragmentation problems has > > already been described but there are degrees of severity. A "serious" > > event is defined as one that steals a contiguous range of pages of an order > > lower than fragment_stall_order (PAGE_ALLOC_COSTLY_ORDER by default). If a > > movable allocation request that is allowed to sleep needs to steal a small > > block then it schedules until kswapd makes progress or a timeout passes. > > The watermarks are also boosted slightly faster so that kswapd makes > > greater effort to reclaim enough pages to avoid the fragmentation event. > > > > This stall is not guaranteed to avoid serious fragmentation events. > > If memory pressure is high enough, the pages freed by kswapd may be > > reallocated or the free pages may not be in pageblocks that contain > > only movable pages. Furthermore an allocation request that cannot stall > > (e.g. atomic allocations) or unmovable/reclaimable allocations will still > > proceed without stalling. > > Not doing this for unmovable/reclaimable allocations is kinda disadvantage? > Yes, but this series is primarily aimed at when movable allocations are causing the fragmentation. We stall so that there are compaction targets due to reclaimed pages. The same does not apply to unmovable and reclaimable pages because they cannot compact so sure, we can stall, but I cannot see how it would help. > > ============================================================== > > > > +fragment_stall_order > > + > > +External fragmentation control is managed on a pageblock level where the > > +page allocator tries to avoid mixing pages of different mobility within page > > +blocks (e.g. order 9 on 64-bit x86). If external fragmentation is perfectly > > +controlled then a THP allocation will often succeed up to the number of > > +movable pageblocks in the system as reported by /proc/pagetypeinfo. > > + > > +When memory is low, the system may have to mix pageblocks and will wake > > +kswapd to try control future fragmentation. fragment_stall_order controls if > > +the allocating task will stall if possible until kswapd makes some progress > > +in preference to fragmenting the system. This incurs a small stall penalty > > +in exchange for future success at allocating huge pages. If the stalls > > +are undesirable and high-order allocations are irrelevant then this can > > +be disabled by writing 0 to the tunable. Writing the pageblock order will > > +strongly (but not perfectly) control external fragmentation. > > + > > +The default will stall for fragmenting allocations smaller than the > > +PAGE_ALLOC_COSTLY_ORDER (defined as order-3 at the time of writing). > > Perhaps be more explicit that steals of orders strictly lower than given > value will stall? So for the default order-3, the sysctl value is 4, > which might confuse somebody. > I'll clarify it. > > + > > @@ -2130,9 +2131,10 @@ static bool can_steal_fallback(unsigned int order, int start_mt) > > return false; > > } > > > > + > > +static void stall_fragmentation(struct zone *pzone) > > +{ > > + DEFINE_WAIT(wait); > > + long remaining = 0; > > + long timeout = HZ/50; > > + pg_data_t *pgdat = pzone->zone_pgdat; > > + > > + if (current->flags & PF_MEMALLOC) > > + return; > > + > > + boost_watermark(pzone, true); > > Should zone->lock be taken around this to make watermark_boost > adjustment safe? Similar to balance_pgdat(). > Yeah, that was a relatively late adjustment. The risk is low but it's possible best to be safe. I'm not super-keen that zone->lock protects this but that lock already protects more than it should and there is little motivation to split it just yet. > > + prepare_to_wait(&pgdat->pfmemalloc_wait, &wait, TASK_INTERRUPTIBLE); > > + if (waitqueue_active(&pgdat->kswapd_wait)) > > + wake_up_interruptible(&pgdat->kswapd_wait); > > + remaining = schedule_timeout(timeout); > > + finish_wait(&pgdat->pfmemalloc_wait, &wait); > > + if (remaining != timeout) { > > + trace_mm_fragmentation_stall(pgdat->node_id, > > + jiffies_to_usecs(timeout - remaining)); > > + count_vm_event(FRAGMENTSTALL); > > + } > > } > > > > > @@ -4186,6 +4234,14 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, > > */ > > alloc_flags = gfp_to_alloc_flags(gfp_mask); > > > > + /* > > + * Consider stalling on heavy for movable allocations in preference to > > + * fragmenting unmovable/reclaimable pageblocks. > > + */ > > + if ((gfp_mask & (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) == > > + (__GFP_MOVABLE|__GFP_DIRECT_RECLAIM)) > > + alloc_flags |= ALLOC_FRAGMENT_STALL; > > Surprised that this only has effect in the slowpath, i.e. when > watermarks are below 'low'. If it's intended (to not stall that much I > suppose) maybe explain the rationale in the changelog? > Well, it's the same path when stalls can happen on direct reclaim so I didn't think it needed to be explicitly called out. The slowpath is also the "we can stall if the context allows" path so this is checking that it's a compatible cont3xt. > Thanks for the series, Mel, hope the results are still optimistic after > some of the fixes that might unfortunately limit its impact :) > Preliminary results indicate they are slightly worse but slightly worse than 90+% is still better than nothing so I'm reasonably optimistic! Thanks for the careful review and catching a lot of issues! -- Mel Gorman SUSE Labs