On Fri 30-09-11 09:17:22, Johannes Weiner wrote: > The maximum number of dirty pages that exist in the system at any time > is determined by a number of pages considered dirtyable and a > user-configured percentage of those, or an absolute number in bytes. > > This number of dirtyable pages is the sum of memory provided by all > the zones in the system minus their lowmem reserves and high > watermarks, so that the system can retain a healthy number of free > pages without having to reclaim dirty pages. > > But there is a flaw in that we have a zoned page allocator which does > not care about the global state but rather the state of individual > memory zones. And right now there is nothing that prevents one zone > from filling up with dirty pages while other zones are spared, which > frequently leads to situations where kswapd, in order to restore the > watermark of free pages, does indeed have to write pages from that > zone's LRU list. This can interfere so badly with IO from the flusher > threads that major filesystems (btrfs, xfs, ext4) mostly ignore write > requests from reclaim already, taking away the VM's only possibility > to keep such a zone balanced, aside from hoping the flushers will soon > clean pages from that zone. > > Enter per-zone dirty limits. They are to a zone's dirtyable memory > what the global limit is to the global amount of dirtyable memory, and > try to make sure that no single zone receives more than its fair share > of the globally allowed dirty pages in the first place. As the number > of pages considered dirtyable exclude the zones' lowmem reserves and > high watermarks, the maximum number of dirty pages in a zone is such > that the zone can always be balanced without requiring page cleaning. > > As this is a placement decision in the page allocator and pages are > dirtied only after the allocation, this patch allows allocators to > pass __GFP_WRITE when they know in advance that the page will be > written to and become dirty soon. The page allocator will then > attempt to allocate from the first zone of the zonelist - which on > NUMA is determined by the task's NUMA memory policy - that has not > exceeded its dirty limit. > > At first glance, it would appear that the diversion to lower zones can > increase pressure on them, but this is not the case. With a full high > zone, allocations will be diverted to lower zones eventually, so it is > more of a shift in timing of the lower zone allocations. Workloads > that previously could fit their dirty pages completely in the higher > zone may be forced to allocate from lower zones, but the amount of > pages that 'spill over' are limited themselves by the lower zones' > dirty constraints, and thus unlikely to become a problem. > > For now, the problem of unfair dirty page distribution remains for > NUMA configurations where the zones allowed for allocation are in sum > not big enough to trigger the global dirty limits, wake up the flusher > threads and remedy the situation. Because of this, an allocation that > could not succeed on any of the considered zones is allowed to ignore > the dirty limits before going into direct reclaim or even failing the > allocation, until a future patch changes the global dirty throttling > and flusher thread activation so that they take individual zone states > into account. > > Signed-off-by: Johannes Weiner <jweiner@xxxxxxxxxx> > Reviewed-by: Minchan Kim <minchan.kim@xxxxxxxxx> > Acked-by: Mel Gorman <mgorman@xxxxxxx> Nice Reviewed-by: Michal Hocko <mhocko@xxxxxxx> > --- > include/linux/gfp.h | 4 ++- > include/linux/writeback.h | 1 + > mm/page-writeback.c | 83 +++++++++++++++++++++++++++++++++++++++++++++ > mm/page_alloc.c | 29 ++++++++++++++++ > 4 files changed, 116 insertions(+), 1 deletions(-) > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h > index 3a76faf..50efc7e 100644 > --- a/include/linux/gfp.h > +++ b/include/linux/gfp.h > @@ -36,6 +36,7 @@ struct vm_area_struct; > #endif > #define ___GFP_NO_KSWAPD 0x400000u > #define ___GFP_OTHER_NODE 0x800000u > +#define ___GFP_WRITE 0x1000000u > > /* > * GFP bitmasks.. > @@ -85,6 +86,7 @@ struct vm_area_struct; > > #define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD) > #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */ > +#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */ > > /* > * This may seem redundant, but it's a way of annotating false positives vs. > @@ -92,7 +94,7 @@ struct vm_area_struct; > */ > #define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK) > > -#define __GFP_BITS_SHIFT 24 /* Room for N __GFP_FOO bits */ > +#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */ > #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) > > /* This equals 0, but use constants in case they ever change */ > diff --git a/include/linux/writeback.h b/include/linux/writeback.h > index a5f495f..c96ee0c 100644 > --- a/include/linux/writeback.h > +++ b/include/linux/writeback.h > @@ -104,6 +104,7 @@ void laptop_mode_timer_fn(unsigned long data); > static inline void laptop_sync_completion(void) { } > #endif > void throttle_vm_writeout(gfp_t gfp_mask); > +bool zone_dirty_ok(struct zone *zone); > > extern unsigned long global_dirty_limit; > > diff --git a/mm/page-writeback.c b/mm/page-writeback.c > index 78604a6..f60fd57 100644 > --- a/mm/page-writeback.c > +++ b/mm/page-writeback.c > @@ -159,6 +159,25 @@ static struct prop_descriptor vm_dirties; > * We make sure that the background writeout level is below the adjusted > * clamping level. > */ > + > +/* > + * In a memory zone, there is a certain amount of pages we consider > + * available for the page cache, which is essentially the number of > + * free and reclaimable pages, minus some zone reserves to protect > + * lowmem and the ability to uphold the zone's watermarks without > + * requiring writeback. > + * > + * This number of dirtyable pages is the base value of which the > + * user-configurable dirty ratio is the effictive number of pages that > + * are allowed to be actually dirtied. Per individual zone, or > + * globally by using the sum of dirtyable pages over all zones. > + * > + * Because the user is allowed to specify the dirty limit globally as > + * absolute number of bytes, calculating the per-zone dirty limit can > + * require translating the configured limit into a percentage of > + * global dirtyable memory first. > + */ > + > static unsigned long highmem_dirtyable_memory(unsigned long total) > { > #ifdef CONFIG_HIGHMEM > @@ -245,6 +264,70 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) > trace_global_dirty_state(background, dirty); > } > > +/** > + * zone_dirtyable_memory - number of dirtyable pages in a zone > + * @zone: the zone > + * > + * Returns the zone's number of pages potentially available for dirty > + * page cache. This is the base value for the per-zone dirty limits. > + */ > +static unsigned long zone_dirtyable_memory(struct zone *zone) > +{ > + /* > + * The effective global number of dirtyable pages may exclude > + * highmem as a big-picture measure to keep the ratio between > + * dirty memory and lowmem reasonable. > + * > + * But this function is purely about the individual zone and a > + * highmem zone can hold its share of dirty pages, so we don't > + * care about vm_highmem_is_dirtyable here. > + */ > + return zone_page_state(zone, NR_FREE_PAGES) + > + zone_reclaimable_pages(zone) - > + zone->dirty_balance_reserve; > +} > + > +/** > + * zone_dirty_limit - maximum number of dirty pages allowed in a zone > + * @zone: the zone > + * > + * Returns the maximum number of dirty pages allowed in a zone, based > + * on the zone's dirtyable memory. > + */ > +static unsigned long zone_dirty_limit(struct zone *zone) > +{ > + unsigned long zone_memory = zone_dirtyable_memory(zone); > + struct task_struct *tsk = current; > + unsigned long dirty; > + > + if (vm_dirty_bytes) > + dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) * > + zone_memory / global_dirtyable_memory(); > + else > + dirty = vm_dirty_ratio * zone_memory / 100; > + > + if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) > + dirty += dirty / 4; > + > + return dirty; > +} > + > +/** > + * zone_dirty_ok - tells whether a zone is within its dirty limits > + * @zone: the zone to check > + * > + * Returns %true when the dirty pages in @zone are within the zone's > + * dirty limit, %false if the limit is exceeded. > + */ > +bool zone_dirty_ok(struct zone *zone) > +{ > + unsigned long limit = zone_dirty_limit(zone); > + > + return zone_page_state(zone, NR_FILE_DIRTY) + > + zone_page_state(zone, NR_UNSTABLE_NFS) + > + zone_page_state(zone, NR_WRITEBACK) <= limit; > +} > + > /* > * couple the period to the dirty_ratio: > * > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index f8cba89..afaf59e 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1675,6 +1675,35 @@ zonelist_scan: > if ((alloc_flags & ALLOC_CPUSET) && > !cpuset_zone_allowed_softwall(zone, gfp_mask)) > continue; > + /* > + * When allocating a page cache page for writing, we > + * want to get it from a zone that is within its dirty > + * limit, such that no single zone holds more than its > + * proportional share of globally allowed dirty pages. > + * The dirty limits take into account the zone's > + * lowmem reserves and high watermark so that kswapd > + * should be able to balance it without having to > + * write pages from its LRU list. > + * > + * This may look like it could increase pressure on > + * lower zones by failing allocations in higher zones > + * before they are full. But the pages that do spill > + * over are limited as the lower zones are protected > + * by this very same mechanism. It should not become > + * a practical burden to them. > + * > + * XXX: For now, allow allocations to potentially > + * exceed the per-zone dirty limit in the slowpath > + * (ALLOC_WMARK_LOW unset) before going into reclaim, > + * which is important when on a NUMA setup the allowed > + * zones are together not big enough to reach the > + * global limit. The proper fix for these situations > + * will require awareness of zones in the > + * dirty-throttling and the flusher threads. > + */ > + if ((alloc_flags & ALLOC_WMARK_LOW) && > + (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone)) > + goto this_zone_full; > > BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK); > if (!(alloc_flags & ALLOC_NO_WATERMARKS)) { > -- > 1.7.6.2 > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@xxxxxxxxx. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/ > Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> -- Michal Hocko SUSE Labs SUSE LINUX s.r.o. Lihovarska 1060/12 190 00 Praha 9 Czech Republic -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html