On Thu, Sep 16, 2010 at 07:01:32PM +0900, KOSAKI Motohiro wrote:
> Cc to linux-mm and hpc guys. and intetionally full quote.
>
>
> > So over the last couple of weeks, I've noticed that our shiny new IMAP
> > servers (Dual Xeon E5520 + Intel S5520UR MB) with 48G of RAM haven't
> > been performing as well as expected, and there were some big oddities.
> > Namely two things stuck out:
> >
> > 1. There was free memory. There's 20T of data on these machines. The
> >    kernel should have used lots of memory for caching, but for some
> >    reason, it wasn't. cache ~ 2G, buffers ~ 25G, unused ~ 5G
> > 2. The machine has an SSD for very hot data. In total, there's about 16G
> >    of data on the SSD. Almost all of that 16G of data should end up
> >    being cached, so there should be little reading from the SSDs at all.
> >    Instead we saw at peak times 2k+ blocks read/s from the SSDs. Again a
> >    sign that caching wasn't working.
> >
> > After a bunch of googling, I found this thread.
> >
> > http://lkml.org/lkml/2009/5/12/586
> >
> > It appears that patch never went anywhere, and zone_reclaim_mode is
> > still defaulting to 1 on our pretty standard file/email/web server type
> > machine with a NUMA kernel.
> >
> > By changing it to 0, we saw an immediate massive change in caching
> > behaviour. Now cache ~ 27G, buffers ~ 7G and unused ~ 0.2G, and IO reads
> > from the SSD dropped to 100/s instead of 2000/s.
> >
> > Having very little knowledge of what this actually does, I'd just
> > like to point out that from a users point of view, it's really
> > annoying for your machine to be crippled by a default kernel setting
> > that's pretty obscure.
> >
> > I don't think our usage scenario of serving lots of files is that
> > uncommon, every file server/email server/web server will be doing pretty
> > much that and expecting a large part of their memory to be used as a
> > cache, which clearly isn't what actually happens.
> >
> > Rob
> > Rob Mueller
> > robm@xxxxxxxxxxx
> >
>
> Yes, sadly intel motherboard turn on zone_reclaim_mode by default. and
> current zone_reclaim_mode doesn't fit file/web server usecase ;-)
>
> So, I've created new proof concept patch. This doesn't disable zone_reclaim
> at all. Instead, distinguish for file cache and for anon allocation and
> only file cache doesn't use zone-reclaim.
>
> That said, high-end hpc user often turn on cpuset.memory_spread_page and
> they avoid this issue. But, why don't we consider avoid it by default?
>
>
> Rob, I wonder if following patch help you. Could you please try it?
>
>
> Subject: [RFC] vmscan: file cache doesn't use zone_reclaim by default
>
> ---
> Need to removed debbuging piece.
>
>  Documentation/sysctl/vm.txt |    7 +++----
>  fs/inode.c                  |    2 +-
>  include/linux/gfp.h         |    9 +++++++--
>  include/linux/mmzone.h      |    2 ++
>  include/linux/swap.h        |    6 ++++++
>  mm/filemap.c                |    1 +
>  mm/page_alloc.c             |    8 +++++++-
>  mm/vmscan.c                 |    7 ++-----
>  mm/vmstat.c                 |    2 ++
>  9 files changed, 31 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b606c2c..4be569e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -671,16 +671,15 @@ This is value ORed together of
>  1      = Zone reclaim on
>  2      = Zone reclaim writes dirty pages out
>  4      = Zone reclaim swaps pages
> +8      = Zone reclaim for file cache on
>
>  zone_reclaim_mode is set during bootup to 1 if it is determined that pages
>  from remote zones will cause a measurable performance reduction. The
>  page allocator will then reclaim easily reusable pages (those page
>  cache pages that are currently not used) before allocating off node pages.
>
> -It may be beneficial to switch off zone reclaim if the system is
> -used for a file server and all of memory should be used for caching files
> -from disk. In that case the caching effect is more important than
> -data locality.
> +By default, for file cache allocation doesn't use zone reclaim. But
> +It can be turned on manually.
>
>  Allowing zone reclaim to write out pages stops processes that are
>  writing large amounts of data from dirtying pages on other nodes. Zone
> diff --git a/fs/inode.c b/fs/inode.c
> index 8646433..02a51b1 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -166,7 +166,7 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>          mapping->a_ops = &empty_aops;
>          mapping->host = inode;
>          mapping->flags = 0;
> -        mapping_set_gfp_mask(mapping, GFP_HIGHUSER_MOVABLE);
> +        mapping_set_gfp_mask(mapping, GFP_FILE_CACHE);
>          mapping->assoc_mapping = NULL;
>          mapping->backing_dev_info = &default_backing_dev_info;
>          mapping->writeback_index = 0;
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 975609c..f263b1f 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -84,6 +84,10 @@ struct vm_area_struct;
>  #define GFP_HIGHUSER_MOVABLE    (__GFP_WAIT | __GFP_IO | __GFP_FS | \
>                                   __GFP_HARDWALL | __GFP_HIGHMEM | \
>                                   __GFP_MOVABLE)
> +
> +#define GFP_FILE_CACHE  (GFP_HIGHUSER | __GFP_RECLAIMABLE | __GFP_MOVABLE)
> +

This mask of both __GFP_RECLAIMABLE and __GFP_MOVABLE makes no sense at
all in terms of fragmentation avoidance. In fact, I'm surprised it didn't
trigger the warning in allocflags_to_migratetype() during your testing.

> +
>  #define GFP_IOFS        (__GFP_IO | __GFP_FS)
>
>  #ifdef CONFIG_NUMA
> @@ -120,11 +124,12 @@ struct vm_area_struct;
>  /* Convert GFP flags to their corresponding migrate type */
>  static inline int allocflags_to_migratetype(gfp_t gfp_flags)
>  {
> -        WARN_ON((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK);
> -

Ah, you deleted the check.

>          if (unlikely(page_group_by_mobility_disabled))
>                  return MIGRATE_UNMOVABLE;
>
> +        if ((gfp_flags & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK)
> +                gfp_flags &= ~__GFP_RECLAIMABLE;
> +

So you delete the flag, maybe it's obvious why later.
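
To make the objection concrete, here is a small userspace sketch of the
grouping arithmetic quoted in this hunk. It is not kernel code: the
SKETCH_* names and bit values are made up for illustration, and the 0/1/2
numbering of the mobility types is an assumption about kernels of this
era. It only shows that setting both bits yields index 3, which names
neither the movable nor the reclaimable free list, and that the patch
sidesteps this by masking off the reclaimable bit first.

```c
/* Illustrative bit values only; the real ones live in include/linux/gfp.h. */
#define SKETCH_GFP_RECLAIMABLE 0x080000u
#define SKETCH_GFP_MOVABLE     0x100000u
#define SKETCH_MOVABLE_MASK    (SKETCH_GFP_RECLAIMABLE | SKETCH_GFP_MOVABLE)

/* Assumed numbering of the mobility types (hypothetical names). */
enum sketch_migratetype {
        SKETCH_UNMOVABLE   = 0,
        SKETCH_RECLAIMABLE = 1,
        SKETCH_MOVABLE     = 2,
        SKETCH_BOTH_SET    = 3  /* neither movable nor reclaimable grouping */
};

/* Same arithmetic as the "Group based on mobility" return statement. */
static inline int sketch_migratetype_unchecked(unsigned int gfp_flags)
{
        return (((gfp_flags & SKETCH_GFP_MOVABLE) != 0) << 1) |
                ((gfp_flags & SKETCH_GFP_RECLAIMABLE) != 0);
}

/* What the patch does: drop RECLAIMABLE when both bits are set. */
static inline int sketch_migratetype_patched(unsigned int gfp_flags)
{
        if ((gfp_flags & SKETCH_MOVABLE_MASK) == SKETCH_MOVABLE_MASK)
                gfp_flags &= ~SKETCH_GFP_RECLAIMABLE;
        return sketch_migratetype_unchecked(gfp_flags);
}
```

With both bits set, the unchecked helper returns 3 and the patched one
returns 2 (movable), so in this sketch the extra bit carries no grouping
information and only acts as a marker.
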
>          /* Group based on mobility */
>          return (((gfp_flags & __GFP_MOVABLE) != 0) << 1) |
>                  ((gfp_flags & __GFP_RECLAIMABLE) != 0);
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 6e6e626..2eead52 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -112,6 +112,8 @@ enum zone_stat_item {
>          NUMA_LOCAL,             /* allocation from local node */
>          NUMA_OTHER,             /* allocation from other node */
>  #endif
> +        NR_ZONE_CACHE_AVOID,
> +        NR_ZONE_RECLAIM,
>          NR_VM_ZONE_STAT_ITEMS };
>
>  /*
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fee51a..487bc3b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -65,6 +65,12 @@ static inline int current_is_kswapd(void)
>  #define MAX_SWAPFILES \
>          ((1 << MAX_SWAPFILES_SHIFT) - SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)
>
> +#define RECLAIM_OFF 0
> +#define RECLAIM_ZONE (1<<0)     /* Run shrink_inactive_list on the zone */
> +#define RECLAIM_WRITE (1<<1)    /* Writeout pages during reclaim */
> +#define RECLAIM_SWAP (1<<2)     /* Swap pages out during reclaim */
> +#define RECLAIM_CACHE (1<<3)    /* Reclaim even though file cache purpose allocation */
> +
>  /*
>   * Magic header for a swap area. The first part of the union is
>   * what the swap magic looks like for the old (limited to 128MB)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 3d4df44..97298c0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -468,6 +468,7 @@ struct page *__page_cache_alloc(gfp_t gfp)
>          if (cpuset_do_page_mem_spread()) {
>                  get_mems_allowed();
>                  n = cpuset_mem_spread_node();
> +                gfp &= ~__GFP_RECLAIMABLE;
>                  page = alloc_pages_exact_node(n, gfp, 0);
>                  put_mems_allowed();
>                  return page;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8587c10..f81c28f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1646,9 +1646,15 @@ zonelist_scan:
>                                      classzone_idx, alloc_flags))
>                                  goto try_this_zone;
>
> -                        if (zone_reclaim_mode == 0)
> +                        if (zone_reclaim_mode == RECLAIM_OFF)
>                                  goto this_zone_full;
>
> +                        if (!(zone_reclaim_mode & RECLAIM_CACHE) &&
> +                            (gfp_mask & GFP_MOVABLE_MASK) == GFP_MOVABLE_MASK) {
> +                                inc_zone_state(zone, NR_ZONE_CACHE_AVOID);
> +                                goto try_next_zone;
> +                        }
> +

That doesn't look very nice. There has to be a better way of identifying
what sort of allocations to avoid reclaim_mode for than passing in a
meaningless combination of migrate flags. Are we out of GFP flags? Whether
it is one that specifies it's an allocation for file-backed page cache or
something that indicates reclaim_mode is unnecessary, I don't really mind
but it shouldn't be magically encoded in the migrate flags.

I don't think we will ever get the default value for this tunable right.
I would also worry that avoiding the reclaim_mode for file-backed cache
will hurt HPC applications that are dumping their data to disk and
depending on the existing default for zone_reclaim_mode to not pollute
other nodes.

The ideal would be if distribution packages for mail, web servers and
others that are heavily IO orientated would prompt for a change to the
default value of zone_reclaim_mode in sysctl.
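
One way to read the suggestion above, as a userspace sketch rather than a
proposed patch: keep the RECLAIM_* bit values from the quoted swap.h hunk,
but test a dedicated flag instead of the migrate-flag combination.
SKETCH_GFP_FILE_CACHE, its bit value and the sketch_* helper are
hypothetical names introduced only for this illustration; the three
outcomes mirror the goto targets in the quoted page_alloc.c hunk.

```c
/* Bit values as in the quoted swap.h hunk, renamed to mark them as a sketch. */
#define SKETCH_RECLAIM_OFF   0u
#define SKETCH_RECLAIM_ZONE  (1u << 0)  /* zone reclaim on */
#define SKETCH_RECLAIM_WRITE (1u << 1)  /* write out dirty pages */
#define SKETCH_RECLAIM_SWAP  (1u << 2)  /* swap pages out */
#define SKETCH_RECLAIM_CACHE (1u << 3)  /* reclaim for file cache too */

/* Hypothetical dedicated flag: "this allocation is file-backed page cache". */
#define SKETCH_GFP_FILE_CACHE 0x1u

enum sketch_action {
        SKETCH_ZONE_FULL,       /* corresponds to goto this_zone_full */
        SKETCH_TRY_NEXT_ZONE,   /* corresponds to goto try_next_zone */
        SKETCH_DO_ZONE_RECLAIM  /* fall through to zone_reclaim() */
};

/* Decide what to do when the local zone fails its watermark check. */
static inline enum sketch_action
sketch_zone_decision(unsigned int reclaim_mode, unsigned int gfp_mask)
{
        if (reclaim_mode == SKETCH_RECLAIM_OFF)
                return SKETCH_ZONE_FULL;

        /* File cache skips zone reclaim unless bit 8 is set in the mode. */
        if (!(reclaim_mode & SKETCH_RECLAIM_CACHE) &&
            (gfp_mask & SKETCH_GFP_FILE_CACHE))
                return SKETCH_TRY_NEXT_ZONE;

        return SKETCH_DO_ZONE_RECLAIM;
}
```

In this sketch a mode of 1 sends file cache allocations to the next zone
while other allocations still reclaim locally, and a mode of 9 (1 | 8)
restores local reclaim for file cache as well.
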
>                          ret = zone_reclaim(zone, gfp_mask, order);
>                          switch (ret) {
>                          case ZONE_RECLAIM_NOSCAN:
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c391c32..6f63eea 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2558,11 +2558,6 @@ module_init(kswapd_init)
>   */
>  int zone_reclaim_mode __read_mostly;
>
> -#define RECLAIM_OFF 0
> -#define RECLAIM_ZONE (1<<0)     /* Run shrink_inactive_list on the zone */
> -#define RECLAIM_WRITE (1<<1)    /* Writeout pages during reclaim */
> -#define RECLAIM_SWAP (1<<2)     /* Swap pages out during reclaim */
> -
>  /*
>   * Priority for ZONE_RECLAIM. This determines the fraction of pages
>   * of a node considered for each zone_reclaim. 4 scans 1/16th of
> @@ -2646,6 +2641,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
>          };
>          unsigned long nr_slab_pages0, nr_slab_pages1;
>
> +        inc_zone_state(zone, NR_ZONE_RECLAIM);
> +
>          cond_resched();
>          /*
>           * We need to be able to allocate from the reserves for RECLAIM_SWAP
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index f389168..8988688 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -740,6 +740,8 @@ static const char * const vmstat_text[] = {
>          "numa_local",
>          "numa_other",
>  #endif
> +        "zone_cache_avoid",
> +        "zone_reclaim",
>
>  #ifdef CONFIG_VM_EVENT_COUNTERS
>          "pgpgin",
> --
> 1.6.5.2
>
>
>
>

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab