Re: [PATCH 1/9] mm, page_alloc: rip out ZONELIST_ORDER_ZONE

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Ups, linux-api, didn't make it into the CC list. Let's add it here.
The cover which will give you some context is here
http://lkml.kernel.org/r/20170721143915.14161-1-mhocko@xxxxxxxxxx

On Fri 21-07-17 16:39:07, Michal Hocko wrote:
> From: Michal Hocko <mhocko@xxxxxxxx>
> 
> Supporting zone ordered zonelists costs us just a lot of code while
> the usefulness is arguable if existent at all. Mel has already made
> node ordering default on 64b systems. 32b systems are still using
> ZONELIST_ORDER_ZONE because it is considered better to fallback to
> a different NUMA node rather than consume precious lowmem zones.
> 
> This argument is, however, weaken by the fact that the memory reclaim
> has been reworked to be node rather than zone oriented. This means
> that lowmem requests have to skip over all highmem pages on LRUs already
> and so zone ordering doesn't save the reclaim time much. So the only
> advantage of the zone ordering is under a light memory pressure when
> highmem requests do not ever hit into lowmem zones and the lowmem
> pressure doesn't need to reclaim.
> 
> Considering that 32b NUMA systems are rather suboptimal already and
> it is generally advisable to use 64b kernel on such a HW I believe we
> should rather care about the code maintainability and just get rid of
> ZONELIST_ORDER_ZONE altogether. Keep systcl in place and warn if
> somebody tries to set zone ordering either from kernel command line
> or the sysctl.
> 
> Changes since v1
> - do not print Default in sysctl handler because our behavior was rather
>   inconsistent in the past numa_zonelist_order was lowecase while
>   zonelist_order_name was uppercase so boot time unchanged value woul
>   print lowercase while updated value could be uppercase. Print "Node"
>   which is the default instead - Mel
> - update documentation - Vlastimil
> 
> Cc: linux-api@xxxxxxxxxxxxxxx
> Acked-by: Mel Gorman <mgorman@xxxxxxx>
> Acked-by: Vlastimil Babka <vbabka@xxxxxxx>
> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |   2 +-
>  Documentation/sysctl/vm.txt                     |   4 +-
>  Documentation/vm/numa                           |   7 +-
>  include/linux/mmzone.h                          |   2 -
>  kernel/sysctl.c                                 |   2 -
>  mm/page_alloc.c                                 | 178 +++---------------------
>  6 files changed, 29 insertions(+), 166 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 60530c8490ff..28f1a0f84456 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2724,7 +2724,7 @@
>  			Allowed values are enable and disable
>  
>  	numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
> -			one of ['zone', 'node', 'default'] can be specified
> +			'node', 'default' can be specified
>  			This can be set from sysctl after boot.
>  			See Documentation/sysctl/vm.txt for details.
>  
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 48244c42ff52..9baf66a9ef4e 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
>  
>  numa_zonelist_order
>  
> -This sysctl is only for NUMA.
> +This sysctl is only for NUMA and it is deprecated. Anything but
> +Node order will fail!
> +
>  'where the memory is allocated from' is controlled by zonelists.
>  (This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
>   you may be able to read ZONE_DMA as ZONE_DMA32...)
> diff --git a/Documentation/vm/numa b/Documentation/vm/numa
> index a08f71647714..a31b85b9bb88 100644
> --- a/Documentation/vm/numa
> +++ b/Documentation/vm/numa
> @@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
>  fall back to the same zone type on a different node, or to a different zone
>  type on the same node.  This is an important consideration because some zones,
>  such as DMA or DMA32, represent relatively scarce resources.  Linux chooses
> -a default zonelist order based on the sizes of the various zone types relative
> -to the total memory of the node and the total memory of the system.  The
> -default zonelist order may be overridden using the numa_zonelist_order kernel
> -boot parameter or sysctl.  [see Documentation/admin-guide/kernel-parameters.rst and
> -Documentation/sysctl/vm.txt]
> +a default Node ordered zonelist. This means it tries to fallback to other zones
> +from the same node before using remote nodes which are ordered by NUMA distance.
>  
>  By default, Linux will attempt to satisfy memory allocation requests from the
>  node to which the CPU that executes the request is assigned.  Specifically,
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index fc14b8b3f6ce..b849006b20d3 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -895,8 +895,6 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
>  
>  extern int numa_zonelist_order_handler(struct ctl_table *, int,
>  			void __user *, size_t *, loff_t *);
> -extern char numa_zonelist_order[];
> -#define NUMA_ZONELIST_ORDER_LEN 16	/* string buffer size */
>  
>  #ifndef CONFIG_NEED_MULTIPLE_NODES
>  
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 655686d546cb..0cbce40f5426 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1553,8 +1553,6 @@ static struct ctl_table vm_table[] = {
>  #ifdef CONFIG_NUMA
>  	{
>  		.procname	= "numa_zonelist_order",
> -		.data		= &numa_zonelist_order,
> -		.maxlen		= NUMA_ZONELIST_ORDER_LEN,
>  		.mode		= 0644,
>  		.proc_handler	= numa_zonelist_order_handler,
>  	},
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 80e4adb4c360..b839a90e24c8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4791,52 +4791,18 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
>  	return nr_zones;
>  }
>  
> -
> -/*
> - *  zonelist_order:
> - *  0 = automatic detection of better ordering.
> - *  1 = order by ([node] distance, -zonetype)
> - *  2 = order by (-zonetype, [node] distance)
> - *
> - *  If not NUMA, ZONELIST_ORDER_ZONE and ZONELIST_ORDER_NODE will create
> - *  the same zonelist. So only NUMA can configure this param.
> - */
> -#define ZONELIST_ORDER_DEFAULT  0
> -#define ZONELIST_ORDER_NODE     1
> -#define ZONELIST_ORDER_ZONE     2
> -
> -/* zonelist order in the kernel.
> - * set_zonelist_order() will set this to NODE or ZONE.
> - */
> -static int current_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -static char zonelist_order_name[3][8] = {"Default", "Node", "Zone"};
> -
> -
>  #ifdef CONFIG_NUMA
> -/* The value user specified ....changed by config */
> -static int user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -/* string for sysctl */
> -#define NUMA_ZONELIST_ORDER_LEN	16
> -char numa_zonelist_order[16] = "default";
> -
> -/*
> - * interface for configure zonelist ordering.
> - * command line option "numa_zonelist_order"
> - *	= "[dD]efault	- default, automatic configuration.
> - *	= "[nN]ode 	- order by node locality, then by zone within node
> - *	= "[zZ]one      - order by zone, then by locality within zone
> - */
>  
>  static int __parse_numa_zonelist_order(char *s)
>  {
> -	if (*s == 'd' || *s == 'D') {
> -		user_zonelist_order = ZONELIST_ORDER_DEFAULT;
> -	} else if (*s == 'n' || *s == 'N') {
> -		user_zonelist_order = ZONELIST_ORDER_NODE;
> -	} else if (*s == 'z' || *s == 'Z') {
> -		user_zonelist_order = ZONELIST_ORDER_ZONE;
> -	} else {
> -		pr_warn("Ignoring invalid numa_zonelist_order value:  %s\n", s);
> +	/*
> +	 * We used to support different zonlists modes but they turned
> +	 * out to be just not useful. Let's keep the warning in place
> +	 * if somebody still use the cmd line parameter so that we do
> +	 * not fail it silently
> +	 */
> +	if (!(*s == 'd' || *s == 'D' || *s == 'n' || *s == 'N')) {
> +		pr_warn("Ignoring unsupported numa_zonelist_order value:  %s\n", s);
>  		return -EINVAL;
>  	}
>  	return 0;
> @@ -4844,16 +4810,10 @@ static int __parse_numa_zonelist_order(char *s)
>  
>  static __init int setup_numa_zonelist_order(char *s)
>  {
> -	int ret;
> -
>  	if (!s)
>  		return 0;
>  
> -	ret = __parse_numa_zonelist_order(s);
> -	if (ret == 0)
> -		strlcpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN);
> -
> -	return ret;
> +	return __parse_numa_zonelist_order(s);
>  }
>  early_param("numa_zonelist_order", setup_numa_zonelist_order);
>  
> @@ -4864,40 +4824,22 @@ int numa_zonelist_order_handler(struct ctl_table *table, int write,
>  		void __user *buffer, size_t *length,
>  		loff_t *ppos)
>  {
> -	char saved_string[NUMA_ZONELIST_ORDER_LEN];
> +	char *str;
>  	int ret;
> -	static DEFINE_MUTEX(zl_order_mutex);
>  
> -	mutex_lock(&zl_order_mutex);
> -	if (write) {
> -		if (strlen((char *)table->data) >= NUMA_ZONELIST_ORDER_LEN) {
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -		strcpy(saved_string, (char *)table->data);
> +	if (!write) {
> +		int len = sizeof("Node");
> +		if (copy_to_user(buffer, "Node", len))
> +			return -EFAULT;
> +		return len;
>  	}
> -	ret = proc_dostring(table, write, buffer, length, ppos);
> -	if (ret)
> -		goto out;
> -	if (write) {
> -		int oldval = user_zonelist_order;
>  
> -		ret = __parse_numa_zonelist_order((char *)table->data);
> -		if (ret) {
> -			/*
> -			 * bogus value.  restore saved string
> -			 */
> -			strncpy((char *)table->data, saved_string,
> -				NUMA_ZONELIST_ORDER_LEN);
> -			user_zonelist_order = oldval;
> -		} else if (oldval != user_zonelist_order) {
> -			mutex_lock(&zonelists_mutex);
> -			build_all_zonelists(NULL, NULL);
> -			mutex_unlock(&zonelists_mutex);
> -		}
> -	}
> -out:
> -	mutex_unlock(&zl_order_mutex);
> +	str = memdup_user_nul(buffer, 16);
> +	if (IS_ERR(str))
> +		return PTR_ERR(str);
> +
> +	ret = __parse_numa_zonelist_order(str);
> +	kfree(str);
>  	return ret;
>  }
>  
> @@ -5006,70 +4948,12 @@ static void build_thisnode_zonelists(pg_data_t *pgdat)
>   */
>  static int node_order[MAX_NUMNODES];
>  
> -static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
> -{
> -	int pos, j, node;
> -	int zone_type;		/* needs to be signed */
> -	struct zone *z;
> -	struct zonelist *zonelist;
> -
> -	zonelist = &pgdat->node_zonelists[ZONELIST_FALLBACK];
> -	pos = 0;
> -	for (zone_type = MAX_NR_ZONES - 1; zone_type >= 0; zone_type--) {
> -		for (j = 0; j < nr_nodes; j++) {
> -			node = node_order[j];
> -			z = &NODE_DATA(node)->node_zones[zone_type];
> -			if (managed_zone(z)) {
> -				zoneref_set_zone(z,
> -					&zonelist->_zonerefs[pos++]);
> -				check_highest_zone(zone_type);
> -			}
> -		}
> -	}
> -	zonelist->_zonerefs[pos].zone = NULL;
> -	zonelist->_zonerefs[pos].zone_idx = 0;
> -}
> -
> -#if defined(CONFIG_64BIT)
> -/*
> - * Devices that require DMA32/DMA are relatively rare and do not justify a
> - * penalty to every machine in case the specialised case applies. Default
> - * to Node-ordering on 64-bit NUMA machines
> - */
> -static int default_zonelist_order(void)
> -{
> -	return ZONELIST_ORDER_NODE;
> -}
> -#else
> -/*
> - * On 32-bit, the Normal zone needs to be preserved for allocations accessible
> - * by the kernel. If processes running on node 0 deplete the low memory zone
> - * then reclaim will occur more frequency increasing stalls and potentially
> - * be easier to OOM if a large percentage of the zone is under writeback or
> - * dirty. The problem is significantly worse if CONFIG_HIGHPTE is not set.
> - * Hence, default to zone ordering on 32-bit.
> - */
> -static int default_zonelist_order(void)
> -{
> -	return ZONELIST_ORDER_ZONE;
> -}
> -#endif /* CONFIG_64BIT */
> -
> -static void set_zonelist_order(void)
> -{
> -	if (user_zonelist_order == ZONELIST_ORDER_DEFAULT)
> -		current_zonelist_order = default_zonelist_order();
> -	else
> -		current_zonelist_order = user_zonelist_order;
> -}
> -
>  static void build_zonelists(pg_data_t *pgdat)
>  {
>  	int i, node, load;
>  	nodemask_t used_mask;
>  	int local_node, prev_node;
>  	struct zonelist *zonelist;
> -	unsigned int order = current_zonelist_order;
>  
>  	/* initialize zonelists */
>  	for (i = 0; i < MAX_ZONELISTS; i++) {
> @@ -5099,15 +4983,7 @@ static void build_zonelists(pg_data_t *pgdat)
>  
>  		prev_node = node;
>  		load--;
> -		if (order == ZONELIST_ORDER_NODE)
> -			build_zonelists_in_node_order(pgdat, node);
> -		else
> -			node_order[i++] = node;	/* remember order */
> -	}
> -
> -	if (order == ZONELIST_ORDER_ZONE) {
> -		/* calculate node order -- i.e., DMA last! */
> -		build_zonelists_in_zone_order(pgdat, i);
> +		build_zonelists_in_node_order(pgdat, node);
>  	}
>  
>  	build_thisnode_zonelists(pgdat);
> @@ -5135,11 +5011,6 @@ static void setup_min_unmapped_ratio(void);
>  static void setup_min_slab_ratio(void);
>  #else	/* CONFIG_NUMA */
>  
> -static void set_zonelist_order(void)
> -{
> -	current_zonelist_order = ZONELIST_ORDER_ZONE;
> -}
> -
>  static void build_zonelists(pg_data_t *pgdat)
>  {
>  	int node, local_node;
> @@ -5279,8 +5150,6 @@ build_all_zonelists_init(void)
>   */
>  void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
>  {
> -	set_zonelist_order();
> -
>  	if (system_state == SYSTEM_BOOTING) {
>  		build_all_zonelists_init();
>  	} else {
> @@ -5306,9 +5175,8 @@ void __ref build_all_zonelists(pg_data_t *pgdat, struct zone *zone)
>  	else
>  		page_group_by_mobility_disabled = 0;
>  
> -	pr_info("Built %i zonelists in %s order, mobility grouping %s.  Total pages: %ld\n",
> +	pr_info("Built %i zonelists, mobility grouping %s.  Total pages: %ld\n",
>  		nr_online_nodes,
> -		zonelist_order_name[current_zonelist_order],
>  		page_group_by_mobility_disabled ? "off" : "on",
>  		vm_total_pages);
>  #ifdef CONFIG_NUMA
> -- 
> 2.11.0
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>



[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]
  Powered by Linux