Re: [RFC 1/2] Protect larger order pages from breaking up

On 02/19/18 11:19, Mel Gorman wrote:

> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c
> +++ linux/mm/page_alloc.c
> @@ -1844,7 +1844,12 @@ struct page *__rmqueue_smallest(struct z
>  		area = &(zone->free_area[current_order]);
>  		page = list_first_entry_or_null(&area->free_list[migratetype],
>  							struct page, lru);
> -		if (!page)
> +		/*
> +		 * Continue if no page is found or if our freelist contains
> +		 * less than the minimum pages of that order. In that case
> +		 * we better look for a different order.
> +		 */
> +		if (!page || area->nr_free < area->min)
>  			continue;
>  		list_del(&page->lru);
>  		rmv_page_order(page);
> 
> This is surprising to say the least. Assuming reservations are at order-3,
> this would refuse to split order-3 even if there was sufficient reserved
> pages at higher orders for a reserve.

Hi Mel,

I agree with you that the above code does not really do what it should.

At least, the condition needs to be changed to:

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 76c9688b6a0a..193dfd85a6b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1837,7 +1837,15 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                area = &(zone->free_area[current_order]);
                page = list_first_entry_or_null(&area->free_list[migratetype],
                                                        struct page, lru);
-               if (!page)
+               /*
+                * Continue if no page is found or if we are about to
+                * split a truly higher order than requested.
+                * There is no limit for just _using_ exactly the right
+                * order. The limit is only for _splitting_ some
+                * higher order.
+                */
+               if (!page ||
+                   (area->nr_free < area->min && current_order > order))
                        continue;
                list_del(&page->lru);
                rmv_page_order(page);


The "&& current_order > order" part is _crucial_. If it is left out, the check can even be counter-productive. I know this from the development of my original patch some years ago.
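
To spell the rule out, here is a minimal illustrative sketch (skip_this_order() is a made-up name just for this mail, and area->min is the field introduced by your RFC; this is not meant as a drop-in replacement for the diff above):

/*
 * Sketch: how the combined condition decides whether __rmqueue_smallest()
 * should skip the current order. Equivalent to
 * "!page || (area->nr_free < area->min && current_order > order)",
 * since the loop guarantees current_order >= order.
 */
static inline bool skip_this_order(struct page *page, struct free_area *area,
				   unsigned int current_order, unsigned int order)
{
	if (!page)			/* free list of this migratetype is empty */
		return true;
	if (current_order == order)	/* exact fit: the reserve never blocks it */
		return false;
	/* splitting a truly higher order: keep the last ->min pages intact */
	return area->nr_free < area->min;
}

So an order-3 request can still take a page directly from the order-3 free list even when that list is at or below its reserve; only _splitting_ it on behalf of smaller requests is refused, and those simply fall through to the next higher order.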

Please have a look at the attached patchset for kernel 3.16, which has been in _production_ at 1&1 Internet SE on about 20,000 servers for several years now, on kernels from 3.2.x to 3.16.x (or maybe the very first version was for 2.6.32, I don't remember exactly).

It has accumulated several million operation hours in total, and it is known to work miracles for some of our workloads.

Porting to later kernels should be relatively easy. Also notice that the switch labels in patch #2 may need some minor tweaking, e.g. also including ZONE_DMA32 or similar (see the sketch below), and might also need some architecture-specific tweaking. All of this tweaking depends on the actual workload. I am using it only on datacenter servers (webhosting) and on x86_64.
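
For example, extending the zone check from patch #2 to also cover ZONE_DMA32 could look roughly like this (an untested sketch only; the CONFIG_ZONE_DMA32 guard is needed because that zone index only exists on architectures providing the zone):

static inline
bool is_perorder_zone_enabled(struct zone *zone)
{
	switch (zone_idx(zone)) {
	case ZONE_NORMAL:
#ifdef CONFIG_ZONE_DMA32
	case ZONE_DMA32:	/* e.g. when drivers eat higher-order pages for 32-bit DMA */
#endif
		return true;
	default:
		return false;
	}
}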

Please notice that the user interface of my patchset is extremely simple and can be easily understood by junior sysadmins:

After running your box for several days or weeks or even months (or possibly, after you just got an OOM), just do
# cat /proc/sys/vm/perorder_statistics > /etc/defaults/my_perorder_reserve

Then add a trivial startup script, e.g. for systemd or sysv init, which just does the following early during the next boot:
# cat /etc/defaults/my_perorder_reserve > /proc/sys/vm/perorder_reserve

That's it.

No need for a deep understanding of the theory of the memory fragmentation problem.

Also there is no need for adding anything to the boot command line. Fragmentation will typically occur only after some days or weeks or months of operation, at least in all of the practical cases I have personally seen at 1&1 datacenters and with their workloads.

Please notice that fragmentation can be a very serious problem for operations if you are hurt by it. It can seriously harm your business. And it is _extremely_ specific to the actual workload, and to the hardware / chipset / etc. This is addressed by the above method of determining the right values from _actual_ operations (not from speculation) and then memoizing them.

The attached patchset tries to be very simple, but in my experience it is a very effective practical solution.

If requested, I can post the mathematical theory behind the patch, or I could give a presentation at one of the next conferences if I were invited (or better, give a practical explanation instead). But probably nobody on these lists wants to deal with any theories.

Just _play_ with the patchset practically, and then you will notice.

Cheers and greetings,

Yours sincerely, old-school hacker Thomas


P.S. I cannot follow these lists full-time due to my workload at 1&1, which is unfortunately not designed for upstream hacking, so please be patient with me if an answer takes a few days.


From ba501464c81e50ccb91a584f124d3bff6c0eccbd Mon Sep 17 00:00:00 2001
From: Thomas Schoebel-Theuer <schoebel@xxxxxxxxx>
Date: Wed, 6 Mar 2013 10:45:09 +0100
Subject: [PATCH 1/4] mm: fix fragmentation by pre-reserving higher-order pages

From the literature about buddy system allocators, it is
known that such systems have conceptual problems with
de-fragmentation of higher-order pages. Once a very high-order
page has been split into smaller pieces, the _probability_
that this particular page can be re-assembled at some later
point in time becomes almost zero.

For example, the probability that an order-10 page can be re-assembled
from 1024 single pieces, each having order 0,
is _extremely_ small, because each of them has a uniquely pre-defined
position in physical memory, which cannot be exchanged with any
other position, and each of them has to be at least reclaimable.

The problem becomes exponentially worse with growing order.

Methods like migration types can help by lowering the base of
the exponentiation that determines the probability, but they
cannot lower the exponent as such.

It is known from the literature that this exponential behaviour is
a general drawback of buddy systems that cannot be circumvented at all,
other than by never splitting some number of reserved
higher-order pages (some variant of a pre-allocation strategy).

It is also known that pre-reservation is the only _reliable_
way for avoiding higher-order OOM if you know the maximum
number of in-use pages for each order > 0 in advance.

This simple patch introduces /proc/sys/vm/perorder_reserve
which contains an int vector describing the minimum number of
pages, for each order, which will never be split at all.

The int vector starts with order 0 and goes up to MAX_ORDER-1.
Notice that setting the number for order 0 does not make sense,
and therefore has no effect. Depending on your workload, you should
adjust the values in /proc/sys/vm/perorder_reserve right after
a fresh reboot. Increasing the values for very high-order pages
only after all memory has been used may already be too late
(doing it early is especially recommended for the server folks).
---
 include/linux/mmzone.h |  2 ++
 kernel/sysctl.c        | 10 ++++++++++
 mm/Kconfig             | 32 +++++++++++++++++++++++++++++++
 mm/page_alloc.c        | 52 ++++++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9fe9377821a5..e958040fa8a0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -920,6 +920,8 @@ int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
 int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 
+extern int sysctl_perorder_reserve[MAX_ORDER];
+
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 extern char numa_zonelist_order[];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e918a641b1a0..8231339fb9d9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1287,6 +1287,16 @@ static struct ctl_table vm_table[] = {
 		.mode		= 0644,
 		.proc_handler	= lowmem_reserve_ratio_sysctl_handler,
 	},
+#ifdef CONFIG_PERORDER_RESERVE
+	{
+		.procname	= "perorder_reserve",
+		.data		= &sysctl_perorder_reserve,
+		.maxlen		= sizeof(sysctl_perorder_reserve),
+		.mode		= 0644,
+		.proc_handler	= min_free_kbytes_sysctl_handler,
+		.extra1		= &zero,
+	},
+#endif
 	{
 		.procname	= "drop_caches",
 		.data		= &sysctl_drop_caches,
diff --git a/mm/Kconfig b/mm/Kconfig
index 3e9977a9d657..a381c879573a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -270,6 +270,38 @@ config MIGRATION
 config ARCH_ENABLE_HUGEPAGE_MIGRATION
 	boolean
 
+#
+# memory pre-reservation support
+#
+config PERORDER_RESERVE
+	bool "pre-reserve higher order pages"
+	def_bool y
+	help
+	  Avoid page migration / compaction and finally OOM situations
+	  by keeping a reserve of higher-order pages which are never split.
+	  You can tune the per-order reserve via /proc/sys/vm/perorder_reserve
+	  or the corresponding sysctl.
+	  There you will find an int vector indexed by the page order,
+	  ranging from index 0 to index MAX_ORDER-1.
+	  Notice that changing index 0 is meaningless and has no effect.
+	  But the numbers at higher indexes can drastically help if
+	  you encounter OOM problems at higher-order pages.
+	  Increasing these numbers should be done right after any fresh reboot.
+	  Later reserving is possible, but bears the risk that it may be
+	  already too late for getting large contiguous memory.
+	  Theory says that this kind of pre-reservation is the only
+	  _reliable_ method for avoiding higher-order OOM of the
+          central buddy allocator, if you know for each order > 0 in advance
+	  the maximum number of pages which are in use at any time.
+	  In order to determine such maximum numbers, while not unnecessarily
+	  wasting too much space, you may have to experiment with those
+	  numbers for some time (typically memory fragmentation will show
+	  up only after a few weeks or even months of high server load).
+	  In general, the optimum numbers depend not only on the load, but
+	  also on the hardware / chipsets and their higher-order
+          allocation requirements.
+	  Notice: this works independently from min_free_kbytes and friends.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c1d0eec78ba..36a4260bcd2c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -190,6 +190,15 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 	 32,
 };
 
+#ifdef CONFIG_PERORDER_RESERVE
+/*
+ * Default values for the perorder reserve.
+ */
+int sysctl_perorder_reserve[MAX_ORDER] = {
+};
+EXPORT_SYMBOL_GPL(sysctl_perorder_reserve);
+#endif
+
 EXPORT_SYMBOL(totalram_pages);
 
 static char * const zone_names[MAX_NR_ZONES] = {
@@ -938,6 +947,29 @@ static int prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags)
 }
 
 /*
+ * Test whether the per-order reserve would be used.
+ * The per-order reserve protects higher-order pages from being split
+ * by lower-order requests.
+ * This means, the per-order reserve can only be used for allocations
+ * having _exactly_ the originally requested order.
+ */
+static inline
+bool __is_order_reserved(struct zone *zone,
+			unsigned int orig_order,
+			unsigned int current_order,
+			int migratetype)
+{
+#ifdef CONFIG_PERORDER_RESERVE
+	if (current_order != orig_order) {
+		long nr_free = zone->free_area[current_order].nr_free;
+		long allowed = sysctl_perorder_reserve[current_order];
+		return nr_free <= allowed;
+	}
+#endif
+	return false;
+}
+
+/*
  * Go through the free lists for the given migratetype and remove
  * the smallest available page from the freelists
  */
@@ -952,7 +984,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	/* Find a page of the appropriate size in the preferred list */
 	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
 		area = &(zone->free_area[current_order]);
-		if (list_empty(&area->free_list[migratetype]))
+		if (list_empty(&area->free_list[migratetype]) ||
+		    __is_order_reserved(zone, order, current_order, migratetype))
 			continue;
 
 		page = list_entry(area->free_list[migratetype].next,
@@ -1140,7 +1173,8 @@ __rmqueue_fallback(struct zone *zone, unsigned int order, int start_migratetype)
 				break;
 
 			area = &(zone->free_area[current_order]);
-			if (list_empty(&area->free_list[migratetype]))
+			if (list_empty(&area->free_list[migratetype]) ||
+			    __is_order_reserved(zone, order, current_order, migratetype))
 				continue;
 
 			page = list_entry(area->free_list[migratetype].next,
@@ -5677,9 +5711,21 @@ static void __setup_per_zone_wmarks(void)
 {
 	unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10);
 	unsigned long lowmem_pages = 0;
+	unsigned long pages_min_increase = 0;
 	struct zone *zone;
 	unsigned long flags;
 
+#ifdef CONFIG_PERORDER_RESERVE
+	unsigned long base = 1;
+	int i;
+
+	/* take sysctl_perorder_reserve[] into account */
+	for (i = 0; i < MAX_ORDER; i++) {
+		pages_min_increase += sysctl_perorder_reserve[i] * base;
+		base <<= 1;
+	}
+#endif
+
 	/* Calculate total number of !ZONE_HIGHMEM pages */
 	for_each_zone(zone) {
 		if (!is_highmem(zone))
@@ -5692,6 +5738,7 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
+		tmp += pages_min_increase;
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -5706,6 +5753,7 @@ static void __setup_per_zone_wmarks(void)
 
 			min_pages = zone->managed_pages / 1024;
 			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
+			min_pages += pages_min_increase;
 			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
-- 
2.12.3

From f00f5292df0b2e7934ae45c603bfb78f9012da0d Mon Sep 17 00:00:00 2001
From: Thomas Schoebel-Theuer <tst@xxxxxxxxxxxxxxxxxx>
Date: Sun, 11 Feb 2018 11:04:14 +0100
Subject: [PATCH 2/4] mm: restrict per-order reservation to particular zones

Other than for testing, pre-reservation of higher-order pages
should only be applied to those memory zones where it is
really needed for avoiding memory fragmentation.

This patch restricts it to ZONE_NORMAL.
Other strategies can easily be added later.
---
 mm/page_alloc.c | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 36a4260bcd2c..475b0d9b4f34 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -197,6 +197,32 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 int sysctl_perorder_reserve[MAX_ORDER] = {
 };
 EXPORT_SYMBOL_GPL(sysctl_perorder_reserve);
+
+static inline
+bool is_perorder_zone_enabled(struct zone *zone)
+{
+	int type = zone_idx(zone);
+
+	/*
+	 * DISCUSS: the perorder pre-reservation is typically not needed
+	 * in all memory zones.
+	 * Because memory load patterns may change over decades, and
+	 * because there might be special use cases requiring different
+	 * treatment, here is an opportunity for future fine tuning.
+	 * It would be possible to refine this even more, but a full
+	 * cartesian product between page orders and all zones could
+	 * get quite complicated for sysadmins or even end users.
+	 * Don't make it too complex.
+	 */
+	switch (type) {
+	case ZONE_NORMAL:
+		return true;
+	default:
+		return false;
+	}
+}
+#else
+#define is_perorder_zone_enabled(zone) false
 #endif
 
 EXPORT_SYMBOL(totalram_pages);
@@ -960,7 +986,7 @@ bool __is_order_reserved(struct zone *zone,
 			int migratetype)
 {
 #ifdef CONFIG_PERORDER_RESERVE
-	if (current_order != orig_order) {
+	if (current_order != orig_order && is_perorder_zone_enabled(zone)) {
 		long nr_free = zone->free_area[current_order].nr_free;
 		long allowed = sysctl_perorder_reserve[current_order];
 		return nr_free <= allowed;
@@ -5738,7 +5764,8 @@ static void __setup_per_zone_wmarks(void)
 		spin_lock_irqsave(&zone->lock, flags);
 		tmp = (u64)pages_min * zone->managed_pages;
 		do_div(tmp, lowmem_pages);
-		tmp += pages_min_increase;
+		if (is_perorder_zone_enabled(zone))
+			tmp += pages_min_increase;
 		if (is_highmem(zone)) {
 			/*
 			 * __GFP_HIGH and PF_MEMALLOC allocations usually don't
@@ -5753,7 +5780,8 @@ static void __setup_per_zone_wmarks(void)
 
 			min_pages = zone->managed_pages / 1024;
 			min_pages = clamp(min_pages, SWAP_CLUSTER_MAX, 128UL);
-			min_pages += pages_min_increase;
+			if (is_perorder_zone_enabled(zone))
+				min_pages += pages_min_increase;
 			zone->watermark[WMARK_MIN] = min_pages;
 		} else {
 			/*
-- 
2.12.3

From 7208504d46e88aa63731c113ccec0888e01cc256 Mon Sep 17 00:00:00 2001
From: Thomas Schoebel-Theuer <schoebel@xxxxxxxxx>
Date: Fri, 8 Mar 2013 11:58:28 +0100
Subject: [PATCH 3/4] mm: collect per-order statistics

Make the actual allocations transparent to the end user.

This patch adds /proc/sys/vm/perorder_statistics and /proc/sys/vm/perorder_inuse,
showing the maximum per-order allocation peaks since bootup as well as
the current per-order allocations.

Userspace applications can use this for automated fine tuning.
---
 include/linux/mmzone.h |  2 ++
 kernel/sysctl.c        | 16 ++++++++++++++++
 mm/Kconfig             | 17 +++++++++++++++++
 mm/page_alloc.c        | 39 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 74 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e958040fa8a0..1a96f3d6d290 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -921,6 +921,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
 
 extern int sysctl_perorder_reserve[MAX_ORDER];
+extern int sysctl_perorder_statistics[MAX_ORDER];
+extern int sysctl_perorder_inuse[MAX_ORDER];
 
 extern int numa_zonelist_order_handler(struct ctl_table *, int,
 			void __user *, size_t *, loff_t *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 8231339fb9d9..eb1b76def032 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1297,6 +1297,22 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 #endif
+#ifdef CONFIG_PERORDER_STATISTICS
+	{
+		.procname	= "perorder_statistics",
+		.data		= &sysctl_perorder_statistics,
+		.maxlen		= sizeof(sysctl_perorder_statistics),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec_minmax,
+	},
+	{
+		.procname	= "perorder_inuse",
+		.data		= &sysctl_perorder_inuse,
+		.maxlen		= sizeof(sysctl_perorder_inuse),
+		.mode		= 0444,
+		.proc_handler	= proc_dointvec_minmax,
+	},
+#endif
 	{
 		.procname	= "drop_caches",
 		.data		= &sysctl_drop_caches,
diff --git a/mm/Kconfig b/mm/Kconfig
index a381c879573a..82878697f2fd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -302,6 +302,23 @@ config PERORDER_RESERVE
           allocation requirements.
 	  Notice: this works independently from min_free_kbytes and friends.
 
+config PERORDER_STATISTICS
+	bool "usage statistics on higher order pages"
+	def_bool n
+	depends on PERORDER_RESERVE
+	help
+	  This can be used to determine the maximum number of higher-order
+	  page allocations occurring at runtime. Results are displayed
+	  in /proc/sys/vm/perorder_statistics .
+	  This feature adds a very slight overhead to the page allocation.
+	  If you cannot afford this, use this feature only in pre-life and
+	  pilot systems (but with real-life or near-real-life loads).
+	  After running your test system for some weeks or months, you
+	  can directly use the numbers from  /proc/sys/vm/perorder_statistics
+	  for initialization of /proc/sys/vm/perorder_reserve in
+	  real-life systems (optionally after adding some safety margins).
+	  If unsure, say N.
+
 config PHYS_ADDR_T_64BIT
 	def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 475b0d9b4f34..554c55bc6ec5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -225,6 +225,29 @@ bool is_perorder_zone_enabled(struct zone *zone)
 #define is_perorder_zone_enabled(zone) false
 #endif
 
+#ifdef CONFIG_PERORDER_STATISTICS
+
+int sysctl_perorder_statistics[MAX_ORDER] = { };
+int sysctl_perorder_inuse[MAX_ORDER] = { };
+static spinlock_t perorder_lock[MAX_ORDER];
+
+static inline
+void perorder_stat(int order, int delta)
+{
+	if (order > 0 && order < MAX_ORDER) {
+		unsigned long flags;
+
+		spin_lock_irqsave(&perorder_lock[order], flags);
+		sysctl_perorder_inuse[order] += delta;
+		if (delta > 0 && sysctl_perorder_inuse[order] > sysctl_perorder_statistics[order])
+			sysctl_perorder_statistics[order] = sysctl_perorder_inuse[order];
+		spin_unlock_irqrestore(&perorder_lock[order], flags);
+	}
+}
+#else
+#define perorder_stat(o,x) /* empty */
+#endif
+
 EXPORT_SYMBOL(totalram_pages);
 
 static char * const zone_names[MAX_NR_ZONES] = {
@@ -615,6 +638,8 @@ static inline void __free_one_page(struct page *page,
 
 	VM_BUG_ON(migratetype == -1);
 
+	perorder_stat(order, -1);
+
 	page_idx = pfn & ((1 << MAX_ORDER) - 1);
 
 	VM_BUG_ON_PAGE(page_idx & ((1 << order) - 1), page);
@@ -838,6 +863,9 @@ void __init __free_pages_bootmem(struct page *page, unsigned int order)
 	page_zone(page)->managed_pages += nr_pages;
 	set_page_refcounted(page);
 	__free_pages(page, order);
+
+	// bootmem must not be counted, so compensate for it
+	perorder_stat(order, +1);
 }
 
 #ifdef CONFIG_CMA
@@ -2161,6 +2189,7 @@ this_zone_full:
 		 * for !PFMEMALLOC purposes.
 		 */
 		page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);
+		perorder_stat(order, +1);
 
 	return page;
 }
@@ -4993,6 +5022,16 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		memmap_init(size, nid, j, zone_start_pfn);
 		zone_start_pfn += size;
 	}
+#ifdef CONFIG_PERORDER_STATISTICS
+	{
+		int order;
+		for (order = 0; order < MAX_ORDER; order++) {
+			spin_lock_init(&perorder_lock[order]);
+			sysctl_perorder_inuse[order] = 0;
+			sysctl_perorder_statistics[order] = 0;
+		}
+	}
+#endif
 }
 
 static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
-- 
2.12.3

From 90b1c30ffb1eae2c2cebb949025ed98257e91453 Mon Sep 17 00:00:00 2001
From: Thomas Schoebel-Theuer <schoebel@xxxxxxxxx>
Date: Wed, 6 Mar 2013 10:45:09 +0100
Subject: [PATCH 4/4] mm: 1&1-specific initialization values

Just for demo.
This could be left out for upstreaming.
---
 mm/page_alloc.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 554c55bc6ec5..d675f2f3241c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -193,8 +193,28 @@ int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = {
 #ifdef CONFIG_PERORDER_RESERVE
 /*
  * Default values for the perorder reserve.
+ * This is just a guess, working well for some specific servers at 1&1,
+ * while trying to reserve not too much memory for typical workstation loads.
+ * For other server hardware / applications, other values might be required.
+ * In particular, extremely high network load (e.g. from some 10Gbit interfaces)
+ * may require bumping these numbers even more, in order to reduce
+ * the number of higher-order OOM situations.
+ * TBD: dynamically adjust the startup values to different workloads
+ * (e.g. servers / workstations) and memory sizes. However, it is
+ * difficult to guess the right values in advance; in any case, you will
+ * need some longer operational experience to find them.
+ * It often depends not only on the load, but also on the hardware / chipset
+ * whether higher-order pages are allocated in masses, or not.
  */
 int sysctl_perorder_reserve[MAX_ORDER] = {
+	[1] =  3000,
+	[2] =  1500,
+	[3] =    64,
+	[4] =    32,
+	[5] =    24,
+	[6] =     4,
+	[7] =     1,
+	[8] =    96,
 };
 EXPORT_SYMBOL_GPL(sysctl_perorder_reserve);
 
-- 
2.12.3

