[CC += linux-api@xxxxxxxxxxxxxxx] Since this is a kernel-user-space API change, could you please CC linux-api@ (and also for any future iterations of this patch series). The kernel source file Documentation/SubmitChecklist notes that all Linux kernel patches that change userspace interfaces should be CCed to linux-api@xxxxxxxxxxxxxxx, so that the various parties who are interested in API changes are informed. For further information, see https://www.kernel.org/doc/man-pages/linux-api-ml.html On 02/23/2018 04:03 AM, Christoph Lameter wrote: > rfc->v1 > - Use Thomas suggestion to change the test in __rmqueue_smallest > > Over time as the kernel is churning through memory it will break > up larger pages and as time progresses larger contiguous allocations > will no longer be possible. This is an approach to preserve these > large pages and prevent them from being broken up. > > This is useful for example for the use of jumbo pages and can > satify various needs of subsystems and device drivers that require > large contiguous allocation to operate properly. > > The idea is to reserve a pool of pages of the required order > so that the kernel is not allowed to use the pages for allocations > of a different order. This is a pool that is fully integrated > into the page allocator and therefore transparently usable. > > Control over this feature is by writing to /proc/zoneinfo. > > F.e. to ensure that 2000 16K pages stay available for jumbo > frames do > > echo "3=2000" >/proc/zoneinfo Huh, that's rather weird interface to use. Writing to a general statistics/info file for such specific functionality? Please no. > or throught the order=<page spec> on the kernel command line. > F.e. > > order=3=2000,4N2=500 > > These pages will be subject to reclaim etc as usual but will not > be broken up. > > One can then also f.e. operate the slub allocator with > 64k pages. Specify "slub_max_order=3 slub_min_order=3" on > the kernel command line and all slab allocator allocations > will occur in 32K page sizes. > > Note that this will reduce the memory available to the application > in some cases. Reclaim may occur more often. If more than > the reserved number of higher order pages are being used then > allocations will still fail as normal. > > In order to make this work just right one needs to be able to > know the workload well enough to reserve the right amount > of pages. This is comparable to other reservation schemes. > > Well that f.e brings up huge pages. You can of course > also use this to reserve those and can then be sure that > you can dynamically resize your huge page pools even after > a long time of system up time. > > The idea for this patch came from Thomas Schoebel-Theuer whom I met > at the LCA and who described the approach to me promising > a patch that would do this. Sadly he has vanished somehow. > However, he has been using this approach to support a > production environment for numerous years. > > So I redid his patch and this is the first draft of it. > > > Idea-by: Thomas Schoebel-Theuer <tst@xxxxxxxxxxxxxxxxxx> > > First performance tests in a virtual enviroment show > a hackbench improvement by 6% just by increasing > the page size used by the page allocator. That's IMHO a rather weak justification for introducing a new userspace API. What exactly has been set where? Could similar results be achieved by tuning highatomic reservers and/or min_free_kbytes? I especially wonder how much of the effects come from the associated watermarks adjustment (which can be affected by min_free_kbytes) and what is due to __rmqueue_smallest() changes. You changed the __rmqueue_smallest() condition since RFC per Thomas suggestion, but report the same results? > Signed-off-by: Christopher Lameter <cl@xxxxxxxxx> > > Index: linux/include/linux/mmzone.h > =================================================================== > --- linux.orig/include/linux/mmzone.h > +++ linux/include/linux/mmzone.h > @@ -96,6 +96,11 @@ extern int page_group_by_mobility_disabl > struct free_area { > struct list_head free_list[MIGRATE_TYPES]; > unsigned long nr_free; > + /* We stop breaking up pages of this order if less than > + * min are available. At that point the pages can only > + * be used for allocations of that particular order. > + */ > + unsigned long min; > }; > > struct pglist_data; > Index: linux/mm/page_alloc.c > =================================================================== > --- linux.orig/mm/page_alloc.c > +++ linux/mm/page_alloc.c > @@ -1848,8 +1848,15 @@ struct page *__rmqueue_smallest(struct z > area = &(zone->free_area[current_order]); > page = list_first_entry_or_null(&area->free_list[migratetype], > struct page, lru); > - if (!page) > + /* > + * Continue if no page is found or if our freelist contains > + * less than the minimum pages of that order. In that case > + * we better look for a different order. > + */ > + if (!page || (area->nr_free < area->min > + && current_order > order)) For watermarks we have various situations when we let a critical allocation bypass them to some extent, but this is a strict condition. That's potential for regressions. Well, also not a fan of this patch, TBH. It's rather ad-hoc and not backed up with results. Aside from the above points, I agree with the objections of others for the RFC posting. It's also rather awkward that watermarks are increased per the reservations, but when the reservations are "consumed" (nr_free < min && current_order == order), the increased watermarks are untouched. IMHO this further enlarges the effects of purely adjusted watermarks by this patch. Vlastimil (leaving the rest of quoted mail for linux-api readers) > continue; > + > list_del(&page->lru); > rmv_page_order(page); > area->nr_free--; > @@ -5194,6 +5201,57 @@ static void build_zonelists(pg_data_t *p > > #endif /* CONFIG_NUMA */ > > +int set_page_order_min(int node, int order, unsigned min) > +{ > + int i, o; > + long min_pages = 0; /* Pages already reserved */ > + long managed_pages = 0; /* Pages managed on the node */ > + struct zone *last = NULL; > + unsigned remaining; > + > + /* > + * Determine already reserved memory for orders > + * plus the total of the pages on the node > + */ > + for (i = 0; i < MAX_NR_ZONES; i++) { > + struct zone *z = &NODE_DATA(node)->node_zones[i]; > + if (managed_zone(z)) { > + for (o = 0; o < MAX_ORDER; o++) { > + if (o != order) > + min_pages += z->free_area[o].min << o; > + > + } > + managed_pages += z->managed_pages; > + } > + } > + > + if (min_pages + (min << order) > managed_pages / 2) > + return -ENOMEM; > + > + /* Set the min values for all zones on the node */ > + remaining = min; > + for (i = 0; i < MAX_NR_ZONES; i++) { > + struct zone *z = &NODE_DATA(node)->node_zones[i]; > + if (managed_zone(z)) { > + u64 tmp; > + > + tmp = (u64)z->managed_pages * (min << order); > + do_div(tmp, managed_pages); > + tmp >>= order; > + z->free_area[order].min = tmp; > + > + last = z; > + remaining -= tmp; > + } > + } > + > + /* Deal with rounding errors */ > + if (remaining && last) > + last->free_area[order].min += remaining; > + > + return 0; > +} > + > /* > * Boot pageset table. One per cpu which is going to be used for all > * zones and all nodes. The parameters will be set in such a way > @@ -5428,6 +5486,7 @@ static void __meminit zone_init_free_lis > for_each_migratetype_order(order, t) { > INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); > zone->free_area[order].nr_free = 0; > + zone->free_area[order].min = 0; > } > } > > @@ -7002,6 +7061,7 @@ static void __setup_per_zone_wmarks(void > unsigned long pages_min = min_free_kbytes >> (PAGE_SHIFT - 10); > unsigned long lowmem_pages = 0; > struct zone *zone; > + int order; > unsigned long flags; > > /* Calculate total number of !ZONE_HIGHMEM pages */ > @@ -7016,6 +7076,10 @@ static void __setup_per_zone_wmarks(void > spin_lock_irqsave(&zone->lock, flags); > tmp = (u64)pages_min * zone->managed_pages; > do_div(tmp, lowmem_pages); > + > + for (order = 0; order < MAX_ORDER; order++) > + tmp += zone->free_area[order].min << order; > + > if (is_highmem(zone)) { > /* > * __GFP_HIGH and PF_MEMALLOC allocations usually don't > Index: linux/mm/vmstat.c > =================================================================== > --- linux.orig/mm/vmstat.c > +++ linux/mm/vmstat.c > @@ -27,6 +27,7 @@ > #include <linux/mm_inline.h> > #include <linux/page_ext.h> > #include <linux/page_owner.h> > +#include <linux/ctype.h> > > #include "internal.h" > > @@ -1614,6 +1615,11 @@ static void zoneinfo_show_print(struct s > zone_numa_state_snapshot(zone, i)); > #endif > > + for (i = 0; i < MAX_ORDER; i++) > + if (zone->free_area[i].min) > + seq_printf(m, "\nPreserve %lu pages of order %d from breaking up.", > + zone->free_area[i].min, i); > + > seq_printf(m, "\n pagesets"); > for_each_online_cpu(i) { > struct per_cpu_pageset *pageset; > @@ -1641,6 +1647,122 @@ static void zoneinfo_show_print(struct s > seq_putc(m, '\n'); > } > > +static int __order_protect(char *p) > +{ > + char c; > + > + do { > + int order = 0; > + int pages = 0; > + int node = 0; > + int rc; > + > + /* Syntax <order>[N<node>]=number */ > + if (!isdigit(*p)) > + return -EFAULT; > + > + while (true) { > + c = *p++; > + > + if (!isdigit(c)) > + break; > + > + order = order * 10 + c - '0'; > + } > + > + /* Check for optional node specification */ > + if (c == 'N') { > + if (!isdigit(*p)) > + return -EFAULT; > + > + while (true) { > + c = *p++; > + if (!isdigit(c)) > + break; > + node = node * 10 + c - '0'; > + } > + } > + > + if (c != '=') > + return -EINVAL; > + > + if (!isdigit(*p)) > + return -EINVAL; > + > + while (true) { > + c = *p++; > + if (!isdigit(c)) > + break; > + pages = pages * 10 + c - '0'; > + } > + > + if (order == 0 || order >= MAX_ORDER) > + return -EINVAL; > + > + if (!node_online(node)) > + return -ENOSYS; > + > + rc = set_page_order_min(node, order, pages); > + if (rc) > + return rc; > + > + } while (c == ','); > + > + if (c) > + return -EINVAL; > + > + setup_per_zone_wmarks(); > + > + return 0; > +} > + > +/* > + * Writing to /proc/zoneinfo allows to setup the large page breakup > + * protection. > + * > + * Syntax: > + * <order>[N<node>]=<number>{,<order>[N<node>]=<number>} > + * > + * F.e. Protecting 500 pages of order 2 (16K on intel) and 300 of > + * order 4 (64K) on node 1 > + * > + * echo "2=500,4N1=300" >/proc/zoneinfo > + * > + */ > +static ssize_t zoneinfo_write(struct file *file, const char __user *buffer, > + size_t count, loff_t *ppos) > +{ > + char zinfo[200]; > + int rc; > + > + if (count > sizeof(zinfo)) > + return -EINVAL; > + > + if (copy_from_user(zinfo, buffer, count)) > + return -EFAULT; > + > + zinfo[count - 1] = 0; > + > + rc = __order_protect(zinfo); > + > + if (rc) > + return rc; > + > + return count; > +} > + > +static int order_protect(char *s) > +{ > + int rc; > + > + rc = __order_protect(s); > + if (rc) > + printk("Invalid order=%s rc=%d\n",s, rc); > + > + return 1; > +} > +__setup("order=", order_protect); > + > /* > * Output information about zones in @pgdat. All zones are printed regardless > * of whether they are populated or not: lowmem_reserve_ratio operates on the > @@ -1672,6 +1794,7 @@ static const struct file_operations zone > .read = seq_read, > .llseek = seq_lseek, > .release = seq_release, > + .write = zoneinfo_write, > }; > > enum writeback_stat_item { > @@ -2016,7 +2139,7 @@ void __init init_mm_internals(void) > proc_create("buddyinfo", 0444, NULL, &buddyinfo_file_operations); > proc_create("pagetypeinfo", 0444, NULL, &pagetypeinfo_file_operations); > proc_create("vmstat", 0444, NULL, &vmstat_file_operations); > - proc_create("zoneinfo", 0444, NULL, &zoneinfo_file_operations); > + proc_create("zoneinfo", 0644, NULL, &zoneinfo_file_operations); > #endif > } > > Index: linux/include/linux/gfp.h > =================================================================== > --- linux.orig/include/linux/gfp.h > +++ linux/include/linux/gfp.h > @@ -543,6 +543,7 @@ void drain_all_pages(struct zone *zone); > void drain_local_pages(struct zone *zone); > > void page_alloc_init_late(void); > +int set_page_order_min(int node, int order, unsigned min); > > /* > * gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what > -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html