On Tue, Aug 06, 2013 at 12:08:38PM -0400, Johannes Weiner wrote: > Okay, bear with me here: > > After an allocation, the watermark has to be met, all available pages > considered. That is reasonable because we don't want to deadlock and > order-0 pages can be served from any page block. > > Now say we have an order-2 allocation: in addition to the order-0 view > on the watermark, after the allocation a quarter of the watermark has > to be met with order-2 pages and up. Okay, maybe we always want a few > < PAGE_ALLOC_COSTLY_ORDER pages at our disposal, who knows. > > Then it kind of sizzles out towards higher order pages but it always > requires the remainder to be positive, i.e. at least one page at the > checked order available. Only the actually requested order is not > checked, so for an order-9 we only make sure that we could serve at > least one order-8 page, maybe more depending on the zone size. > > You're proposing to check at least for > > (watermark - min_watermark) >> (pageblock_order >> 2) > > worth of order-8 pages instead. worth of order 9 plus all higher order than 9, pages. > > How does any of this make any sense? The loop removes from the total number of free pages, all free pages in the too-low buddy allocator orders to be relevant for an order 9 allocation (order from 0 to 8 range), so the check is against order 9 pages or higher order of free memory. But it's still good to review this in detail, as this is one tricky bit of math. The loop independently of the above (to correcting the free_pages to only account order 9 or higher), scales down the watermark that we'll check the corrected free_pages level against. With this change now we stop the scaling down after we reach o >= 2. (pageblock_order >> 2 == 2). So we remain identical and we still scale down (shift right) once for order 1, and we scale down twice for order 2 like before. But for all higher orders we scale down the maximum twice and not more: (mark - min) >> 2 And we set to 0 the wmark of high order pages required for the ALLOC_WMARK_MIN. That means the MIN wmark will succeed and stop compaction if there's 1 page of the right order, not more than that. (free_pages <= min will not succeed the allocation, which means free_pages <= 0 because min is 0, and if instead it's > 0 it means at least 1 order 9 page has been generated which is what we need at the MIN level) About the ALLOC_WMARK_LOW, my current normal zone has min 11191 and low 13988. So the amount of free memory in order9 or higher zone required for the ALLOC_WMARK_LOW to succeed in compaction now becomes ((13988-11191)>>2)*4096 = 2.7MB. That means "free_pages > 2.7MB" will succeed with order 9 or higher, only with at least 2 pages generated in compaction, to have at least a bit more of margin than for min wmark and for concurrent allocations that could split or allocate the order 9 page generated. 2 pages better than one, and it seems to make a measurable difference in the allocation success rate here with all threads allocating from all CPUs at the same time. Before ALLOC_WMARK_LOW check for order 9 or higher was set at (13988 >> 9)*4096 = 110KB (same as zero, requiring only one order 9 page), and ALLOC_WMARK_MIN was set at (11191 >> 9)*4096 = 86KB (again same as zero, requiring just 1 page). In short this should allow at least 2 pages free at LOW for order 9 allocations, and still 1 for MIN like before. But it should be mostly beneficial in reducing compaction overhead for the lower orders than 9, where previously we would stop compaction much later for the MIN and for the LOW, despite nobody (except a bit atomic allocations) could access to the pages below the MIN. We were working for nothing. And potentially we did more work but we had fewer pages usable than we do now in between LOW and MIN that we could really allocate as high order pages. I could have used PAGE_ALLOC_COSTLY_ORDER instead of "pageblock_order >> 2" but I seen no obvious connection between this value and the fact the allocation is more costly. Keeping it in function of the pageblock_order I thought was better because if the pageblock order is bigger than 9 (signaling a very big hugepage), we keep scaling down a little more to avoid generating too many of very huge ones. It's not black and white stuff anyway but it felt better than PAGE_ALLOC_COSTLY_ORDER. To be sure, I would suggest to write a script that prints the low and min wmarks to check against the high order free_pages, given the standard low/min wmarks, for all orders of allocations from 0 to MAX_ORDER, with new and old code. And to run it with different min/low wmarks as generated by default on very different zone sizes. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>