Re: [RFC 1/2] Protect larger order pages from breaking up

Mike Kravetz <mike.kravetz@xxxxxxxxxx> · Fri, 16 Feb 2018 10:59:19 -0800

On 02/16/2018 08:01 AM, Christoph Lameter wrote:
> Over time as the kernel is churning through memory it will break
> up larger pages and as time progresses larger contiguous allocations
> will no longer be possible. This is an approach to preserve these
> large pages and prevent them from being broken up.
> 
> This is useful for example for the use of jumbo pages and can
> satisfy various needs of subsystems and device drivers that require
> large contiguous allocations to operate properly.
> 
> The idea is to reserve a pool of pages of the required order
> so that the kernel is not allowed to use the pages for allocations
> of a different order. This is a pool that is fully integrated
> into the page allocator and therefore transparently usable.
> 
> Control over this feature is by writing to /proc/zoneinfo.
> 
> F.e. to ensure that 2000 16K pages stay available for jumbo
> frames do
> 
> 	echo "2=2000" >/proc/zoneinfo
> 
> or through the order=<page spec> on the kernel command line.
> F.e.
> 
> 	order=2=2000,4N2=500
> 
> These pages will be subject to reclaim etc as usual but will not
> be broken up.
> 
> One can then also f.e. operate the slub allocator with
> 64k pages. Specify "slub_max_order=4 slub_min_order=4" on
> the kernel command line and all slab allocator allocations
> will occur in 64K page sizes.
> 
> Note that this will reduce the memory available to the application
> in some cases. Reclaim may occur more often. If more than
> the reserved number of higher order pages are being used then
> allocations will still fail as normal.
> 
> In order to make this work just right one needs to be able to
> know the workload well enough to reserve the right amount
> of pages. This is comparable to other reservation schemes.

Yes.

I like the idea that this only comes into play as the result of explicit
user/sysadmin action.  It does remind me of hugetlbfs reservations.  So,
we hope that only people who really know their workload and know what
they are doing would use this feature.
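
For a rough sense of what such a reservation costs, here is a minimal
sketch of the arithmetic behind the "2=2000" example quoted above. It
assumes 4K base pages (which is what makes order 2 equal 16K and order 4
equal 64K); the helper names are hypothetical and not part of the patch.

```python
PAGE_SIZE = 4096  # assumption: 4K base pages, as the 16K/64K figures imply


def block_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one page of the given order (2^order base pages)."""
    return page_size << order


def reserved_bytes(order, count, page_size=PAGE_SIZE):
    """Total memory held back when 'count' pages of 'order' are reserved."""
    return count * block_bytes(order, page_size)


if __name__ == "__main__":
    # The "2=2000" example: 2000 order-2 pages.
    print(block_bytes(2) // 1024, "KiB per order-2 page")
    print(reserved_bytes(2, 2000) / 2**20, "MiB reserved in total")
```

With these assumptions the example reservation pins a bit over 31 MiB
that can no longer serve allocations of other orders, which is why
knowing the workload matters.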

> Well that f.e brings up huge pages. You can of course
> also use this to reserve those and can then be sure that
> you can dynamically resize your huge page pools even after
> a long time of system up time.

Yes, and no.  Doesn't that assume nobody else is doing allocations
of that size?  For example, I could imagine THP using huge page sized
reservations.  Then when it comes time to resize your hugetlbfs pool
there may not be enough.  Although, we may quickly split THP pages
in this case.  I am not sure.

IIRC, Guy Shattah's use case was for allocations greater than MAX_ORDER.
This would not directly address that.  A huge contiguous area (2GB) is
the 'sweet spot' for best performance in his case.  However, I think he
could still benefit from using a set of larger (such as 2MB) size
allocations which this scheme could help with.
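
To put numbers on that, here is a small sketch of the order arithmetic.
It assumes 4K base pages and MAX_ORDER = 11 (so the largest buddy
allocation is order 10); both values are assumptions on my part and
depend on architecture and configuration. The helper names are
hypothetical.

```python
PAGE_SIZE = 4096  # assumption: 4K base pages
MAX_ORDER = 11    # assumption: largest allocatable order is MAX_ORDER - 1


def order_for(size_bytes, page_size=PAGE_SIZE):
    """Smallest order whose block (2^order pages) covers size_bytes."""
    pages = -(-size_bytes // page_size)  # ceiling division
    return (pages - 1).bit_length()


def fits_page_allocator(size_bytes):
    """True if a single allocation of this size is below MAX_ORDER."""
    return order_for(size_bytes) < MAX_ORDER


def chunks_needed(total_bytes, chunk_bytes):
    """Number of chunk-sized allocations needed to cover total_bytes."""
    return -(-total_bytes // chunk_bytes)


if __name__ == "__main__":
    two_mb, two_gb = 2 << 20, 2 << 30
    print("2MB order:", order_for(two_mb), fits_page_allocator(two_mb))
    print("2GB order:", order_for(two_gb), fits_page_allocator(two_gb))
    print("2MB chunks for 2GB:", chunks_needed(two_gb, two_mb))
```

Under those assumptions a 2MB allocation is order 9 and within reach of
a reserved pool, while a single 2GB area would be order 19, so it would
have to be covered by 1024 separate 2MB allocations.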

-- 
Mike Kravetz