Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64

On Tue, Mar 30, 2021 at 11:02:07AM -0700, Roman Gushchin wrote:
> On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> > On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> > > I actually ran a large-scale experiment (on tens of thousands of machines) over the last
> > > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
> > 
> > Thanks for the information. I finally have time to come back to this. Do you mind sharing
> > the total memory of these machines? I want to get some idea of the scale of the issue so
> > that I can reproduce it on a comparable machine. Are you trying to get <20% of 10s of GBs,
> > 100s of GBs, or TBs of memory?
> 
> There are different configurations, but in general they are in the 100s of GB or smaller.

Are you using ZONE_MOVABLE?  Seeing /proc/buddyinfo from one of these
machines might be illuminating.
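
For instance, something like the userspace sketch below (mine, purely
for illustration) prints, per zone, the highest order that still has
free blocks.  Keep in mind that buddyinfo only goes up to MAX_ORDER-1
(typically order 10, i.e. 4MB with 4KB pages), far below the order-18
blocks a 1GB PUD needs, but it still shows how quickly the high orders
empty out:

/*
 * Userspace sketch, mine and purely illustrative: read /proc/buddyinfo
 * and report, per zone, the highest order that still has free blocks.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/buddyinfo", "r");
	char line[512];

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		char zone[32];
		long count;
		int node, used, order = 0, highest = -1;
		char *p = line;

		if (sscanf(p, "Node %d, zone %31s%n", &node, zone, &used) != 2)
			continue;
		p += used;

		/* remaining columns are free-block counts for order 0, 1, ... */
		while (sscanf(p, "%ld%n", &count, &used) == 1) {
			if (count > 0)
				highest = order;
			p += used;
			order++;
		}

		printf("node %d zone %-8s highest order with free blocks: %d\n",
		       node, zone, highest);
	}

	fclose(f);
	return 0;
}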

> > 
> > >
> > > My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > > Without CMA, the chances of success drop to 0% very fast after reboot, and even manual
> > > manipulations like shutting down all workloads, dropping caches, calling sync, triggering
> > > compaction, etc. do not help much. Sometimes you can allocate maybe 1-2 pages, but that's
> > > about it.
> > 
> > Is there a way of replicating such an environment with publicly available software?
> > I really want to understand the root cause and am willing to work on a possible solution.
> > It would be much easier if I could reproduce this locally.
> 
> There is nothing fb-specific: once memory is filled with anon/pagecache, any subsequent
> allocations of non-movable memory (slabs, percpu, etc.) will fragment it. The pageblock
> mechanism prevents fragmentation at the 2MB scale, but nothing prevents fragmentation at
> the 1GB scale. It's just a matter of runtime (and the number of mm operations).
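
Right, and it bites quickly.  Here is a toy userspace model (entirely
mine, not kernel code) of what you describe: memory split into 2MB
pageblocks, unmovable allocations landing in random pageblocks over
time, and a 1GB chunk counted as lost as soon as any one of its 512
pageblocks holds unmovable memory:

/*
 * Toy model (mine, not kernel code) of the fragmentation described
 * above: memory is divided into 2MB pageblocks, and, pessimistically,
 * each new unmovable allocation claims a fresh random pageblock.  A
 * 1GB chunk is unusable once any of its 512 pageblocks is unmovable.
 */
#include <stdio.h>
#include <stdlib.h>

#define PAGEBLOCKS_PER_GB	512			/* 1GB / 2MB */
#define TOTAL_GB		100
#define TOTAL_PAGEBLOCKS	(TOTAL_GB * PAGEBLOCKS_PER_GB)

static char unmovable[TOTAL_PAGEBLOCKS];

/* Count 1GB-aligned chunks with no unmovable pageblock in them. */
static int assemblable_1gb_chunks(void)
{
	int gb, pb, n = 0;

	for (gb = 0; gb < TOTAL_GB; gb++) {
		for (pb = 0; pb < PAGEBLOCKS_PER_GB; pb++)
			if (unmovable[gb * PAGEBLOCKS_PER_GB + pb])
				break;
		if (pb == PAGEBLOCKS_PER_GB)
			n++;
	}
	return n;
}

int main(void)
{
	int step;

	srand(1);
	for (step = 1; step <= 1000; step++) {
		/* one more pageblock claimed by an unmovable allocation */
		unmovable[rand() % TOTAL_PAGEBLOCKS] = 1;
		if (step % 100 == 0)
			printf("%4d unmovable allocations: %3d/%d 1GB chunks left\n",
			       step, assemblable_1gb_chunks(), TOTAL_GB);
	}
	return 0;
}

With a few hundred randomly placed unmovable pageblocks across 100GB,
almost no 1GB-aligned chunk survives, which matches your "matter of
runtime" observation.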

I think this is somewhere the buddy allocator could be improved.
Of course, it knows nothing of page orders as large as 1GB (which needs
to be fixed), but in general I would like it to do a better job of
segregating movable and unmovable allocations.

Let's take a machine with 100GB of memory as an example.  Ideally,
unmovable allocations would start at 4GB (assuming below 4GB is
ZONE_DMA32).  Movable allocations can go anywhere in memory, but
should avoid being "near" unmovable allocations; perhaps they start
at 5GB.  When unmovable allocations get up to 5GB, we should first exert
a bit of pressure to shrink the unmovable allocations (looking at you,
dcache), but eventually we'll need to grow the unmovable region above
5GB, and at that point we should migrate, say, all the movable pages
between 5GB and 5GB+1MB out of the way.  If the unmovable allocation
turns out to be temporary, we get that 1MB block back intact once it is
freed.  If it is permanent, we now have 1MB of memory to soak up the
next few unmovable allocations.

The model I'm thinking of here is that we have a "line" in memory that
divides movable and unmovable allocations.  It can move up, but there
has to be significant memory pressure to do so.
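
To make that concrete, here is a compilable userspace sketch of the
policy (the names, the 1MB step and the numbers are mine, purely for
illustration, not a proposed kernel interface): unmovable allocations
are only satisfied below the line, and the line advances one step at a
time, only after the shrinkers have had a chance, by migrating the
movable pages out of the step being claimed.

/*
 * Compilable sketch of the "line" model above; all names and numbers
 * are illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

#define STEP_MB	1	/* how far the line moves each time */

struct line_model {
	unsigned long base_mb;		/* bottom of the unmovable region (4GB here) */
	unsigned long line_mb;		/* boundary: unmovable below, movable above */
	unsigned long end_mb;		/* top of memory */
	unsigned long unmovable_mb;	/* unmovable memory currently in use */
};

/* Stand-in for shrinking dcache etc.; returns MB of unmovable memory freed. */
static unsigned long shrink_unmovable(struct line_model *m)
{
	(void)m;
	return 0;	/* assume nothing was reclaimable, to force the line up */
}

static void migrate_movable_range(unsigned long start_mb, unsigned long len_mb)
{
	printf("  migrate movable pages out of [%lu MB, %lu MB)\n",
	       start_mb, start_mb + len_mb);
}

static bool alloc_unmovable(struct line_model *m, unsigned long mb)
{
	while (m->base_mb + m->unmovable_mb + mb > m->line_mb) {
		unsigned long freed = shrink_unmovable(m);

		if (freed) {			/* pressure worked, no need to move */
			m->unmovable_mb -= freed;
			continue;
		}
		if (m->line_mb + STEP_MB > m->end_mb)
			return false;		/* genuinely out of memory */
		migrate_movable_range(m->line_mb, STEP_MB);
		m->line_mb += STEP_MB;
		printf("  line is now at %lu MB\n", m->line_mb);
	}
	m->unmovable_mb += mb;
	return true;
}

int main(void)
{
	/* 100GB machine: unmovable allocations start at 4GB, movable at 5GB. */
	struct line_model m = {
		.base_mb = 4 * 1024,
		.line_mb = 5 * 1024,
		.end_mb = 100 * 1024,
		.unmovable_mb = 0,
	};
	bool ok;

	ok = alloc_unmovable(&m, 512);	/* fits below the existing line */
	printf("alloc 512MB unmovable: %s\n", ok ? "ok" : "fail");

	ok = alloc_unmovable(&m, 515);	/* pushes the line up by a few MB */
	printf("alloc 515MB unmovable: %s\n", ok ? "ok" : "fail");

	return 0;
}

The key property is that the line only moves when shrinking fails, so a
burst of temporary unmovable allocations does not end up scattering
unmovable pages throughout the movable region.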