Re: [Chapter One] THP zones: the use cases of policy zones

On Thu, Feb 29, 2024 at 6:31 PM Yang Shi <shy828301@xxxxxxxxx> wrote:
>
> On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > There are three types of zones:
> > 1. The first four zones partition the physical address space of CPU
> >    memory.
> > 2. The device zone provides interoperability between CPU and device
> >    memory.
> > 3. The movable zone commonly represents a memory allocation policy.
> >
> > Though originally designed for memory hot removal, the movable zone is
> > instead widely used for other purposes, e.g., CMA and kdump kernel, on
> > platforms that do not support hot removal, e.g., Android and ChromeOS.
> > Nowadays, it is legitimately a zone independent of any physical
> > characteristics. In spite of being somewhat regarded as a hack,
> > largely due to the lack of a generic design concept for its true major
> > use cases (on billions of client devices), the movable zone naturally
> > resembles a policy (virtual) zone overlayed on the first four
> > (physical) zones.
> >
> > This proposal formally generalizes this concept as policy zones so
> > that additional policies can be implemented and enforced by subsequent
> > zones after the movable zone. An inherited requirement of policy zones
> > (and the first four zones) is that subsequent zones must be able to
> > fall back to previous zones and therefore must add new properties to
> > the previous zones rather than remove existing ones from them. Also,
> > all properties must be known at the allocation time, rather than the
> > runtime, e.g., memory object size and mobility are valid properties
> > but hotness and lifetime are not.
> >
> > ZONE_MOVABLE becomes the first policy zone, followed by two new policy
> > zones:
> > 1. ZONE_NOSPLIT, which contains pages that are movable (inherited from
> >    ZONE_MOVABLE) and restricted to a minimum order to be
> >    anti-fragmentation. The latter means that they cannot be split down
> >    below that order, while they are free or in use.
> > 2. ZONE_NOMERGE, which contains pages that are movable and restricted
> >    to an exact order. The latter means that not only is split
> >    prohibited (inherited from ZONE_NOSPLIT) but also merge (see the
> >    reason in Chapter Three), while they are free or in use.
> >
> > Since these two zones only can serve THP allocations (__GFP_MOVABLE |
> > __GFP_COMP), they are called THP zones. Reclaim works seamlessly and
> > compaction is not needed for these two zones.
> >
> > Compared with the hugeTLB pool approach, THP zones tap into core MM
> > features including:
> > 1. THP allocations can fall back to the lower zones, which can have
> >    higher latency but still succeed.
> > 2. THPs can be either shattered (see Chapter Two) if partially
> >    unmapped or reclaimed if becoming cold.
> > 3. THP orders can be much smaller than the PMD/PUD orders, e.g., 64KB
> >    contiguous PTEs on arm64 [1], which are more suitable for client
> >    workloads.
>
> I think the allocation fallback policy needs to be elaborated. IIUC,
> when allocating large folios, if the order > min order of the policy
> zones, the fallback policy should be ZONE_NOSPLIT/NOMERGE ->
> ZONE_MOVABLE -> ZONE_NORMAL, right?

Correct.

> If all other zones are depleted, the allocation, whose order is < the
> min order, won't fallback to the policy zones and will fail, just like
> the non-movable allocation can't fallback to ZONE_MOVABLE even though
> there is enough memory for that zone, right?

Correct. In this case, userspace can consider dynamic resizing.
(The resizing patches are not included since, as I said in the other
thread, we need to focus on the first few steps at the current stage.)
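
To make the fallback rules discussed above concrete, here is a minimal
toy model in C. It is only a sketch, not code from the patches: the
toy_* names, the reduced zone list and the two order values are all
made up for illustration. It assumes a request may use a zone only if
it has every property that zone requires, and that eligible zones are
tried from highest to lowest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy zone list, lowest to highest; real kernels have more zones. */
enum toy_zone { TOY_NORMAL, TOY_MOVABLE, TOY_NOSPLIT, TOY_NOMERGE, TOY_NR_ZONES };

/* Hypothetical orders, chosen only for illustration. */
enum { TOY_NOSPLIT_MIN_ORDER = 9, TOY_NOMERGE_ORDER = 4 };

struct toy_req {
	bool movable;        /* stands in for __GFP_MOVABLE */
	bool comp;           /* stands in for __GFP_COMP */
	unsigned int order;  /* requested allocation order */
};

/* A zone accepts a request only if the request has all of its properties. */
static bool toy_zone_allows(enum toy_zone z, const struct toy_req *r)
{
	switch (z) {
	case TOY_NORMAL:
		return true;
	case TOY_MOVABLE:
		return r->movable;
	case TOY_NOSPLIT:	/* movable THP at or above the minimum order */
		return r->movable && r->comp && r->order >= TOY_NOSPLIT_MIN_ORDER;
	case TOY_NOMERGE:	/* movable THP of exactly one order */
		return r->movable && r->comp && r->order == TOY_NOMERGE_ORDER;
	default:
		return false;
	}
}

/* Fill out[] with the eligible zones, highest first; return how many. */
static size_t toy_fallback(const struct toy_req *r, enum toy_zone out[TOY_NR_ZONES])
{
	size_t n = 0;

	assert(r != NULL && out != NULL);
	for (int z = TOY_NR_ZONES - 1; z >= 0; z--)
		if (toy_zone_allows((enum toy_zone)z, r))
			out[n++] = (enum toy_zone)z;
	return n;
}
```

In this model, a movable order-9 THP request gets NOSPLIT -> MOVABLE ->
NORMAL, a movable order-0 request gets MOVABLE -> NORMAL, and a
non-movable request gets NORMAL only, so it fails once NORMAL is
depleted no matter how much memory the policy zones hold.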

Naturally, the next question would be why create this whole new
process rather than try to improve compaction. We did try the
latter: on servers, we tuned compaction and saw some good improvements
but soon hit a new wall; on clients, no luck at all because 1) they
are usually under much higher memory pressure than servers and 2) they
are more sensitive to latency.

So we needed a *more deterministic* approach when dealing with
fragmentation. Unlike compaction, which I'd call a heuristic, resizing
is more of a policy that userspace can have full control over.
Obviously, leaving the task to userspace can be a good or a bad
thing, depending on your point of view.

The bottom line is:
1. The resizing would also help the *existing* problem of unbalanced
ZONE_MOVABLE/other zones, for the non-hot-removal case.
2. Enlarging the THP zones would be more likely to succeed than
compaction would, because it targets the blocks it "donated" to
ZONE_MOVABLE with everything it has (both migration and reclaim) and
it keeps at it until it succeeds, whereas compaction lacks such
laser focus and is more of a best-effort approach.

(Needless to say, shrinking the THP zones can always succeed.)
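
As a rough illustration of the resizing rule quoted below (pages move
only among policy zones, never between a policy zone and a physical
zone), here is a toy sketch in C. The rz_* names and the per-zone page
counters are hypothetical and not taken from the patches. The sketch
only models the bookkeeping; it does not model the migration and
reclaim work that enlarging a THP zone needs before it can succeed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy zone list, lowest to highest; real kernels have more zones. */
enum rz_zone { RZ_NORMAL, RZ_MOVABLE, RZ_NOSPLIT, RZ_NOMERGE, RZ_NR_ZONES };

/* In this model, every zone from the movable zone upward is a policy zone. */
static bool rz_is_policy_zone(enum rz_zone z)
{
	return z >= RZ_MOVABLE && z < RZ_NR_ZONES;
}

/*
 * Move nr pages from one zone to another by "offlining" them in the
 * source and "onlining" them in the destination. Returns false, and
 * changes nothing, if either zone is not a policy zone, if both are
 * the same zone, or if the source does not hold enough pages.
 */
static bool rz_resize(unsigned long pages[RZ_NR_ZONES],
		      enum rz_zone from, enum rz_zone to, unsigned long nr)
{
	if (!rz_is_policy_zone(from) || !rz_is_policy_zone(to) || from == to)
		return false;
	if (pages[from] < nr)
		return false;
	pages[from] -= nr;	/* offline in the source zone */
	pages[to] += nr;	/* online in the destination zone */
	return true;
}
```

With this model, shrinking RZ_NOSPLIT into RZ_MOVABLE is allowed, while
any attempt to move pages to or from RZ_NORMAL is rejected.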


> > Policy zones can be dynamically resized by offlining pages in one of
> > them and onlining those pages in another of them. Note that this is
> > only done among policy zones, not between a policy zone and a physical
> > zone, since resizing is a (software) policy, not a physical
> > characteristic.
> >
> > Implementing the same idea in the pageblock granularity has also been
> > explored but rejected at Google. Pageblocks have a finer granularity
> > and therefore can be more flexible than zones. The tradeoff is that
> > this alternative implementation was more complex and failed to bring a
> > better ROI. However, the rejection was mainly due to its inability to
> > be smoothly extended to 1GB THPs [2], which is a planned use case of
> > TAO.
> >
> > [1] https://lore.kernel.org/20240215103205.2607016-1-ryan.roberts@xxxxxxx/
> > [2] https://lore.kernel.org/20200928175428.4110504-1-zi.yan@xxxxxxxx/




