Hi,
our range allocator -- alloc_contig_range() -- already works fairly
reliably with MIGRATE_CMA, as used by the CMA allocator, and
ZONE_MOVABLE, as used by virtio-mem for memory hotunplug. However, there
are some things to improve, especially when allocating from one of the
kernel zones, such as ZONE_NORMAL, as used for allocating gigantic pages
and by virtio-mem for memory hotunplug.
a) MAX_ORDER (and pageblock_order) limitation
The current implementation is tightly glued to pageblock_order and
MAX_ORDER. For example, alloc_contig_range() works fairly unreliably on
ZONE_NORMAL with granularity < MAX_ORDER - 1, because we isolate all
pageblocks in the MAX_ORDER - 1 range and any unmovable page in that
range makes us bail out. Further, when isolating a pageblock we lose
movability information, so isolating a (partially) unmovable pageblock
might be problematic and we would like to retain the original movability
information.
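
To make the widening concrete, here is a minimal userspace sketch (not
kernel code). The constants assume x86-64 with 4 KiB base pages, and the
helper names isolation_range() and nr_pageblocks() are made up for
illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 values with 4 KiB base pages; other configs differ. */
#define PAGEBLOCK_ORDER 9   /* 512 pages == 2 MiB per pageblock */
#define MAX_ORDER       11  /* largest buddy order is MAX_ORDER - 1 */

/* Half-open PFN range [start, end). */
struct pfn_range {
	uint64_t start;
	uint64_t end;
};

static uint64_t align_down(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical helper: widen a requested PFN range to the given
 * isolation granularity, expressed as a page order.
 */
static struct pfn_range isolation_range(uint64_t start, uint64_t end,
					unsigned int order)
{
	uint64_t nr_pages;
	struct pfn_range r;

	assert(order < 64);
	nr_pages = 1ULL << order;
	r.start = align_down(start, nr_pages);
	r.end = align_up(end, nr_pages);
	return r;
}

/* Number of pageblocks covered by an already pageblock-aligned range. */
static uint64_t nr_pageblocks(struct pfn_range r)
{
	return (r.end - r.start) >> PAGEBLOCK_ORDER;
}
```

Under these assumptions, a request for the single pageblock at PFNs
[512, 1024) is widened to [0, 1024) with MAX_ORDER - 1 granularity, so
two pageblocks get isolated and an unmovable page in the neighbouring
pageblock is enough to fail the request. With pageblock_order
granularity the range stays at [512, 1024).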
As one example, virtio-mem currently uses MAX_ORDER - 1 granularity
instead of a smaller granularity (like pageblock_order); on x86-64, for
example, it only supports (un)plug of 4 MiB chunks. We'd like to support
2 MiB here.
As another example, a CMA area has to be aligned to MAX_ORDER - 1 due to
the current limitations. pageblock_order is still problematic on some
archs (arm64 with 64 KiB base pages), but getting rid of the MAX_ORDER
limitation feels like low-hanging fruit.
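
For reference, the sizes involved can be worked out with a small
userspace sketch. The page shifts and orders used in the note below are
assumptions about common configurations (x86-64 with 4 KiB pages, arm64
with 64 KiB pages and PMD-sized huge pages), and both helper names are
hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Size in bytes of a naturally aligned chunk of 2^order base pages. */
static uint64_t chunk_bytes(unsigned int page_shift, unsigned int order)
{
	assert(page_shift + order < 64);
	return 1ULL << (page_shift + order);
}

/*
 * Hypothetical check modelling the constraint described above: the
 * base address of a CMA area must be aligned to a MAX_ORDER - 1 chunk.
 */
static bool cma_base_ok(uint64_t base, unsigned int page_shift,
			unsigned int max_order)
{
	return (base & (chunk_bytes(page_shift, max_order - 1) - 1)) == 0;
}
```

With page_shift = 12 and MAX_ORDER = 11, chunk_bytes(12, 10) gives the
4 MiB granularity mentioned above, while chunk_bytes(12, 9) gives the
desired 2 MiB. Assuming pageblock_order is 13 with 64 KiB base pages,
chunk_bytes(16, 13) gives 512 MiB, which is why pageblock_order alone
remains problematic there.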
As there is interest in increasing MAX_ORDER, the problem will get worse
over time. The questions are 1) what it takes to only isolate a single
pageblock and not all pageblocks composing a MAX_ORDER - 1 range when
not required and 2) how to handle isolating partially unmovable pageblocks.
b) Shrinking the slab
set_migratetype_isolate() has a nice comment "FIXME: Now, memory hotplug
doesn't call shrink_slab() by itself". IIUC, we could significantly
improve alloc_contig_range() reliability on ZONE_NORMAL by shrinking
the slab in some environments. The questions are 1) who should shrink
the slab and 2) when, because it obviously can temporarily harm
performance. However, memory hotunplug already temporarily harms
performance.
Ideally, we'd want to shrink the slab only on the area of interest. How
could something like that be realized?
c) PCP handling
While we disable the PCP right now when offlining memory to avoid races
with concurrent freeing to the PCP, we don't do the same in
alloc_contig_range(); instead, we only drain the PCP once.
Disabling the PCP will currently lock a mutex until re-enabled, which
would essentially serialize alloc_contig_range(), which is undesired.
What would it take to make disabling the PCP scale? Do we care at all or
can the races actually result in significant allocation failures,
especially on ZONE_MOVABLE or MIGRATE_CMA?
d) Unification of alloc_contig_range() and memory offlining code
Both do roughly the same thing, albeit with some notable differences
(dissolving huge pages, retry handling, ...). What does it take to unify
both, or are there compelling reasons to not unify them?
--
Thanks,
David / dhildenb