[LSF/MM/BPF TOPIC] Improving alloc_contig_range()

David Hildenbrand <david@xxxxxxxxxx> · Wed, 9 Jun 2021 15:39:51 +0200

Hi,

our range allocator -- alloc_contig_range() -- already works fairly
reliably with MIGRATE_CMA, as used by the CMA allocator, and
ZONE_MOVABLE, as used by virtio-mem for memory hotunplug. However, there 
are some things to improve, especially when allocating from one of the 
kernel zones, such as ZONE_NORMAL, as used for allocating gigantic pages 
and by virtio-mem for memory hotunplug.

a) MAX_ORDER (and pageblock_order) limitation

The current implementation is tightly glued to pageblock_order and
MAX_ORDER. For example, alloc_contig_range() works fairly unreliably on
ZONE_NORMAL with granularity < MAX_ORDER - 1, because we isolate all
pageblocks in the MAX_ORDER - 1 range, and any unmovable page in that
range makes the allocation bail out. Further, when isolating a pageblock
we lose its movability information, so isolating a (partially) unmovable
pageblock might be problematic; we would like to retain the original
movability information.
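
To illustrate the range widening described above, here is a minimal
userspace sketch (not kernel code). It assumes x86-64 with 4 KiB base
pages, MAX_ORDER = 11 and pageblock_order = 9; all sketch_* names are
hypothetical and exist only for this example.

```c
#include <assert.h>

/*
 * Assumed values for x86-64 with 4 KiB base pages: MAX_ORDER = 11,
 * pageblock_order = 9. Other configurations differ.
 */
#define SKETCH_MAX_ORDER_NR_PAGES  (1UL << 10)  /* 1 << (MAX_ORDER - 1) */
#define SKETCH_PAGEBLOCK_NR_PAGES  (1UL << 9)   /* 1 << pageblock_order */

/* Half-open PFN range [start, end). */
struct sketch_pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Widen a range so both ends are multiples of align (a power of two). */
static struct sketch_pfn_range sketch_widen(struct sketch_pfn_range r,
					    unsigned long align)
{
	struct sketch_pfn_range out;

	assert(align != 0 && (align & (align - 1)) == 0);
	assert(r.start < r.end);
	out.start = r.start & ~(align - 1);
	out.end = (r.end + align - 1) & ~(align - 1);
	return out;
}

/* Number of pageblocks isolated after widening the range to align. */
static unsigned long sketch_isolated_pageblocks(struct sketch_pfn_range r,
						unsigned long align)
{
	struct sketch_pfn_range w = sketch_widen(r, align);

	return (w.end - w.start) / SKETCH_PAGEBLOCK_NR_PAGES;
}
```

For a request covering exactly one pageblock, PFNs [0x200, 0x400),
widening to SKETCH_MAX_ORDER_NR_PAGES isolates two pageblocks, whereas
widening to SKETCH_PAGEBLOCK_NR_PAGES isolates only the requested one.
An unmovable page in the neighbouring pageblock would therefore only
matter in the first case.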

As one example, virtio-mem currently uses MAX_ORDER - 1 granularity
instead of a smaller granularity (like pageblock_order); on x86-64, for
example, it supports (un)plug of 4 MiB chunks only. We'd like to support
2 MiB here.

As another example, a CMA area has to be aligned to MAX_ORDER - 1 due to 
the current limitations. pageblock_order is still problematic on some 
archs (arm64 with 64 KiB base pages), but getting rid of the MAX_ORDER 
limitation feels like low-hanging fruit.
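
The sizes involved follow directly from the base page size and the
order. A small userspace sketch of that arithmetic is below; the helper
names are hypothetical. The x86-64 values (4 KiB pages, MAX_ORDER - 1 =
10, pageblock_order = 9) match the 4 MiB and 2 MiB figures above, while
pageblock_order = 13 for arm64 with 64 KiB base pages is an assumption
of this example.

```c
#include <assert.h>

/* Size in bytes of a naturally aligned chunk of the given order. */
static unsigned long long sketch_chunk_bytes(unsigned int page_shift,
					     unsigned int order)
{
	assert(page_shift + order < 64);
	return 1ULL << (page_shift + order);
}

/* Same size in MiB; yields 0 for chunks smaller than 1 MiB. */
static unsigned long long sketch_chunk_mib(unsigned int page_shift,
					   unsigned int order)
{
	return sketch_chunk_bytes(page_shift, order) >> 20;
}
```

With these helpers, sketch_chunk_mib(12, 10) is 4 and
sketch_chunk_mib(12, 9) is 2 for x86-64. Under the assumed arm64 values,
sketch_chunk_mib(16, 13) is 512, which is why pageblock_order alone
remains a coarse granularity there.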

As there is interest in increasing MAX_ORDER, the problem will get worse 
over time. The questions are 1) what it takes to only isolate a single
pageblock and not all pageblocks composing a MAX_ORDER - 1 range when 
not required and 2) how to handle isolating partially unmovable pageblocks.

b) Shrinking the slab

set_migratetype_isolate() has a nice comment "FIXME: Now, memory hotplug 
doesn't call shrink_slab() by itself". IIUC, we could significantly 
improve alloc_contig_range() reliability on ZONE_NORMAL by shrinking
the slab in some environments. The questions are, 1) who should shrink 
the slab and 2) when, because it obviously can temporarily harm 
performance. However, memory hotunplug already temporarily harms 
performance.

Ideally, we'd want to shrink the slab only on the area of interest. How 
could something like that be realized?

c) PCP handling

While we disable the PCP right now when offlining memory to avoid races 
with concurrent freeing to the PCP, we don't do the same in 
alloc_contig_range(); instead, we only drain the PCP once.
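
The difference between draining once and disabling can be shown with a
toy, single-threaded model (not kernel code). It assumes that a
"disabled" PCP behaves like a per-CPU list with a flush threshold of
zero, so every freed page is handed to the buddy immediately. All toy_*
names are hypothetical, and the model flushes the whole list rather than
a batch.

```c
#include <assert.h>

struct toy_zone {
	int pcp_count;  /* pages sitting on the single modeled per-CPU list */
	int pcp_high;   /* flush threshold; 0 models a disabled PCP */
	int buddy_free; /* pages visible to the buddy allocator */
};

/* Move everything from the per-CPU list to the buddy. */
static void toy_drain(struct toy_zone *z)
{
	z->buddy_free += z->pcp_count;
	z->pcp_count = 0;
}

/* Free one page: it lands on the PCP first, which flushes when full. */
static void toy_free_page(struct toy_zone *z)
{
	z->pcp_count++;
	if (z->pcp_count >= z->pcp_high)
		toy_drain(z);
}

/* Disable the PCP: threshold 0, then drain. Returns the old threshold. */
static int toy_pcp_disable(struct toy_zone *z)
{
	int old_high = z->pcp_high;

	z->pcp_high = 0;
	toy_drain(z);
	return old_high;
}

/* Re-enable the PCP with the previously saved threshold. */
static void toy_pcp_enable(struct toy_zone *z, int old_high)
{
	z->pcp_high = old_high;
}
```

In this model, a page freed after a one-time toy_drain() stays on the
per-CPU list and is invisible to the buddy, which is the race in
question. Between toy_pcp_disable() and toy_pcp_enable(), a freed page
reaches the buddy right away.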

Disabling the PCP currently takes a mutex that is held until the PCP is
re-enabled, which would essentially serialize alloc_contig_range(); that
is undesirable.

What would it take to make disabling the PCP scale? Do we care at all or 
can the races actually result in significant allocation failures, 
especially on ZONE_MOVABLE or MIGRATE_CMA?

d) Unification of alloc_contig_range() and memory offlining code

Both do roughly the same thing, albeit with some notable differences
(dissolving huge pages, retry handling, ...). What does it take to unify
both, or are there compelling reasons not to unify them?

--
Thanks,

David / dhildenb