[LSF/MM/BPF TOPIC] Improving alloc_contig_range()

Hi,

our range allocator -- alloc_contig_range() -- already works fairly reliably with MIGRATE_CMA, as used by the CMA allocator, and ZONE_MOVABLE, as used by virtio-mem for memory hotunplug. However, there are some things to improve, especially when allocating from one of the kernel zones, such as ZONE_NORMAL, as used for allocating gigantic pages and by virtio-mem for memory hotunplug.

a) MAX_ORDER (and pageblock_order) limitation

The current implementation is tightly coupled to pageblock_order and MAX_ORDER. For example, alloc_contig_range() works fairly unreliably on ZONE_NORMAL with granularity < MAX_ORDER - 1, because we isolate all pageblocks in the MAX_ORDER - 1 range, and any unmovable page in that range makes the allocation bail out. Further, when isolating a pageblock we lose its movability information, so isolating a (partially) unmovable pageblock might be problematic; we would like to retain the original movability information.
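
To illustrate the effect, here is a minimal user-space sketch (not kernel code) that counts how many pageblocks get isolated when a request is widened to a given alignment. The constants assume x86-64 with 4 KiB base pages (pageblock_order = 9, MAX_ORDER = 11); the helper names are made up for this example.

```c
#include <assert.h>

/* Assumed values for x86-64 with 4 KiB base pages; other configs differ. */
#define PAGEBLOCK_ORDER 9UL   /* 512 pages per pageblock */
#define MAX_ORDER       11UL  /* largest buddy order is MAX_ORDER - 1 */

/* Round pfn down to a multiple of (1 << order). */
static unsigned long pfn_align_down(unsigned long pfn, unsigned long order)
{
	return pfn & ~((1UL << order) - 1);
}

/* Round pfn up to a multiple of (1 << order). */
static unsigned long pfn_align_up(unsigned long pfn, unsigned long order)
{
	return pfn_align_down(pfn + (1UL << order) - 1, order);
}

/*
 * Number of pageblocks isolated for the request [start, end) when the
 * isolation range is widened to align_order alignment (hypothetical helper).
 */
static unsigned long isolated_pageblocks(unsigned long start, unsigned long end,
					 unsigned long align_order)
{
	unsigned long iso_start, iso_end;

	assert(start < end);
	iso_start = pfn_align_down(start, align_order);
	iso_end = pfn_align_up(end, align_order);
	return (iso_end - iso_start) >> PAGEBLOCK_ORDER;
}
```

With these assumptions, a request for the single pageblock [512, 1024) isolates two pageblocks at MAX_ORDER - 1 alignment, so an unmovable page in the neighbouring pageblock [0, 512) is enough to make it fail, although that pageblock was never requested.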

As one example, virtio-mem currently uses MAX_ORDER - 1 granularity instead of a smaller granularity (like pageblock_order), so on x86-64 it only supports (un)plug of 4 MiB chunks. We'd like to support 2 MiB here.

As another example, a CMA area has to be aligned to MAX_ORDER - 1 due to the current limitations. pageblock_order is still problematic on some archs (arm64 with 64 KiB base pages), but getting rid of the MAX_ORDER limitation feels like low-hanging fruit.
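
For reference, the sizes behind these two examples can be worked out as below. This is only a sketch: the struct and helper names are invented, and the arm64 values (pageblock_order = 13, MAX_ORDER = 14 with 64 KiB base pages) are my assumption about the default configuration, not something taken from a specific kernel tree.

```c
#include <assert.h>

/* Hypothetical description of one configuration. */
struct mm_config {
	unsigned int page_shift;      /* log2 of the base page size */
	unsigned int pageblock_order;
	unsigned int max_order;       /* largest buddy order is max_order - 1 */
};

/* Assumed defaults; real values depend on the kernel config. */
static const struct mm_config x86_64_4k = { 12, 9, 11 };
static const struct mm_config arm64_64k = { 16, 13, 14 };

/* Size in bytes of a block of (1 << order) base pages. */
static unsigned long long block_bytes(unsigned int page_shift, unsigned int order)
{
	assert(page_shift + order < 64);
	return 1ULL << (page_shift + order);
}

/* Granularity when working on single pageblocks. */
static unsigned long long pageblock_bytes(const struct mm_config *c)
{
	return block_bytes(c->page_shift, c->pageblock_order);
}

/* Granularity when tied to MAX_ORDER - 1, as today. */
static unsigned long long max_order_bytes(const struct mm_config *c)
{
	return block_bytes(c->page_shift, c->max_order - 1);
}
```

Under these assumptions x86-64 gives 2 MiB per pageblock versus 4 MiB per MAX_ORDER - 1 block, while arm64 with 64 KiB base pages gives 512 MiB for both, which is why pageblock_order alone remains a problem there.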

As there is interest in increasing MAX_ORDER, the problem will get worse over time. The questions are 1) what it takes to isolate only a single pageblock, and not all pageblocks composing a MAX_ORDER - 1 range, when that is not required and 2) how to handle isolating partially unmovable pageblocks.

b) Shrinking the slab

set_migratetype_isolate() has a nice comment "FIXME: Now, memory hotplug doesn't call shrink_slab() by itself". IIUC, in some environments we could significantly improve alloc_contig_range() reliability on ZONE_NORMAL by shrinking the slab. The questions are 1) who should shrink the slab and 2) when, because it obviously can temporarily harm performance. However, memory hotunplug already temporarily harms performance.

Ideally, we'd want to shrink the slab only on the area of interest. How could something like that be realized?

c) PCP handling

While we disable the PCP right now when offlining memory to avoid races with concurrent freeing to the PCP, we don't do the same in alloc_contig_range(); instead, we only drain the PCP once.

Disabling the PCP currently takes a mutex that is held until the PCP is re-enabled. This would essentially serialize alloc_contig_range(), which is undesirable.
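
The serialization can be modelled in user space as follows. This is a toy sketch with made-up names, not the kernel implementation: a single global flag stands in for the mutex, and the "try" variant returns false where the kernel would sleep until the lock is released.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one global lock held while the PCP is disabled. */
static atomic_flag model_pcp_lock = ATOMIC_FLAG_INIT;

/*
 * Try to disable the PCP. Returns true if this caller now holds the lock,
 * false if another caller has the PCP disabled already (the kernel would
 * block on the mutex here instead of returning).
 */
static bool model_pcp_try_disable(void)
{
	return !atomic_flag_test_and_set(&model_pcp_lock);
}

/* Re-enable the PCP; must only be called by the caller that disabled it. */
static void model_pcp_enable(void)
{
	bool was_held = atomic_flag_test_and_set(&model_pcp_lock);

	assert(was_held);
	(void)was_held; /* avoid an unused warning when NDEBUG is set */
	atomic_flag_clear(&model_pcp_lock);
}
```

In this model, a second range allocation that calls model_pcp_try_disable() while the first one is still between disable and enable cannot proceed, no matter which zone or range it targets. That is the scaling concern: the lock is global, while the ranges being allocated are usually independent.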

What would it take to make disabling the PCP scale? Do we care at all or can the races actually result in significant allocation failures, especially on ZONE_MOVABLE or MIGRATE_CMA?

d) Unification of alloc_contig_range() and memory offlining code.

Both do roughly the same thing, although with some notable differences (dissolving huge pages, retry handling, ...). What does it take to unify both, or are there compelling reasons not to unify them?


--
Thanks,

David / dhildenb
