On Mon, Oct 21, 2024 at 11:51:38AM +0200, David Hildenbrand wrote:
>
>
> On 16.10.24 at 21:24, Gregory Price wrote:
> > When physical address regions are not aligned to memory block size,
> > the misaligned portion is lost (stranded capacity).
> >
> > Block size (min/max/selected) is architecture defined. Most architectures
> > tend to use the minimum block size or some simplistic heuristic. On x86,
> > memory block size increases up to 2GB, and is otherwise fitted to the
> > alignment of non-hotplug (special purpose memory).
> >
> > CXL exposes its memory for management through the ACPI CEDT (CXL Early
> > Discovery Table) in a field called the CXL Fixed Memory Window. Per
> > the CXL specification, this memory must be aligned to at least 256MB.
> >
> > When a CFMW aligns on a size less than the block size, this causes a
> > loss of up to 2GB per CFMW on x86. It is not uncommon for CFMW to be
> > allocated per-device - though this behavior is BIOS defined.
> >
> > This patch set provides 3 things:
> > 1) implement advise/probe functions in mm/memblock.c to report/probe
> > architecture agnostic hotplug memory alignment advice.
> > 2) update x86 memblock size logic to consider the hotplug advice
> > 3) add code in acpi/numa/srat.c to report CFMW alignment advice
> >
> > The advisement interfaces are designed to be called during arch_init
> > code prior to allocator and smp_init. start_kernel will call these
> > through setup_arch() (via acpi and mm/init_64.c on x86), which occurs
> > prior to mm_core_init and smp_init - so no need for atomics.
> >
> > There's an attempt to signal callers to advise() that probe has already
> > occurred, but this is predicated on the notion that probe() actually
> > occurs (which presently only happens on x86). This is to assist debugging
> > future users who may mistakenly call this after allocator or smp init.
> >
> > Likewise, if probe() occurs more than once, we return -EBUSY to prevent
> > inconsistent values from being reported - i.e. this interaction should
> > happen exactly once, and all other behavior is an error / the probed
> > value should be acquired via memory_block_size_bytes() instead.
> >
> > Suggested-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> > Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> > Suggested-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> > Signed-off-by: Gregory Price <gourry@xxxxxxxxxx>
>
> Just as a side note, a while ago there was a discussion about variable-sized
> memory blocks -- essentially removing memory_block_size_bytes().
>

If you have any links, happy to do some reading up on it. Was going to
look into some more memblock behavior in the future so it's worth looking at.

> The main issue is that this would change /sys/devices/system/memory/ in ways
> it could break existing user space. I believe there are other corner cases
> that are a bit nasty to handle (e.g., removing parts of a larger memory
> block), but likely it could be handled.
>

This is why I wanted to avoid a new interface in the first place and just
piggyback on set_memory_block_size_order - now there are two interfaces to do
the same thing and more hurdles. But I suppose the suggestive-nature of this
one makes it far less offensive since it can be completely ignored.

~Gregory
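
For illustration only, a minimal userspace sketch of the stranded-capacity
arithmetic from the cover letter. It assumes a simplified model in which only
whole memory blocks that fit inside a window are usable. The helper names
(align_up, align_down, stranded_bytes) are hypothetical and are not taken
from the patch set or from mm/memblock.c.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
#define GiB (1024ULL * MiB)

/* Round addr up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (addr + align - 1) & ~(align - 1);
}

/* Round addr down to a multiple of align (align must be a power of two). */
static uint64_t align_down(uint64_t addr, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Assumed model: bytes of [start, start + size) that are not covered by
 * any whole, block-aligned memory block are counted as stranded.
 */
static uint64_t stranded_bytes(uint64_t start, uint64_t size, uint64_t block)
{
	uint64_t end = start + size;
	uint64_t first = align_up(start, block);   /* first usable block start */
	uint64_t last = align_down(end, block);    /* end of last usable block */

	if (first >= last)
		return size;	/* no whole block fits in the window */

	return (first - start) + (end - last);
}
```

Under this model, a 4GB window starting at 4GB + 256MB strands 2GB in total
when the block size is 2GB, and nothing when the block size is 256MB.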