On 17.11.23 14:00, Gerald Schaefer wrote:
On Fri, 17 Nov 2023 00:08:31 +0100
David Hildenbrand <david@xxxxxxxxxx> wrote:
On 14.11.23 19:02, Sumanth Korikkar wrote:
The patch series implements "memmap on memory" feature on s390 and
provides the necessary fixes for it.
Thinking about this, one thing that makes s390x different from all the
other architectures in this series is the altmap handling.
I'm curious, why is that even required?
A memmap that is not marked as online in the section should not be
touched by anybody (except memory onlining code :) ). And if we do, it's
usually a BUG because that memmap might contain garbage/be poisoned or
completely stale, so we might want to track that down and fix it in any
case.
So what speaks against just letting add_memory() populate the memmap
from the altmap? Then, also the page tables for the memmap are already
in place when onlining memory.
Good question, I am not 100% sure if we ran into bugs, or simply assumed
that it is not OK to call __add_pages() when the memory for the altmap
is not accessible.
I mean, we create the direct map even though nobody should access that
memory, so maybe we can simply map the altmap even though nobody should
access that memory.
As I said, then, even the page tables for the altmap are allocated
already and memory onlining likely doesn't need any allocation anymore
(except, there is kasan or some other memory notifiers have special
Certainly simpler :)
Maybe there is also already a common code bug with that, s390 might be
special but that is often also good for finding bugs in common code ...
If it's only the page_init_poison() as noted by Sumanth, we could
disable that on s390x with an altmap some way or the other; should be
I mean, you effectively have your own poisoning if the altmap is
effectively inaccessible and makes your CPU angry on access :)
Last but not least, support for an inaccessible altmap might come in
handy for virtio-mem eventually, and make altmap support eventually
simpler. So added bonus points.
Then, adding two new notifier calls at the start of memory_block_online()
called something like MEM_PREPARE_ONLINE and at the end of
memory_block_offline() called something like MEM_FINISH_OFFLINE is still
suboptimal, but that's where standby memory could be
activated/deactivated, without messing with the altmap.
That way, the only s390x specific thing is that the memmap that should
not be touched by anybody is actually inaccessible, and you'd
activate/deactivate simply from the new notifier calls just the way we
used to do.
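
The ordering proposed above can be modeled in userspace as a rough
sketch. MEM_PREPARE_ONLINE and MEM_FINISH_OFFLINE are the names
suggested in this thread; the other four events exist in the kernel
today. struct block_model and all functions below are hypothetical
stand-ins, not kernel APIs, and the enum values are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

/* Event names as used in the thread; the numeric values are arbitrary. */
enum mem_event {
	MEM_PREPARE_ONLINE,	/* proposed: start of memory_block_online() */
	MEM_GOING_ONLINE,
	MEM_ONLINE,
	MEM_GOING_OFFLINE,
	MEM_OFFLINE,
	MEM_FINISH_OFFLINE,	/* proposed: end of memory_block_offline() */
};

/* Hypothetical model of one standby memory block. */
struct block_model {
	bool accessible;	/* backing storage, including the altmap, is activated */
	bool online;		/* pages were handed to the page allocator */
};

/* Stand-in for an arch-specific notifier callback. */
static void s390_model_notifier(struct block_model *b, enum mem_event ev)
{
	switch (ev) {
	case MEM_PREPARE_ONLINE:
		b->accessible = true;	/* activate standby memory */
		break;
	case MEM_FINISH_OFFLINE:
		b->accessible = false;	/* deactivate standby memory */
		break;
	default:
		break;			/* the existing events need no arch work here */
	}
}

/* Models the event order during onlining. */
static void model_block_online(struct block_model *b)
{
	s390_model_notifier(b, MEM_PREPARE_ONLINE);
	assert(b->accessible);	/* only now may the memmap in the altmap be touched */
	s390_model_notifier(b, MEM_GOING_ONLINE);
	b->online = true;
	s390_model_notifier(b, MEM_ONLINE);
}

/* Models the event order during offlining. */
static void model_block_offline(struct block_model *b)
{
	s390_model_notifier(b, MEM_GOING_OFFLINE);
	b->online = false;
	s390_model_notifier(b, MEM_OFFLINE);
	s390_model_notifier(b, MEM_FINISH_OFFLINE);
}
```

The point of the model is only the ordering: the memmap stays populated
the whole time, and accessibility is toggled strictly outside the window
in which anybody is allowed to touch it.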
It's still all worse than just adding/removing memory properly, using a
proper interface -- where you could alloc/free an actual memmap when the
altmap is not desired. But I know that people don't want to spend time
just doing it cleanly from scratch.
Yes, sometimes they need to be forced to do that :-)
I certainly won't force you if we can just keep the __add_pages() calls
as is; having an altmap that is inaccessible but fully prepared sounds
reasonable to me.
I can see how this gives an immediate benefit to existing s390x
installations without being too hacky and without taking a long time to
But I strongly suggest evaluating a new interface long-term.
So, we'll look into defining a "proper interface", and treat patches 1-3
separately as bug fixes? Especially patch 3 might be interesting for arm,
if they do not have ZONE_DEVICE, but still use the functions, they might
end up with the no-op version, not really freeing any memory.
It might make sense to
1) Send the first 3 out separately
2) Look into a simple variant that leaves __add_pages() calls alone and
only adds the new MEM_PREPARE_ONLINE/MEM_FINISH_OFFLINE notifiers --
well, and deals with an inaccessible altmap, like the
page_init_poison() when the altmap might be inaccessible.
3) Look into a proper interface to add/remove memory instead of relying
2) is certainly an improvement and might be desired in some cases. 3) is
more powerful (e.g., where you don't want an altmap because of
fragmentation) and future proof.
I suspect there will be installations where an altmap is undesired: it
fragments your address space with unmovable (memmap) allocations.
Currently, runtime allocations of gigantic pages are affected. Long-term
other large allocations (if we ever see very large THP) will be affected.
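
To get a feeling for the fragmentation, here is a back-of-the-envelope
calculation, assuming 4 KiB base pages, a 64-byte struct page and
128 MiB memory blocks. All three values are assumptions for
illustration; the block size in particular varies between systems. The
helper names are made up for this sketch.

```c
#include <stdint.h>

/* Bytes of memmap needed to describe one memory block. */
static uint64_t memmap_bytes(uint64_t block_bytes, uint64_t page_bytes,
			     uint64_t struct_page_bytes)
{
	return block_bytes / page_bytes * struct_page_bytes;
}

/*
 * Largest contiguous range left for movable data in a block whose
 * memmap sits at the start of that same block (the altmap case).
 */
static uint64_t largest_movable_run(uint64_t block_bytes, uint64_t page_bytes,
				    uint64_t struct_page_bytes)
{
	return block_bytes - memmap_bytes(block_bytes, page_bytes,
					  struct_page_bytes);
}
```

Under these assumptions, memmap_bytes(128 MiB, 4 KiB, 64) is 2 MiB, so
if every block uses an altmap, no contiguous movable range exceeds
126 MiB and a gigantic page spanning several blocks cannot be allocated
at runtime.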
For that reason, we want to either support variable-sized memory blocks
long-term, or simulate that by "grouping" memory blocks that share the
same altmap located on the first memory block in that group: but
onlining one block forces onlining of the whole group.
On s390x, which adds all memory ahead of time, it's hard to decide what
the right granularity will be, and suddenly changed online/offline
behavior might be quite "surprising" for users.
The user can give better hints when adding/removing memory explicitly.
David / dhildenb