On Sun, Feb 23, 2025 at 08:51:27PM +0200, Mike Rapoport wrote:
> On Wed, Feb 12, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
> >
> > > As I've mentioned off-list earlier, KHO in its current form is the
> > > lowest level of abstraction for state preservation and it is by no
> > > means intended to provide complex drivers with all the tools
> > > necessary.
> >
> > My point is, I think it is the wrong level of abstraction and the
> > wrong FDT schema. It does not and cannot solve the problems we know
> > we will have, so why invest anything into that schema?
>
> Preserving a lot of random pages spread all over the place will be a
> problem no matter what. With kho_preserve_folio() the users will
> still need to save the physical address of that folio somewhere,

Yes, of course. However, the schema of each node now gets a choice for
how it does that, i.e. the iommu is probably going to just store the
top pointer of a page table and rely on the internal table pointers to
store the physical addresses.

My point is that the fdt "mem" should not be *mandatory* in the schema
because it is inherently unscalable and not what we want.

> structure that FDT will point to. So either instead of "mem"
> properties we'll have an "addresses" property or a pointer to yet
> another page that should be preserved and, by the way, "mem" may come
> handy in this case :)

I think the preservation of the memory should be completely
independent of the FDT schema of the nodes. If ftrace wants a "mem"
then sure, but the core preservation code should not parse it. Nodes
should be free to select whatever serialization scheme they want.

Memory preservation should be a separate self-contained node with its
own schema version. They should not be mixed together.

There should be a single API toward the drivers; they should not get
"automatic" preservation because they put magic stuff in the FDT.

> I don't see how the "mem" property contradicts future extensions and
> for simple use cases it is already enough.

You'd just have to throw out all this code parsing "mem" to build the
memblock. It also makes the weird preallocation of the FDT and its
related sysfs probably unnecessary, as that seems largely driven by
this unbounded "mem" attribute problem.

> I did an experiment and preserved 1GiB of random order-0 pages and
> measured the time required to reserve everything in memblock.
> The kho_deserialize() you suggested slightly outperformed
> kho_init_reserved_pages() that parsed a single "mem" property
> containing an array of <addr, size> pairs.

It has to be considered end-to-end; there is more cost to build up the
FDT array, and to copy it around as well. Your 16GiB of random order-0
pages is 64MB of FDT space to represent as 16-byte addr/len pairs.
That's a lot of memory to be allocating, zeroing and copying around
three (or four/five?) times.

So if the bitmap parsing is already slightly faster, I expect the
whole end-to-end solution will be notably faster.

> For a more random distribution of orders and a deeper FDT the
> difference of course would be higher, but still both options sucked
> relative to a maple tree serialized similarly to your tracker
> xarray.

I didn't like a maple-tree-like thing because the worst case memory
requirements become much higher - and it is more expensive to build it
on the serializing side (you have to run maple tree algorithms per-4k,
and then copy it out of the maple tree to a representation).
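To be concrete, the tracker side I was imagining is roughly shaped
like this. Purely a sketch, not code from any posted patch -
kho_order_tracker, kho_track_folio() and KHO_CHUNK_BITS are made-up
names:

#include <linux/bitmap.h>
#include <linux/gfp.h>
#include <linux/xarray.h>

/* One 512-byte bitmap chunk covers 4096 folios of a given order */
#define KHO_CHUNK_BITS	(512 * BITS_PER_BYTE)

/* One tracker per folio order, indexed by (pfn >> order) / KHO_CHUNK_BITS */
struct kho_order_tracker {
	struct xarray chunks;	/* index -> unsigned long bitmap[] */
};

static int kho_track_folio(struct kho_order_tracker *t, unsigned long pfn,
			   unsigned int order)
{
	unsigned long bit = pfn >> order;
	unsigned long idx = bit / KHO_CHUNK_BITS;
	unsigned long *bm;

	bm = xa_load(&t->chunks, idx);
	if (!bm) {
		bm = bitmap_zalloc(KHO_CHUNK_BITS, GFP_KERNEL);
		if (!bm)
			return -ENOMEM;
		/* xa_store() returns the old entry or an xa_err() entry */
		if (xa_is_err(xa_store(&t->chunks, idx, bm, GFP_KERNEL))) {
			bitmap_free(bm);
			return -ENOMEM;
		}
	}
	__set_bit(bit % KHO_CHUNK_BITS, bm);
	return 0;
}

This is also where the 512k-for-16G worst case below comes from: at
order 0, 16G is 4M bits, i.e. 1024 chunks of 512 bytes.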
Maybe a maple tree is better, but I'd defer to real data on real
systems before deciding.

With the numbers I was working with there are 512k of bitmaps worst
case for 16G of memory. If you imagine encoding ranges in, say, 8
bytes per range (52 bits of phys_addr_t, 12 bits of length) then you
get about 65k of ranges in the same 512k. That is only enough to store
a random distribution of 256MB of 4k pages.

Still, I'd like to see the memory preservation have its own
independent scheme, so if there is a better approach it can be
upgraded as a self-contained project. It should have no effect on the
schema of the other nodes, or on the API toward the drivers.

> > But why? Just do it right from the start? I spent like an hour
> > sketching that, the existing preservation code is also very simple,
> > why not just fix it right now?
>
> As I see it, we can have both: the "mem" property for simple use
> cases, or as a partial solution for complex use cases, and the
> tracker you proposed for preserving the order of the folios.

I don't think that is a good idea; it is unnecessarily complicated in
two ways.

Memory preservation should be integral to the system and be done in
one way that works well for all cases.

We definitely don't want two APIs toward drivers for this. If we have
the bitmap then all drivers should be updated to use it. The core code
parsing of the "mem" schema should be removed.

> And as another optimization we may want a maple tree for coalescing
> as much as possible to reduce the number of memblock_reserve() calls.

Is the bitmap scanning really such a high cost? It can coalesce the
set bitmap ranges with ffs/ffz if you want to run a memblock_reserve()
sort of thing.

However, I was not imagining using something as inefficient as
memblock_reserve() in the long run. It doesn't make sense to take a
bitmap and then convert it into ranges, parse the ranges to build up
the free list, then throw away the ranges.

Instead the bitmaps should be consulted as the free list is being
built up, immediately after allocating the struct pages. No ranges
ever.

I didn't try to show this because it is definitely complicated, but
the serialize side has everything indexed in xarrays so it can
generate a linear sorted list of 'de-serializing' instructions that
are slices of bitmaps of different orders. The code that builds the
free list would simply walk that linear list of instructions and not
add memory with set bits to the free list. A simple O(1)
de-serializing approach, with some cost on the serializing side.

I think going through memblock_reserve() is a good starting point, but
there is certainly a lot of room for improving away from using ranges.
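As a sketch of that starting point, the ffs/ffz-style scan over one
order-0 bitmap chunk is just this - again made-up names, reusing
KHO_CHUNK_BITS from above, and find_next_bit()/find_next_zero_bit()
are the kernel spelling of ffs/ffz over a bitmap:

#include <linux/bitops.h>
#include <linux/memblock.h>
#include <linux/pfn.h>

/*
 * Coalesce runs of set bits into [start, end) ranges and reserve
 * them, so every contiguous preserved range costs one
 * memblock_reserve() call instead of one per page.
 */
static void kho_reserve_chunk(const unsigned long *bm,
			      unsigned long chunk_base_pfn)
{
	unsigned long start = 0, end;

	while (start < KHO_CHUNK_BITS) {
		start = find_next_bit(bm, KHO_CHUNK_BITS, start);
		if (start >= KHO_CHUNK_BITS)
			break;
		end = find_next_zero_bit(bm, KHO_CHUNK_BITS, start);
		memblock_reserve(PFN_PHYS(chunk_base_pfn + start),
				 PFN_PHYS(end - start));
		start = end;
	}
}

Jason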