On Sun, Feb 23, 2025 at 08:51:27PM +0200, Mike Rapoport wrote:
> On Wed, Feb 12, 2025 at 01:43:03PM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 12, 2025 at 06:39:06PM +0200, Mike Rapoport wrote:
> >
> > > As I've mentioned off-list earlier, KHO in its current form is the
> > > lowest level of abstraction for state preservation and it is by no
> > > means intended to provide complex drivers with all the tools
> > > necessary.
> >
> > My point is, I think it is the wrong level of abstraction and the
> > wrong FDT schema. It does not and cannot solve the problems we know
> > we will have, so why invest anything into that schema?
>
> Preserving a lot of random pages spread all over the place will be a
> problem no matter what. With kho_preserve_folio() the users will
> still need to save the physical address of that folio somewhere,

Yes, of course. However, the schema of each node now gets a choice for
how it does that, i.e. the iommu is probably going to just store the
top pointer of a page table and rely on the internal table pointers to
store the physical addresses.

My point is that the fdt "mem" should not be *mandatory* in the schema
because it is inherently unscalable and not what we want.

> structure that FDT will point to. So either instead of "mem"
> properties we'll have an "addresses" property or a pointer to yet
> another page that should be preserved and, by the way, "mem" may come
> handy in this case :)

I think the preservation of the memory should be completely
independent of the FDT schema of the nodes. If ftrace wants a "mem"
then sure, but the core preservation code should not parse it. Nodes
should be free to select whatever serialization scheme they want.

Memory preservation should be a separate self-contained node with its
own schema version. They should not be mixed together.

There should be a single API toward the drivers; they should not get
"automatic" preservation because they put magic stuff in the FDT.

> I don't see how the "mem" property contradicts future extensions and
> for simple use cases it is already enough.

You'd just have to throw out all this code parsing "mem" to build the
memblock. It also makes the weird preallocation of the FDT and its
related sysfs probably unnecessary, as that seems largely driven by
this unbounded "mem" attribute problem.

> I did an experiment and preserved 1GiB of random order-0 pages and
> measured the time required to reserve everything in memblock.
> The kho_deserialize() you suggested slightly outperformed
> kho_init_reserved_pages() that parsed a single "mem" property
> containing an array of <addr, size> pairs.

It has to be considered end-to-end; there is more cost to build up the
FDT array, and to copy it around as well. Your 16GiB of random order-0
pages is 64MB of FDT space to represent as 16-byte addr/len pairs.
That's a lot of memory to be allocating, zeroing and copying around
three (or four/five?) times.

So if the bitmap parsing is already slightly faster, I expect the
whole end-to-end solution will be notably faster.

> For a more random distribution of orders and a deeper FDT the
> difference of course would be higher, but still both options sucked
> relative to a maple tree serialized similarly to your tracker
> xarray.

I didn't like a maple-tree-like thing because the worst case memory
requirements become much higher - and it is more expensive to build it
on the serializing side (you have to run maple tree algorithms per-4k,
and then copy it out of the maple tree to a representation).
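To be concrete, the tracker side I was imagining is roughly shaped
like this. Purely a sketch, not code from any posted patch -
kho_order_tracker, kho_track_folio() and KHO_CHUNK_BITS are made-up
names:

#include <linux/bitmap.h>
#include <linux/gfp.h>
#include <linux/xarray.h>

/* One 512-byte bitmap chunk covers 4096 folios of a given order */
#define KHO_CHUNK_BITS	(512 * BITS_PER_BYTE)

/* One tracker per folio order, indexed by (pfn >> order) / KHO_CHUNK_BITS */
struct kho_order_tracker {
	struct xarray chunks;	/* index -> unsigned long bitmap[] */
};

static int kho_track_folio(struct kho_order_tracker *t, unsigned long pfn,
			   unsigned int order)
{
	unsigned long bit = pfn >> order;
	unsigned long idx = bit / KHO_CHUNK_BITS;
	unsigned long *bm;

	bm = xa_load(&t->chunks, idx);
	if (!bm) {
		bm = bitmap_zalloc(KHO_CHUNK_BITS, GFP_KERNEL);
		if (!bm)
			return -ENOMEM;
		/* xa_store() returns the old entry or an xa_err() entry */
		if (xa_is_err(xa_store(&t->chunks, idx, bm, GFP_KERNEL))) {
			bitmap_free(bm);
			return -ENOMEM;
		}
	}
	__set_bit(bit % KHO_CHUNK_BITS, bm);
	return 0;
}

This is also where the 512k-for-16G worst case below comes from: at
order 0, 16G is 4M bits, i.e. 1024 chunks of 512 bytes.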
Maybe a maple tree is better, but I'd defer to real data on real
systems before deciding.

With the numbers I was working with there are 512k of bitmaps worst
case for 16G of memory. If you imagine encoding ranges in, say, 8
bytes per range (52 bits of phys_addr_t, 12 bits of length) then you
get about 65k of ranges in the same 512k. That is only enough to store
a random distribution of 256MB of 4k pages.

Still, I'd like to see the memory preservation have its own
independent scheme, so if there is a better approach it can be
upgraded as a self-contained project. It should have no effect on the
schema of the other nodes, or on the API toward the drivers.

> > But why? Just do it right from the start? I spent like an hour
> > sketching that, the existing preservation code is also very simple,
> > why not just fix it right now?
>
> As I see it, we can have both: the "mem" property for simple use
> cases, or as a partial solution for complex use cases, and the
> tracker you proposed for preserving the order of the folios.

I don't think that is a good idea; it is unnecessarily complicated in
two ways.

Memory preservation should be integral to the system and be done in
one way that works well for all cases.

We definitely don't want two APIs toward drivers for this. If we have
the bitmap then all drivers should be updated to use it. The core code
parsing of the "mem" schema should be removed.

> And as another optimization we may want a maple tree for coalescing
> as much as possible to reduce the number of memblock_reserve() calls.

Is the bitmap scanning really such a high cost? It can coalesce the
set bitmap ranges with ffs/ffz if you want to run a memblock_reserve()
sort of thing.

However, I was not imagining using something as inefficient as
memblock_reserve() in the long run. It doesn't make sense to take a
bitmap and then convert it into ranges, parse the ranges to build up
the free list, then throw away the ranges.

Instead the bitmaps should be consulted as the free list is being
built up, immediately after allocating the struct pages. No ranges
ever.

I didn't try to show this because it is definitely complicated, but
the serialize side has everything indexed in xarrays so it can
generate a linear sorted list of 'de-serializing' instructions that
are slices of bitmaps of different orders. The code that builds the
free list would simply walk that linear list of instructions and not
add memory with set bits to the free list. A simple O(1)
de-serializing approach, with some cost on the serializing side.

I think going through memblock_reserve() is a good starting point, but
there is certainly a lot of room for improving away from using ranges.
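As a sketch of that starting point, the ffs/ffz-style scan over one
order-0 bitmap chunk is just this - again made-up names, reusing
KHO_CHUNK_BITS from above, and find_next_bit()/find_next_zero_bit()
are the kernel spelling of ffs/ffz over a bitmap:

#include <linux/bitops.h>
#include <linux/memblock.h>
#include <linux/pfn.h>

/*
 * Coalesce runs of set bits into [start, end) ranges and reserve
 * them, so every contiguous preserved range costs one
 * memblock_reserve() call instead of one per page.
 */
static void kho_reserve_chunk(const unsigned long *bm,
			      unsigned long chunk_base_pfn)
{
	unsigned long start = 0, end;

	while (start < KHO_CHUNK_BITS) {
		start = find_next_bit(bm, KHO_CHUNK_BITS, start);
		if (start >= KHO_CHUNK_BITS)
			break;
		end = find_next_zero_bit(bm, KHO_CHUNK_BITS, start);
		memblock_reserve(PFN_PHYS(chunk_base_pfn + start),
				 PFN_PHYS(end - start));
		start = end;
	}
}

Jason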