On 05.08.22 15:38, Vlastimil Babka wrote:
> On 8/5/22 14:09, David Hildenbrand wrote:
>> On 05.08.22 13:49, Vlastimil Babka wrote:
>>> On 6/14/22 14:02, Kirill A. Shutemov wrote:
>>>> UEFI Specification version 2.9 introduces the concept of memory
>>>> acceptance. Some Virtual Machine platforms, such as Intel TDX or AMD
>>>> SEV-SNP, require memory to be accepted before it can be used by the
>>>> guest. Accepting happens via a protocol specific to the Virtual Machine
>>>> platform.
>>>>
>>>> There are several ways kernel can deal with unaccepted memory:
>>>>
>>>>  1. Accept all the memory during the boot. It is easy to implement and
>>>>     it doesn't have runtime cost once the system is booted. The downside
>>>>     is very long boot time.
>>>>
>>>>     Accept can be parallelized to multiple CPUs to keep it manageable
>>>>     (i.e. via DEFERRED_STRUCT_PAGE_INIT), but it tends to saturate
>>>>     memory bandwidth and does not scale beyond the point.
>>>>
>>>>  2. Accept a block of memory on the first use. It requires more
>>>>     infrastructure and changes in page allocator to make it work, but
>>>>     it provides good boot time.
>>>>
>>>>     On-demand memory accept means latency spikes every time kernel steps
>>>>     onto a new memory block. The spikes will go away once workload data
>>>>     set size gets stabilized or all memory gets accepted.
>>>>
>>>>  3. Accept all memory in background. Introduce a thread (or multiple)
>>>>     that gets memory accepted proactively. It will minimize time the
>>>>     system experience latency spikes on memory allocation while keeping
>>>>     low boot time.
>>>>
>>>>     This approach cannot function on its own. It is an extension of #2:
>>>>     background memory acceptance requires functional scheduler, but the
>>>>     page allocator may need to tap into unaccepted memory before that.
>>>>
>>>>     The downside of the approach is that these threads also steal CPU
>>>>     cycles and memory bandwidth from the user's workload and may hurt
>>>>     user experience.
>>>>
>>>> Implement #2 for now.
>>>> It is a reasonable default. Some workloads may
>>>> want to use #1 or #3 and they can be implemented later based on user's
>>>> demands.
>>>>
>>>> Support of unaccepted memory requires a few changes in core-mm code:
>>>>
>>>>   - memblock has to accept memory on allocation;
>>>>
>>>>   - page allocator has to accept memory on the first allocation of the
>>>>     page;
>>>>
>>>> Memblock change is trivial.
>>>>
>>>> The page allocator is modified to accept pages on the first allocation.
>>>> The new page type (encoded in the _mapcount) -- PageUnaccepted() -- is
>>>> used to indicate that the page requires acceptance.
>>>>
>>>> Architecture has to provide two helpers if it wants to support
>>>> unaccepted memory:
>>>>
>>>>  - accept_memory() makes a range of physical addresses accepted.
>>>>
>>>>  - range_contains_unaccepted_memory() checks anything within the range
>>>>    of physical addresses requires acceptance.
>>>>
>>>> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
>>>> Acked-by: Mike Rapoport <rppt@xxxxxxxxxxxxx>	# memblock
>>>> Reviewed-by: David Hildenbrand <david@xxxxxxxxxx>
>>>
>>> Hmm I realize it's not ideal to raise this at v7, and maybe it was discussed
>>> before, but it's really not great how this affects the core page allocator
>>> paths. Wouldn't it be possible to only release pages to page allocator when
>>> accepted, and otherwise use some new per-zone variables together with the
>>> bitmap to track how much exactly is where to accept? Then it could be hooked
>>> in get_page_from_freelist() similarly to CONFIG_DEFERRED_STRUCT_PAGE_INIT -
>>> if we fail zone_watermark_fast() and there are unaccepted pages in the zone,
>>> accept them and continue. With a static key to flip in case we eventually
>>> accept everything. Because this is really similar scenario to the deferred
>>> init and that one was solved in a way that adds minimal overhead.
>>
>> I kind of like just having the memory stats being correct (e.g., free
>> memory) and acceptance being an internal detail to be triggered when
>> allocating pages -- just like the arch_alloc_page() callback.
>
> Hm, good point about the stats. Could be tweaked perhaps so it appears
> correct on the outside, but might be tricky.
>
>> I'm sure we could optimize for the !unaccepted memory via static keys
>> also in this version with some checks at the right places if we find
>> this to hurt performance?
>
> It would be great if we would at least somehow hit the necessary code only
> when dealing with a >=pageblock size block. The bitmap approach and
> accepting everything smaller upfront actually seems rather compatible. Yet
> in the current patch we e.g. check PageUnaccepted(buddy) on every buddy size
> while merging.
>
> A list that sits beside the existing free_area, contains only >=pageblock
> order sizes of unaccepted pages (no migratetype distinguished) and we tap
> into it approximately before __rmqueue_fallback()? There would be some
> trickery around releasing zone-lock for doing accept_memory(), but should be
> manageable.
>

Just curious, do we have a microbenchmark that is able to reveal the impact
of such code changes before we start worrying?

--
Thanks,

David / dhildenb
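
For readers following the thread, the two architecture helpers named in the
quoted patch description, together with the bitmap tracking mentioned above,
can be sketched as a toy userspace model. This is not code from the patch.
The 2 MiB unit size, the 64-unit bitmap, the accept_calls counter and the
platform_accept_unit() and touch_range() functions are assumptions made up
for illustration, and the helper signatures are only inferred from the
description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy userspace model, NOT kernel code. All sizes are illustrative assumptions. */
#define UNIT_SHIFT 21                          /* assume one bitmap bit covers 2 MiB */
#define UNIT_SIZE  (UINT64_C(1) << UNIT_SHIFT)
#define NR_UNITS   64                          /* model 128 MiB of guest memory */

uint64_t unaccepted_bitmap = ~UINT64_C(0);     /* bit set = unit not yet accepted */
unsigned long accept_calls;                    /* counts simulated platform accepts */

/* Hypothetical stand-in for the platform-specific accept protocol. */
static void platform_accept_unit(uint64_t unit)
{
    (void)unit;
    accept_calls++;
}

/* Does any unit overlapping [start, end) still require acceptance? */
bool range_contains_unaccepted_memory(uint64_t start, uint64_t end)
{
    if (start >= end)
        return false;
    uint64_t first = start >> UNIT_SHIFT;
    uint64_t last = (end - 1) >> UNIT_SHIFT;
    for (uint64_t unit = first; unit <= last && unit < NR_UNITS; unit++)
        if (unaccepted_bitmap & (UINT64_C(1) << unit))
            return true;
    return false;
}

/* Accept every still-unaccepted unit overlapping [start, end). */
void accept_memory(uint64_t start, uint64_t end)
{
    if (start >= end)
        return;
    uint64_t first = start >> UNIT_SHIFT;
    uint64_t last = (end - 1) >> UNIT_SHIFT;
    for (uint64_t unit = first; unit <= last && unit < NR_UNITS; unit++) {
        uint64_t bit = UINT64_C(1) << unit;
        if (unaccepted_bitmap & bit) {
            platform_accept_unit(unit);
            unaccepted_bitmap &= ~bit;         /* later calls skip this unit */
        }
    }
}

/* Model of option #2: accept lazily, right before a range is first used. */
void touch_range(uint64_t start, uint64_t end)
{
    if (range_contains_unaccepted_memory(start, end))
        accept_memory(start, end);
}
```

In this model, touching the same range twice leaves accept_calls unchanged,
which is the property option #2 relies on: the cost is paid once per block,
on first use. It says nothing about the allocator fast-path overhead being
debated above.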