On Thu, Mar 07, 2024, David Matlack wrote:
> On Thu, Mar 7, 2024 at 3:27 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> >
> > On 2024-03-07 02:37 PM, Sean Christopherson wrote:
> > > On Thu, Mar 07, 2024, David Matlack wrote:
> > > > Create memslot 0 at 0x100000000 (4GiB) to avoid it overlapping with
> > > > KVM's private memslot for the APIC-access page.
> > >
> > > This is going to cause other problems, e.g. from max_guest_memory_test.c
> > >
> > > 	/*
> > > 	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
> > > 	 * the guest's code, stack, and page tables.  Because selftests creates
> > > 	 * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
> > > 	 * just below the 4gb boundary.  This test could create memory at
> > > 	 * 1gb-3gb,but it's simpler to skip straight to 4gb.
> > > 	 */
> > > 	const uint64_t start_gpa = SZ_4G;
> > >
> > > Trying to move away from starting at '0' is going to be problematic/annoying,
> > > e.g. using low memory allows tests to safely assume 4GiB+ is always available.
> > > And I'd prefer not to make the infrastructure all twisty and weird for all tests
> > > just because memstress tests want to play with huge amounts of memory.
> > >
> > > Any chance we can solve this by using huge pages in the guest, and adjusting the
> > > gorilla math in vm_nr_pages_required() accordingly?  There's really no reason to
> > > use 4KiB pages for a VM with 256GiB of memory.  That'd also be more representative
> > > of real world workloads (at least, I hope real world workloads are using 2MiB or
> > > 1GiB pages in this case).
> >
> > There are real world workloads that use TiB of RAM with 4KiB mappings
> > (looking at you SAP HANA).
> >
> > What about giving tests an explicit "start" GPA they can use?  That would
> > fix max_guest_memory_test and avoid tests making assumptions about 4GiB
> > being a magically safe address to use.
> >
> > e.g. Something like this on top:
>
> Gah, I missed nx_huge_page_test.c, which needs similar changes to
> max_guest_memory_test.c.
>
> Also if you prefer the "start" address be a compile-time constant we
> can pick an arbitrary number above 4GiB and use that (e.g. start=32GiB
> would be more than enough for a 12TiB guest).

Aha!  Idea.  I think we can clean up multiple warts at once.

The underlying problem is the memory that is allocated for guest page tables.
The core code allocation is constant for any given test, and a complete
non-issue unless someone writes a *really* crazy test.  And while core data
(stack) allocations scale with the number of vCPUs, they are (a) artificially
bounded by the maximum number of vCPUs and (b) relatively small allocations
(a few pages per vCPU).

And for page table allocations, we already have this absurd magic value:

	#define KVM_GUEST_PAGE_TABLE_MIN_PADDR	0x180000

which I'm guessing exists to avoid clobbering code+stack allocations, but
that's an irrelevant tangent.

The other asset we now have is vm->memslots[NR_MEM_REGIONS], and more
specifically that allocations for guest page tables are done via
vm->memslots[MEM_REGION_PT].

So, rather than more hardcoded addresses and/or a knob to control _all_ code
allocations, I think we should provide a knob to say that MEM_REGION_PT should
go to memory above 4GiB.
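
Completely untested, and the helper name, the slot choice, and the call site
are all invented for illustration (assumes kvm_util.h for
vm_userspace_mem_region_add() and linux/sizes.h for SZ_4G), but the knob could
start out as something as simple as:

	/*
	 * Hypothetical helper: back guest page tables with their own memslot
	 * above 4GiB, and route all future page table allocations to it.
	 * vm_alloc_page_table() already pulls from vm->memslots[MEM_REGION_PT],
	 * so repointing the region is all it takes.
	 */
	static void vm_locate_page_tables_above_4gib(struct kvm_vm *vm,
						     uint64_t nr_pt_pages)
	{
		/* Any slot not used by the library or the test will do. */
		const uint32_t pt_slot = NR_MEM_REGIONS;

		vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS, SZ_4G,
					    pt_slot, nr_pt_pages, 0);

		vm->memslots[MEM_REGION_PT] = pt_slot;
	}

The catch is that it has to run before any guest mappings are created, so in
practice it'd probably get plumbed into VM creation rather than being called
by tests after the fact.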
And to make memslot handling maintainable in the long term:

 1. Add a knob to place MEM_REGION_PT at 4GiB (and as of this initial patch,
    conditionally in its own memslot).

 2. Use the PT_AT_4GIB (not the real name) knob for the various memstress
    tests that need it.

 3. Formalize memslots 0..2 (CODE, DATA, and PT) as being owned by the
    library, with memslots 3..MAX available for test usage.

 4. Modify tests that assume memslots 1..MAX are available, i.e. force them
    to start at MEM_REGION_TEST_DATA.

 5. Use separate memslots for CODE, DATA, and PT by default.  This will allow
    for more precise sizing of the CODE and DATA slots.

 6. Shrink the number of pages for CODE to a more reasonable number.
    Currently, vm_nr_pages_required() reserves 512 pages / 2MiB for per-VM
    assets, which at a glance seems ridiculously excessive.

 7. Use the PT_AT_4GIB knob in s390's CMMA test?  I suspect it does memslot
    shenanigans purely so that a low gfn (4096 in the test) is guaranteed to
    be available.

For #4, I think it's a fairly easy change.  E.g. in set_memory_region_test.c,
just do s/MEM_REGION_SLOT/MEM_REGION_TEST_DATA.  And for max_guest_memory_test.c:

@@ -157,14 +154,14 @@ static void calc_default_nr_vcpus(void)
 int main(int argc, char *argv[])
 {
 	/*
-	 * Skip the first 4gb and slot0.  slot0 maps <1gb and is used to back
-	 * the guest's code, stack, and page tables.  Because selftests creates
+	 * Skip the first 4gb, which are reserved by the core library for the
+	 * guest's code, stack, and page tables.  Because selftests creates
 	 * an IRQCHIP, a.k.a. a local APIC, KVM creates an internal memslot
 	 * just below the 4gb boundary.  This test could create memory at
 	 * 1gb-3gb,but it's simpler to skip straight to 4gb.
 	 */
 	const uint64_t start_gpa = SZ_4G;
-	const int first_slot = 1;
+	const int first_slot = MEM_REGION_TEST_DATA;
 
 	struct timespec time_start, time_run1, time_reset, time_run2;
 	uint64_t max_gpa, gpa, slot_size, max_mem, i;
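
For reference, since #3 and #4 lean on it: the region enum already exists in
tools/testing/selftests/kvm/include/kvm_util_base.h; the per-region comments
below are my annotations of the proposed 1:1 slot mapping, not what's in the
header today.

	enum kvm_mem_region_type {
		MEM_REGION_CODE,	/* slot 0: guest code */
		MEM_REGION_DATA,	/* slot 1: stacks and other core data */
		MEM_REGION_PT,		/* slot 2: guest page tables */
		MEM_REGION_TEST_DATA,	/* slot 3: first slot owned by tests */
		NR_MEM_REGIONS,
	};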