On Thu, Mar 23, 2023 at 2:59 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:20PM -0800, Vipin Sharma wrote:
> > Add documentation for KVM_CAP_NUMA_AWARE_PAGE_TABLE capability and
> > explain why it is needed.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > ---
> >  Documentation/virt/kvm/api.rst | 29 +++++++++++++++++++++++++++++
> >  1 file changed, 29 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 62de0768d6aa..7e3a1299ca8e 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -7669,6 +7669,35 @@ This capability is aimed to mitigate the threat that malicious VMs can
> >  cause CPU stuck (due to event windows don't open up) and make the CPU
> >  unavailable to host or other VMs.
> >
> > +7.34 KVM_CAP_NUMA_AWARE_PAGE_TABLE
> > +------------------------------
> > +
> > +:Architectures: x86
> > +:Target: VM
> > +:Returns: 0 on success, -EINVAL if vCPUs are already created.
> > +
> > +This capability allows userspace to enable NUMA aware page tables allocations.
>
> Call out that this capability overrides task mempolicies. e.g.
>
>   This capability causes KVM to use a custom NUMA memory policy when
>   allocating page tables. Specifically, KVM will attempt to co-locate
>   page tables pages with the memory that they map, rather than following
>   the mempolicy of the current task.
>
> > +NUMA aware page tables are disabled by default. Once enabled, prior to vCPU
> > +creation, any page table allocated during the life of a VM will be allocated
>
> The "prior to vCPU creation" part here is confusing because it sounds
> like you're talking about any page tables allocated before vCPU
> creation. Just delete that part and put it in a separate paragraph.
>
>   KVM_CAP_NUMA_AWARE_PAGE_TABLE must be enabled before any vCPU is
>   created, otherwise KVM will return -EINVAL.
>
> > +preferably from the NUMA node of the leaf page.
> > +
> > +Without this capability, default feature is to use current thread mempolicy and
>
> s/default feature is to/KVM will/
>
> > +allocate page table based on that.
>
> s/and allocate page table based on that./to allocate page tables./
>
> > +
> > +This capability is useful to improve page accesses by a guest. For example, an
>
> nit: Be more specific about how.
>
>   This capability aims to minimize the cost of TLB misses when a vCPU is
>   accessing NUMA-local memory, by reducing the number of remote memory
>   accesses needed to walk KVM's page tables.
>
> > +initialization thread which access lots of remote memory and ends up creating
> > +page tables on local NUMA node, or some service thread allocates memory on
> > +remote NUMA nodes and later worker/background threads accessing that memory
> > +will end up accessing remote NUMA node page tables.
>
> It's not clear if these examples are talking about what happens when
> KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled or disabled.
>
> Also it's important to distinguish virtual NUMA nodes from physical NUMA
> nodes and where these "threads" are running. How about this:
>
>   For example, when KVM_CAP_NUMA_AWARE_PAGE_TABLE is disabled and a vCPU
>   accesses memory on a remote NUMA node and triggers a KVM page fault,
>   KVM will allocate page tables to handle that fault on the node where
>   the vCPU is running rather than the node where the memory is allocated.
>   When KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled, KVM will allocate the
>   page tables on the node where the memory is located.
>
>   This is intended to be used in VM configurations that properly
>   virtualize NUMA. i.e. VMs with one or more virtual NUMA nodes, each of
>   which is mapped to a physical NUMA node. With this capability enabled
>   on such VMs, any guest memory access to virtually-local memory will be
>   translated through mostly[*] physically-local page tables, regardless
>   of how the memory was faulted in.
>
>   [*] KVM will fallback to allocating from remote NUMA nodes if the
>   preferred node is out of memory. Also, in VMs with 2 or more NUMA
>   nodes, higher level page tables will necessarily map memory across
>   multiple physical nodes.
>
> > So, a multi NUMA node
> > +guest, can with high confidence access local memory faster instead of going
> > +through remote page tables first.
> > +
> > +This capability is also helpful for host to reduce live migration impact when
> > +splitting huge pages during dirty log operations. If the thread splitting huge
> > +page is on remote NUMA node it will create page tables on remote node. Even if
> > +guest is careful in making sure that it only access local memory they will end
> > +up accessing remote page tables.
>
> Please also cover the limitations of this feature:
>
> - Impact on remote memory accesses (more expensive).
> - How KVM handles NUMA node exhaustion.
> - How high-level page tables can span multiple nodes.
> - What KVM does if it can't determine the NUMA node of the pfn.
> - What KVM does for faults on GPAs that aren't backed by a pfn.
>

Thanks for the suggestions, I will incorporate them in the next version.
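
For anyone following along from the userspace side, below is a rough, untested
sketch of how the proposed capability could be enabled, assuming it is wired up
through the standard KVM_ENABLE_CAP VM ioctl as described above (enable on the
VM fd before the first KVM_CREATE_VCPU, otherwise KVM returns -EINVAL). The
capability number is a placeholder, not a value taken from the series:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/kvm.h>

/* Placeholder; the real constant would come from the merged series. */
#ifndef KVM_CAP_NUMA_AWARE_PAGE_TABLE
#define KVM_CAP_NUMA_AWARE_PAGE_TABLE 227
#endif

int main(void)
{
        int kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
        if (kvm < 0) {
                perror("open /dev/kvm");
                return 1;
        }

        int vm = ioctl(kvm, KVM_CREATE_VM, 0);
        if (vm < 0) {
                perror("KVM_CREATE_VM");
                return 1;
        }

        /* Probe first; a kernel without the series simply won't report it. */
        if (ioctl(vm, KVM_CHECK_EXTENSION, KVM_CAP_NUMA_AWARE_PAGE_TABLE) <= 0) {
                fprintf(stderr, "NUMA-aware page tables not supported\n");
        } else {
                /* Must happen before any KVM_CREATE_VCPU, per the discussion. */
                struct kvm_enable_cap cap = {
                        .cap = KVM_CAP_NUMA_AWARE_PAGE_TABLE,
                };
                if (ioctl(vm, KVM_ENABLE_CAP, &cap))
                        perror("KVM_ENABLE_CAP");
        }

        /*
         * ... set up memslots (ideally with each virtual NUMA node's backing
         * memory bound to one physical node, e.g. via mbind()), then create
         * vCPUs ...
         */

        close(vm);
        close(kvm);
        return 0;
}

A real VMM would also want to bind each memslot's backing memory to the
physical node backing that virtual node, so the co-location described above
actually pays off; the probe via KVM_CHECK_EXTENSION is only there because the
capability is not in any released kernel.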