Re: [PATCH v10 2/9] KVM: Introduce per-page memory attributes

Sean Christopherson <seanjc@xxxxxxxxxx> · Tue, 3 Jan 2023 23:06:37 +0000

On Tue, Jan 03, 2023, Wang, Wei W wrote:
> On Tuesday, January 3, 2023 9:40 AM, Chao Peng wrote:
> > > Because guest memory defaults to private, and now this patch stores
> > > the attributes with KVM_MEMORY_ATTRIBUTE_PRIVATE instead of
> > _SHARED,
> > > it would bring more KVM_EXIT_MEMORY_FAULT exits at the beginning of
> > > boot time. Maybe it can be optimized somehow in other places? e.g. set
> > > mem attr in advance.
> > 
> > KVM defaults to 'shared' because this ioctl can also be potentially used by
> > normal VMs and 'shared' sounds a value meaningful for both normal VMs and
> > confidential VMs. 
> 
> Do you mean a normal VM could have pages marked private? What's the usage?
> (If all the pages are just marked shared for normal VMs, then why do we need it)

No, there are potential use cases for per-page attribute/permissions, e.g. to
make select pages read-only, exec-only, no-exec, etc...

> > As for more KVM_EXIT_MEMORY_FAULT exits during the
> > booting time, yes, setting all memory to 'private' for confidential VMs through
> > this ioctl in userspace before guest launch is an approach for KVM userspace to
> > 'override' the KVM default and reduce the number of implicit conversions.
> 
> Most pages of a confidential VM are likely to be private pages. It seems more efficient
> (and not difficult to check vm_type) to have KVM defaults to "private" for confidential VMs
> and defaults to "shared" for normal VMs.

If done right, the default shouldn't matter all that much for efficiency.  KVM
needs to be able to effeciently track large ranges regardless of the default,
otherwise the memory overhead and the presumably cost of lookups will be painful.
E.g. converting a 1GiB chunk to shared should ideally require one entry, not 256k
entries.

Looks like that behavior was changed in v8 in response to feedback[*] that doing
xa_store_range() on a subset of an existing range (entry) would overwrite the
entire existing range (entry), not just the smaller subset.  xa_store_range() does
appear to be too simplistic for this use case, but looking at __filemap_add_folio(),
splitting an existing entry isn't super complex.

Using xa_store() for the very initial implementation is ok, and probably a good
idea since it's more obviously correct and will give us a bisection point.  But
we definitely want a more performant implementation sooner than later.  The hardest
part will likely be merging existing entries, but that can be done separately too,
and is probably lower priority.

E.g. (1) use xa_store() and always track at 4KiB granularity, (2) support storing
metadata in multi-index entries, and finally (3) support merging adjacent entries
with identical values.

[*] https://lore.kernel.org/all/CAGtprH9xyw6bt4=RBWF6-v2CSpabOCpKq5rPz+e-9co7EisoVQ@xxxxxxxxxxxxxx