Re: [PATCH 1/7] KVM: Document KVM_MAP_MEMORY ioctl

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Wed, 17 Apr 2024 22:37:00 +0200

On Wed, Apr 17, 2024 at 10:28 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Wed, Apr 17, 2024, Paolo Bonzini wrote:
> > +4.143 KVM_MAP_MEMORY
> > +------------------------
> > +
> > +:Capability: KVM_CAP_MAP_MEMORY
> > +:Architectures: none
> > +:Type: vcpu ioctl
> > +:Parameters: struct kvm_map_memory (in/out)
> > +:Returns: 0 on success, < 0 on error
> > +
> > +Errors:
> > +
> > +  ========== ===============================================================
> > +  EINVAL     The specified `base_address` and `size` were invalid (e.g. not
> > +             page aligned or outside the defined memory slots).
>
> "outside the memslots" should probably be -EFAULT, i.e. keep EINVAL for things
> that can _never_ succeed.
>
> > +  EAGAIN     The ioctl should be invoked again and no page was processed.
> > +  EINTR      An unmasked signal is pending and no page was processed.
>
> I'm guessing we'll want to handle large ranges, at which point we'll likely end
> up with EAGAIN and/or EINTR after processing at least one page.

Yes, in that case you get a success (return value of 0), just like read().

> > +  EFAULT     The parameter address was invalid.
> > +  EOPNOTSUPP The architecture does not support this operation, or the
> > +             guest state does not allow it.
>
> I would phrase this as something like:
>
>                 Mapping memory given for a GPA is unsupported by the
>                 architecture, and/or for the current vCPU state/mode.

Better.

> > +  struct kvm_map_memory {
> > +     /* in/out */
> > +     __u64 base_address;
>
> I think we should commit to this being limited to gpa mappings, e.g. go with
> "gpa", or "guest_physical_address" if we want to be verbose (I vote for "gpa").
>
> > +     __u64 size;
> > +     /* in */
> > +     __u64 flags;
> > +     __u64 padding[5];
> > +  };
> > +
> > +KVM_MAP_MEMORY populates guest memory in the page tables of a vCPU.
>
> I think we should word this very carefully and explicitly so that KVM doesn't
> commit to behavior that can't be guaranteed.  We might even want to use a name
> that explicitly captures the semantics, e.g. KVM_PRE_FAULT_MEMORY?
>
> Also, this doesn't populate guest _memory_, and "in the page tables of a vCPU"
> could be interpreted as the _guest's_ page tables.
>
> Something like:
>
>   KVM_PRE_FAULT_MEMORY populates KVM's stage-2 page tables used to map memory
>   for the current vCPU state.  KVM maps memory as if the vCPU generated a
>   stage-2 read page fault, e.g. faults in memory as needed, but doesn't break
>   CoW.  However, KVM does not mark any newly created stage-2 PTE as Accessed.
>
> > +When the ioctl returns, the input values are updated to point to the
> > +remaining range.  If `size` > 0 on return, the caller can just issue
> > +the ioctl again with the same `struct kvm_map_memory` argument.
>
> This is likely misleading.  Unless KVM explicitly zeros size on *every* failure,
> a pedantic reading of this would suggest that userspace can retry and it should
> eventually succeed.

Gotcha... KVM explicitly zeros size on every success, but never zeros
size on a failure.

> > +In some cases, multiple vCPUs might share the page tables.  In this
> > +case, if this ioctl is called in parallel for multiple vCPUs the
> > +ioctl might return with `size` > 0.
>
> Why?  If there's already a valid mapping, mission accomplished.  I don't see any
> reason to return an error.  If x86's page fault path returns RET_PF_RETRY, then I
> think it makes sense to retry in KVM, not punt this to userspace.

Considering that vcpu_mutex critical sections are killable I think I
tend to agree.

> > +The ioctl may not be supported for all VMs, and may just return
> > +an `EOPNOTSUPP` error if a VM does not support it.  You may use
> > +`KVM_CHECK_EXTENSION` on the VM file descriptor to check if it is
> > +supported.
>
> Why per-VM?  I don't think there's any per-VM state that would change the behavior.

Perhaps it may depend on the VM type? I'm trying to avoid having to
invent a different API later. But yeah, I can drop this sentence and
the related code.

> The TDP MMU being enabled is KVM wide, and the guest state modifiers that cause
> problems are per-vCPU, not per-VM.
>
> Adding support for KVM_CHECK_EXTENSION on vCPU FDs is probably overkill, e.g. I
> don't think it would add much value beyond returning EOPNOTSUPP for the ioctl()
> itself.

Yes, I agree.

Paolo