Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 11 Dec 2024 16:44:16 -0800

On Wed, Dec 11, 2024, Maxim Levitsky wrote:
> X86 spec specifies that the CPU writes to the PML log 'backwards'

SDM, because this is Intel specific.

> or in other words, it first writes entry 511, then entry 510 and so on,
> until it writes entry 0, after which the 'PML log full' VM exit happens.
> 
> I also confirmed on the bare metal that the CPU indeed writes the entries
> in this order.
> 
> KVM on the other hand, reads the entries in the opposite order, from the
> last written entry and towards entry 511 and dumps them in this order to
> the dirty ring.
> 
> Usually this doesn't matter, except for one complex nesting case:
> 
> KVM reties the instructions that cause MMU faults.
> This might cause an emulated PML log entry to be visible to L1's hypervisor
> before the actual memory write was committed.
> 
> This happens when the L0 MMU fault is followed directly by the VM exit to
> L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full'
> event.

Hmm, this an L0 bug.  Exiting to L1 to deliver a pending IRQ in the middle of an
instruction is a blatant architectural violation.  As discussed in the RSM =>
SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU
needs to enter the guest before generating an exit to L1.

Oof.  It's probably worse than that.  For this case, KVM would need to ensure the
original instruction *completed*.  That would get really, really ugly.  And for
something like VSCATTER, where each write can be completed independently, trying
to do the right thing for PML would be absurdly complex.

I'm not opposed to fudging around processing the PML log in the "correct" order,
because irrespective of this bug, populating the dirty ring using order in which
accesses occurred is probably a good idea.

But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
that avoiding exits to L1 would be beneficial, but what use case actually cares
about dirty logging performance in L1?

[*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@xxxxxxxxxx

> This problem doesn't have a noticeable real-world impact because this
> write retry is not much different from the guest writing to the same page
> multiple times, which is also not reflected in the dirty log. The users of
> the dirty logging only rely on correct reporting of the clean pages, or
> in other words they assume that if a page is clean, then no writes were
> committed to it since the moment it was marked clean.
> 
> However KVM has a kvm_dirty_log_test selftest, a test that tests both
> the clean and the dirty pages vs the memory contents, and can fail if it
> detects a dirty page which has an old value at the offset 0 which the test
> writes.
> 
> To avoid failure, the test has a workaround for this specific problem:
> 
> The test skips checking memory that belongs to the last dirty ring entry,
> which it has seen, relying on the fact that as long as memory writes are
> committed in-order, only the last entry can belong to a not yet committed
> memory write.
> 
> However, since L1's KVM is reading the PML log in the opposite direction
> that L0 wrote it, the last dirty ring entry often will be not the last
> entry written by the L0.
> 
> To fix this, switch the order in which KVM reads the PML log.
> 
> Note that this issue is not present on the bare metal, because on the
> bare metal, an update of the A/D bits of a present entry, PML logging and
> the actual memory write are all done by the CPU without any hypervisor
> intervention and pending interrupt evaluation, thus once a PML log and/or
> vCPU kick happens, all memory writes that are in the PML log are
> committed to memory.
> 
> The only exception to this rule is when the guest hits a not present EPT
> entry, in which case KVM first reads (backward) the PML log, dumps it to
> the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
> this to the dirty ring, thus making the entry be the last one in the
> dirty ring.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> ---
>  arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
>  arch/x86/kvm/vmx/vmx.h |  1 +
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0f008f5ef6f0..6fb946b58a75 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u64 *pml_buf;
> -	u16 pml_idx;
> +	u16 pml_idx, pml_last_written_entry;
> +	int i;
>  
>  	pml_idx = vmcs_read16(GUEST_PML_INDEX);
>  
>  	/* Do nothing if PML buffer is empty */
> -	if (pml_idx == (PML_ENTITY_NUM - 1))
> +	if (pml_idx == PML_LAST_ENTRY)

Heh, this is mildly confusing, in that the first entry filled is actually called
the "last" entry by KVM.  And then below, pml_list_written_entry could point at
the first entry.

The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry
becomes pml_tail_index.  It's not a circular buffer, but I think/hope head/tail
terminology would be intuitive for most readers.

E.g. the for-loop becomes:

	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--)
		u64 gpa;

		gpa = pml_buf[i];
		WARN_ON(gpa & (PAGE_SIZE - 1));
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}

>  		return;
> +	/*
> +	 * PML index always points to the next available PML buffer entity
> +	 * unless PML log has just overflowed, in which case PML index will be

If you don't have a strong preference, I vote to do s/entity/entry and then rename
PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?).  I find the
existing "entity" terminology weird and unhelpful, and arguably wrong.

  entity - a thing with distinct and independent existence.

The things being consumed are entries in a buffer.

> +	 * 0xFFFF.
> +	 */
> +	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
>  
> -	/* PML index always points to next available PML buffer entity */
> -	if (pml_idx >= PML_ENTITY_NUM)
> -		pml_idx = 0;
> -	else
> -		pml_idx++;
> -
> +	/*
> +	 * PML log is written backwards: the CPU first writes the entity 511
> +	 * then the entity 510, and so on, until it writes the entity 0 at which
> +	 * point the PML log full VM exit happens and the logging stops.

This is technically wrong.  The PML Full exit only occurs on the next write.
E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.

  If the PML index is not in the range 0–511, there is a page-modification log-full
  event and a VM exit occurs. In this case, the accessed or dirty flag is not set,
  and the guest-physical access that triggered the event does not occur.