Re: [PATCH 1/4] KVM: VMX: read the PML log in the same order as it was written

Maxim Levitsky <mlevitsk@xxxxxxxxxx> · Thu, 12 Dec 2024 16:37:44 -0500

On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> On Wed, Dec 11, 2024, Maxim Levitsky wrote:
> > X86 spec specifies that the CPU writes to the PML log 'backwards'
> 
> SDM, because this is Intel specific.
True.
> 
> > or in other words, it first writes entry 511, then entry 510 and so on,
> > until it writes entry 0, after which the 'PML log full' VM exit happens.
> > 
> > I also confirmed on the bare metal that the CPU indeed writes the entries
> > in this order.
> > 
> > KVM on the other hand, reads the entries in the opposite order, from the
> > last written entry and towards entry 511 and dumps them in this order to
> > the dirty ring.
> > 
> > Usually this doesn't matter, except for one complex nesting case:
> > 
> > KVM reties the instructions that cause MMU faults.
> > This might cause an emulated PML log entry to be visible to L1's hypervisor
> > before the actual memory write was committed.
> > 
> > This happens when the L0 MMU fault is followed directly by the VM exit to
> > L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full'
> > event.
> 
> Hmm, this an L0 bug.  Exiting to L1 to deliver a pending IRQ in the middle of an
> instruction is a blatant architectural violation.  As discussed in the RSM =>
> SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU
> needs to enter the guest before generating an exit to L1.

Agree, note that this is to some extent visible not nested as well, for example 
this workaround that the test has is for non nested case.

One can argue that dirty ring is not an x86 feature, but I am sure that there are
other more complicated cases in which retried write can be observed by this or
other vCPUs in the violation of x86 spec.

> 
> Oof.  It's probably worse than that.  For this case, KVM would need to ensure the
> original instruction *completed*.  That would get really, really ugly.  And for
> something like VSCATTER, where each write can be completed independently, trying
> to do the right thing for PML would be absurdly complex.

I also agree. Instruction retry is much simpler and safer that emulating it, KVM
really can't stop doing this.

> I'm not opposed to fudging around processing the PML log in the "correct" order,
> because irrespective of this bug, populating the dirty ring using order in which
> accesses occurred is probably a good idea.
> 
> But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
> that avoiding exits to L1 would be beneficial, but what use case actually cares
> about dirty logging performance in L1?

It does help with performance by a lot and the implementation is emulated and simple.

For example this test without pml collects about 500 pages on each iteration
with default parameters, and about 2400 pages per iteration with pml 
(after the caches warm up).

> 
> [*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@xxxxxxxxxx
> 
> > This problem doesn't have a noticeable real-world impact because this
> > write retry is not much different from the guest writing to the same page
> > multiple times, which is also not reflected in the dirty log. The users of
> > the dirty logging only rely on correct reporting of the clean pages, or
> > in other words they assume that if a page is clean, then no writes were
> > committed to it since the moment it was marked clean.
> > 
> > However KVM has a kvm_dirty_log_test selftest, a test that tests both
> > the clean and the dirty pages vs the memory contents, and can fail if it
> > detects a dirty page which has an old value at the offset 0 which the test
> > writes.
> > 
> > To avoid failure, the test has a workaround for this specific problem:
> > 
> > The test skips checking memory that belongs to the last dirty ring entry,
> > which it has seen, relying on the fact that as long as memory writes are
> > committed in-order, only the last entry can belong to a not yet committed
> > memory write.
> > 
> > However, since L1's KVM is reading the PML log in the opposite direction
> > that L0 wrote it, the last dirty ring entry often will be not the last
> > entry written by the L0.
> > 
> > To fix this, switch the order in which KVM reads the PML log.
> > 
> > Note that this issue is not present on the bare metal, because on the
> > bare metal, an update of the A/D bits of a present entry, PML logging and
> > the actual memory write are all done by the CPU without any hypervisor
> > intervention and pending interrupt evaluation, thus once a PML log and/or
> > vCPU kick happens, all memory writes that are in the PML log are
> > committed to memory.
> > 
> > The only exception to this rule is when the guest hits a not present EPT
> > entry, in which case KVM first reads (backward) the PML log, dumps it to
> > the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
> > this to the dirty ring, thus making the entry be the last one in the
> > dirty ring.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
> >  arch/x86/kvm/vmx/vmx.h |  1 +
> >  2 files changed, 22 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 0f008f5ef6f0..6fb946b58a75 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	u64 *pml_buf;
> > -	u16 pml_idx;
> > +	u16 pml_idx, pml_last_written_entry;
> > +	int i;
> >  
> >  	pml_idx = vmcs_read16(GUEST_PML_INDEX);
> >  
> >  	/* Do nothing if PML buffer is empty */
> > -	if (pml_idx == (PML_ENTITY_NUM - 1))
> > +	if (pml_idx == PML_LAST_ENTRY)
> 
> Heh, this is mildly confusing, in that the first entry filled is actually called
> the "last" entry by KVM.  And then below, pml_list_written_entry could point at
> the first entry.
> 
> The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry
> becomes pml_tail_index.  It's not a circular buffer, but I think/hope head/tail
> terminology would be intuitive for most readers.

I agree here. Your proposal does seem better to me, so I'll adopt it in v2.
> 
> E.g. the for-loop becomes:
> 
> 	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--)
> 		u64 gpa;
> 
> 		gpa = pml_buf[i];
> 		WARN_ON(gpa & (PAGE_SIZE - 1));
> 		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> 	}
> 
> >  		return;
> > +	/*
> > +	 * PML index always points to the next available PML buffer entity
> > +	 * unless PML log has just overflowed, in which case PML index will be
> 

> If you don't have a strong preference, I vote to do s/entity/entry and then rename
> PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?).  I find the
> existing "entity" terminology weird and unhelpful, and arguably wrong.

I don't mind renaming this.

> 
>   entity - a thing with distinct and independent existence.
> 
> The things being consumed are entries in a buffer.
> 
> > +	 * 0xFFFF.
> > +	 */
> > +	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
> >  
> > -	/* PML index always points to next available PML buffer entity */
> > -	if (pml_idx >= PML_ENTITY_NUM)
> > -		pml_idx = 0;
> > -	else
> > -		pml_idx++;
> > -
> > +	/*
> > +	 * PML log is written backwards: the CPU first writes the entity 511
> > +	 * then the entity 510, and so on, until it writes the entity 0 at which
> > +	 * point the PML log full VM exit happens and the logging stops.
> 
> This is technically wrong.  The PML Full exit only occurs on the next write.
> E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.

I skipped over this part of the PRM when I was reading upon how PML works.
I will drop this sentence in the next version.

> 
>   If the PML index is not in the range 0–511, there is a page-modification log-full
>   event and a VM exit occurs. In this case, the accessed or dirty flag is not set,
>   and the guest-physical access that triggered the event does not occur.
> 

Do you have any comments for the rest of the patch series? If not then I'll send
v2 of the patch series.

Best regards,
	Maxim Levitsky