Re: [PATCH 1/6] KVM: Cache the least recently used slot index per vCPU

On Mon, Aug 02, 2021 at 04:36:24PM +0200, Paolo Bonzini wrote:
> On 31/07/21 00:37, David Matlack wrote:
> > The memslot for a given gfn is looked up multiple times during page
> > fault handling. Avoid binary searching for it multiple times by caching
> > the least recently used slot. There is an existing VM-wide LRU slot but
> > that does not work well for cases where vCPUs are accessing memory in
> > different slots (see performance data below).
> > 
> > Another benefit of caching the least recently used slot (versus looking
> > up the slot once and passing around a pointer) is speeding up memslot
> > lookups *across* faults and during spte prefetching.
> > 
> > To measure the performance of this change I ran dirty_log_perf_test with
> > 64 vCPUs and 64 memslots and measured "Populate memory time" and
> > "Iteration 2 dirty memory time".  Tests were run with eptad=N to force
> > dirty logging to use fast_page_fault so its performance could be
> > measured.
> > 
> > Config     | Metric                        | Before | After
> > ---------- | ----------------------------- | ------ | ------
> > tdp_mmu=Y  | Populate memory time          | 6.76s  | 5.47s
> > tdp_mmu=Y  | Iteration 2 dirty memory time | 2.83s  | 0.31s
> > tdp_mmu=N  | Populate memory time          | 20.4s  | 18.7s
> > tdp_mmu=N  | Iteration 2 dirty memory time | 2.65s  | 0.30s
> > 
> > The "Iteration 2 dirty memory time" results are especially compelling
> > because they are equivalent to running the same test with a single
> > memslot. In other words, fast_page_fault performance no longer scales
> > with the number of memslots.
> > 
> > Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> 
> It's the *most* recently used slot index, of course. :)  That's true of
> lru_slot as well.

*facepalm*

I'll include a patch in v2 to fix the name of the existing lru_slot to
something like mru_slot or last_used_slot.

> 
> > +static inline struct kvm_memory_slot *get_slot(struct kvm_memslots *slots, int slot_index)
> > +{
> > +	if (slot_index < 0 || slot_index >= slots->used_slots)
> > +		return NULL;
> > +
> > +	return &slots->memslots[slot_index];
> > +}
> > +
> 
> Since there are plenty of arrays inside struct kvm_memory_slot*, do we want
> to protect this against speculative out-of-bounds accesses with
> array_index_nospec?

I'm not sure when it makes sense to use array_index_nospec. Perhaps this
is a good candidate since vcpu->lru_slot_index and the length of
memslots[] can both be controlled by userspace.

I'll play around with adding it to see if there are any performance
trade-offs.

> 
> > +static inline struct kvm_memory_slot *
> > +search_memslots(struct kvm_memslots *slots, gfn_t gfn)
> > +{
> > +	int slot_index = __search_memslots(slots, gfn);
> > +
> > +	return get_slot(slots, slot_index);
> >  }
> 
> Let's use this occasion to remove the duplication between __gfn_to_memslot
> and search_memslots; you can make search_memslots do the search and place
> the LRU (ehm, MRU) code in __gfn_to_memslot only.  So:
> 
> - the new patch 1 (something like "make search_memslots search without LRU
> caching") is basically this series's patch 2, plus a tree-wide replacement
> of search_memslots with __gfn_to_memslot.
> 
> - the new patch 2 is this series's patch 1, except kvm_vcpu_gfn_to_memslot
> uses search_memslots instead of __search_memslots.  The comments in patch
> 2's commit message about the double misses move to this commit message.

Will do.

> 
> > +	if (slot)
> > +		vcpu->lru_slot_index = slot_index;
> 
> Let's call it lru_slot for consistency with the field of struct
> kvm_memory_slots.

I'll make sure the two have consistent names when I rename them to get
rid of "lru".

> 
> Paolo
> 
