On Fri, Aug 29, 2014 at 3:31 AM, Paolo Bonzini <pbonzini@xxxxxxxxxx> wrote:
> From: David Matlack <dmatlack@xxxxxxxxxx>
>
> vcpu exits and memslot mutations can run concurrently as long as the
> vcpu does not acquire the slots mutex. Thus it is theoretically possible
> for memslots to change underneath a vcpu that is handling an exit.
>
> If we increment the memslot generation number again after
> synchronize_srcu_expedited(), vcpus can safely cache memslot generation
> without maintaining a single rcu_dereference through an entire vm exit.
> And much of the x86/kvm code does not maintain a single rcu_dereference
> of the current memslots during each exit.
>
> We can prevent the following case:
>
> vcpu (CPU 0)                                | thread (CPU 1)
> --------------------------------------------+--------------------------
> 1  vm exit                                  |
> 2  srcu_read_unlock(&kvm->srcu)             |
> 3  decide to cache something based on       |
>    old memslots                             |
> 4                                           | change memslots
>                                             | (increments generation)
> 5                                           | synchronize_srcu(&kvm->srcu);
> 6  retrieve generation # from new memslots  |
> 7  tag cache with new memslot generation    |
> 8  srcu_read_unlock(&kvm->srcu)             |
> ...                                         |
>    <action based on cache occurs even       |
>     though the caching decision was based   |
>     on the old memslots>                    |
> ...                                         |
>    <action *continues* to occur until next  |
>     memslot generation change, which may    |
>     be never>                               |
>                                             |
>
> By incrementing the generation after synchronizing with kvm->srcu readers,
> we ensure that the generation retrieved in (6) will become invalid soon
> after (8).
>
> Keeping the existing increment is not strictly necessary, but we
> do keep it and just move it for consistency from update_memslots to
> install_new_memslots. It invalidates old cached MMIOs immediately,
> instead of having to wait for the end of synchronize_srcu_expedited,
> which makes the code more clearly correct in case CPU 1 is preempted
> right after synchronize_srcu() returns.
>
> To avoid halving the generation space in SPTEs, always presume that the
> low bit of the generation is zero when reconstructing a generation number
> out of an SPTE. This effectively disables MMIO caching in SPTEs during
> the call to synchronize_srcu_expedited. Using the low bit this way is
> somewhat like a seqcount---where the protected thing is a cache, and
> instead of retrying we can simply punt if we observe the low bit to be 1.
>
> Cc: stable@xxxxxxxxxxxxxxx
> Cc: Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: David Matlack <dmatlack@xxxxxxxxxx>
> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> ---
>  Documentation/virtual/kvm/mmu.txt | 14 ++++++++++++++
>  arch/x86/kvm/mmu.c                | 20 ++++++++++++--------
>  virt/kvm/kvm_main.c               | 23 ++++++++++++++++-------
>  3 files changed, 42 insertions(+), 15 deletions(-)
>
> diff --git a/Documentation/virtual/kvm/mmu.txt b/Documentation/virtual/kvm/mmu.txt
> index 290894176142..53838d9c6295 100644
> --- a/Documentation/virtual/kvm/mmu.txt
> +++ b/Documentation/virtual/kvm/mmu.txt
> @@ -425,6 +425,20 @@ fault through the slow path.
>  Since only 19 bits are used to store generation-number on mmio spte, all
>  pages are zapped when there is an overflow.
>
> +Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
> +times, the last one happening when the generation number is retrieved and
> +stored into the MMIO spte. Thus, the MMIO spte might be created based on
> +out-of-date information, but with an up-to-date generation number.
> +
> +To avoid this, the generation number is incremented again after synchronize_srcu
> +returns; thus, the low bit of kvm_memslots(kvm)->generation is only 1 during a
> +memslot update, while some SRCU readers might be using the old copy. We do not
> +want to use an MMIO sptes created with an odd generation number, and we can do
> +this without losing a bit in the MMIO spte. The low bit of the generation
> +is not stored in MMIO spte, and presumed zero when it is extracted out of the
> +spte. If KVM is unlucky and creates an MMIO spte while the low bit is 1,
> +the next access to the spte will always be a cache miss.
> +
>
>  Further reading
>  ===============
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 323c3f5f5c84..96515957ba82 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -199,16 +199,20 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask)
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>
>  /*
> - * spte bits of bit 3 ~ bit 11 are used as low 9 bits of generation number,
> - * the bits of bits 52 ~ bit 61 are used as high 10 bits of generation
> - * number.
> + * the low bit of the generation number is always presumed to be zero.
> + * This disables mmio caching during memslot updates. The concept is
> + * similar to a seqcount but instead of retrying the access we just punt
> + * and ignore the cache.
> + *
> + * spte bits 3-11 are used as bits 1-9 of the generation number,
> + * the bits 52-61 are used as bits 10-19 of the generation number.
>   */
> -#define MMIO_SPTE_GEN_LOW_SHIFT		3
> +#define MMIO_SPTE_GEN_LOW_SHIFT		2
>  #define MMIO_SPTE_GEN_HIGH_SHIFT	52
>
> -#define MMIO_GEN_SHIFT			19
> -#define MMIO_GEN_LOW_SHIFT		9
> -#define MMIO_GEN_LOW_MASK		((1 << MMIO_GEN_LOW_SHIFT) - 1)
> +#define MMIO_GEN_SHIFT			20
> +#define MMIO_GEN_LOW_SHIFT		10
> +#define MMIO_GEN_LOW_MASK		((1 << MMIO_GEN_LOW_SHIFT) - 2)
>  #define MMIO_GEN_MASK			((1 << MMIO_GEN_SHIFT) - 1)
>  #define MMIO_MAX_GEN			((1 << MMIO_GEN_SHIFT) - 1)
>
> @@ -4428,7 +4432,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm)
>  	 * The very rare case: if the generation-number is round,
>  	 * zap all shadow pages.
>  	 */
> -	if (unlikely(kvm_current_mmio_generation(kvm) >= MMIO_MAX_GEN)) {
> +	if (unlikely(kvm_current_mmio_generation(kvm) == 0)) {

This should be in patch 1/3.
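Also, since the bit arithmetic above is easy to get wrong, here is a small standalone sketch of what the new constants do (plain user-space C, not the kernel code itself; pack_gen/unpack_gen are made-up names for illustration). Generation bit 0 is never stored, bits 1-9 land in spte bits 3-11 and bits 10-19 in spte bits 52-61, so an odd generation reads back as the even value just below it and the generation check can never match it:

#include <stdint.h>
#include <stdio.h>

/* Constants as introduced by the hunk above. */
#define MMIO_SPTE_GEN_LOW_SHIFT		2
#define MMIO_SPTE_GEN_HIGH_SHIFT	52
#define MMIO_GEN_SHIFT			20
#define MMIO_GEN_LOW_SHIFT		10
#define MMIO_GEN_LOW_MASK		((1 << MMIO_GEN_LOW_SHIFT) - 2)
#define MMIO_GEN_MASK			((1 << MMIO_GEN_SHIFT) - 1)

/* Pack a generation into the spte bits; bit 0 is dropped by the mask. */
static uint64_t pack_gen(unsigned int gen)
{
	uint64_t spte;

	spte  = (uint64_t)(gen & MMIO_GEN_LOW_MASK) << MMIO_SPTE_GEN_LOW_SHIFT;
	spte |= (uint64_t)(gen >> MMIO_GEN_LOW_SHIFT) << MMIO_SPTE_GEN_HIGH_SHIFT;
	return spte;
}

/* Recover the generation; bit 0 is presumed zero. */
static unsigned int unpack_gen(uint64_t spte)
{
	unsigned int gen;

	gen  = (spte >> MMIO_SPTE_GEN_LOW_SHIFT) & MMIO_GEN_LOW_MASK;
	gen |= (unsigned int)(spte >> MMIO_SPTE_GEN_HIGH_SHIFT) << MMIO_GEN_LOW_SHIFT;
	gen &= MMIO_GEN_MASK;
	return gen;
}

int main(void)
{
	/* An even generation round-trips exactly... */
	printf("%#x\n", unpack_gen(pack_gen(0x2468a)));	/* 0x2468a */
	/* ...an odd one (memslot update in flight) comes back with
	 * bit 0 cleared, so a comparison against it always misses. */
	printf("%#x\n", unpack_gen(pack_gen(0x2468b)));	/* 0x2468a */
	return 0;
}

Both printf calls should print 0x2468a: the even generation round-trips exactly, while the odd one is silently rounded down and can never equal the live (odd) generation.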
>  		printk_ratelimited(KERN_INFO "kvm: zapping shadow pages for mmio generation wraparound\n");
>  		kvm_mmu_invalidate_zap_all_pages(kvm);
>  	}
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0bfdb673db26..bb8641b5d83b 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -95,8 +95,6 @@ static int hardware_enable_all(void);
>  static void hardware_disable_all(void);
>
>  static void kvm_io_bus_destroy(struct kvm_io_bus *bus);
> -static void update_memslots(struct kvm_memslots *slots,
> -			    struct kvm_memory_slot *new, u64 last_generation);
>
>  static void kvm_release_pfn_dirty(pfn_t pfn);
>  static void mark_page_dirty_in_slot(struct kvm *kvm,
> @@ -695,8 +693,7 @@ static void sort_memslots(struct kvm_memslots *slots)
>  }
>
>  static void update_memslots(struct kvm_memslots *slots,
> -			    struct kvm_memory_slot *new,
> -			    u64 last_generation)
> +			    struct kvm_memory_slot *new)
>  {
>  	if (new) {
>  		int id = new->id;
>
> @@ -707,8 +704,6 @@ static void update_memslots(struct kvm_memslots *slots,
>  		if (new->npages != npages)
>  			sort_memslots(slots);
>  	}
> -
> -	slots->generation = last_generation + 1;
>  }
>
>  static int check_memory_region_flags(struct kvm_userspace_memory_region *mem)
> @@ -730,10 +725,24 @@ static struct kvm_memslots *install_new_memslots(struct kvm *kvm,
>  {
>  	struct kvm_memslots *old_memslots = kvm->memslots;
>
> -	update_memslots(slots, new, kvm->memslots->generation);
> +	/*
> +	 * Set the low bit in the generation, which disables SPTE caching
> +	 * until the end of synchronize_srcu_expedited.
> +	 */
> +	WARN_ON(old_memslots->generation & 1);
> +	slots->generation = old_memslots->generation + 1;
> +
> +	update_memslots(slots, new);
>  	rcu_assign_pointer(kvm->memslots, slots);
>  	synchronize_srcu_expedited(&kvm->srcu);
>
> +	/*
> +	 * Increment the new memslot generation a second time. This prevents
> +	 * vm exits that race with memslot updates from caching a memslot
> +	 * generation that will (potentially) be valid forever.
> +	 */
> +	slots->generation++;
> +
>  	kvm_arch_memslots_updated(kvm);
>
>  	return old_memslots;
> --
> 1.8.3.1
>
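For anyone who wants to convince themselves of the ordering argument without reading through kvm_main.c, here is a tiny single-threaded model of the protocol install_new_memslots now follows (plain C, invented names such as memslots_gen, make_tag and tag_is_current; not the KVM code): bump the generation to an odd value, publish and synchronize, then bump it again, so any generation a racing reader could have sampled is stale by the time the update finishes.

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the generation protocol.  memslots_gen stands in for
 * kvm_memslots(kvm)->generation and "tag" for the generation stored in
 * an MMIO spte; as in the patch, bit 0 is never stored in the tag.
 */
static unsigned int memslots_gen;	/* even except during an update */

/* What a vcpu stores in its cache: the generation with bit 0 dropped. */
static unsigned int make_tag(unsigned int gen)
{
	return gen & ~1u;
}

/* A cached entry is usable only if its tag matches the live generation. */
static bool tag_is_current(unsigned int tag)
{
	return tag == memslots_gen;
}

int main(void)
{
	unsigned int racy_tag;

	memslots_gen = 4;

	/* Writer: first increment, the generation goes odd (5). */
	memslots_gen++;

	/* Racing reader (vm exit): samples the odd generation and tags
	 * its cache entry; bit 0 is dropped, so the tag is 4. */
	racy_tag = make_tag(memslots_gen);

	/* Writer: publishing the new memslots and
	 * synchronize_srcu_expedited() would happen here, then the
	 * second increment makes the generation 6. */
	memslots_gen++;

	/* The stale tag no longer matches, so the entry is never used. */
	printf("stale tag still valid? %s\n",
	       tag_is_current(racy_tag) ? "yes" : "no");	/* "no" */
	return 0;
}

The same reasoning is why the low bit can simply be presumed zero on the spte side: a tag taken from the odd generation is indistinguishable from one taken just before the update, and both fail the comparison once the second increment lands.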