On Sat, Mar 03, 2012 at 02:21:48PM +0900, Takuya Yoshikawa wrote:
> We have seen some problems of the current implementation of
> get_dirty_log() which uses synchronize_srcu_expedited() for updating
> dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
> order of latency when we use VGA displays.
> 
> Furthermore the recent discussion on the following thread
> "srcu: Implement call_srcu()"
> http://lkml.org/lkml/2012/1/31/211
> also motivated us to implement get_dirty_log() without SRCU.
> 
> This patch achieves this goal without sacrificing the performance of
> both VGA and live migration: in practice the new code is much faster
> than the old one unless we have too many dirty pages.
> 
> Implementation:
> 
> The key part of the implementation is the use of xchg() operation for
> clearing dirty bits atomically. Since this allows us to update only
> BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
> until every dirty bit is cleared again for the next call.
> 
> Although some people may worry about the problem of using the atomic
> memory instruction many times to the concurrently accessible bitmap,
> it is usually accessed with mmu_lock held and we rarely see concurrent
> accesses: so what we need to care about is the pure xchg() overheads.
> 
> Another point to note is that we do not use for_each_set_bit() to check
> which ones in each BITS_PER_LONG pages are actually dirty. Instead we
> simply use __ffs() in a loop. This is much faster than repeatedly call
> find_next_bit().
> 
> Performance:
> 
> The dirty-log-perf unit test showed nice improvements, some times faster
> than before, except for some extreme cases; for such cases the speed of
> getting dirty page information is much faster than we process it in the
> userspace.
> 
> For real workloads, both VGA and live migration, we have observed pure
> improvements: when the guest was reading a file during live migration,
> we originally saw a few ms of latency, but with the new method the
> latency was less than 200us.
> 
> Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@xxxxxxxxxxxxx>
> ---
>  arch/x86/kvm/x86.c |  116 +++++++++++++++++++--------------------------------
>  1 files changed, 43 insertions(+), 73 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3bc1922..0748bab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3041,55 +3041,32 @@ static int kvm_vm_ioctl_reinject(struct kvm *kvm,
>  }
> 
>  /**
> - * write_protect_slot - write protect a slot for dirty logging
> - * @kvm: the kvm instance
> - * @memslot: the slot we protect
> - * @dirty_bitmap: the bitmap indicating which pages are dirty
> - * @nr_dirty_pages: the number of dirty pages
> + * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
> + * @kvm: kvm instance
> + * @log: slot id and address to which we copy the log
>   *
> - * We have two ways to find all sptes to protect:
> - * 1. Use kvm_mmu_slot_remove_write_access() which walks all shadow pages and
> - *    checks ones that have a spte mapping a page in the slot.
> - * 2. Use kvm_mmu_rmap_write_protect() for each gfn found in the bitmap.
> + * We need to keep it in mind that VCPU threads can write to the bitmap
> + * concurrently. So, to avoid losing data, we keep the following order for
> + * each bit:
>   *
> - * Generally speaking, if there are not so many dirty pages compared to the
> - * number of shadow pages, we should use the latter.
> + *   1. Take a snapshot of the bit and clear it if needed.
> + *   2. Write protect the corresponding page.
> + *   3. Flush TLB's if needed.
> + *   4. Copy the snapshot to the userspace.
>   *
> - * Note that letting others write into a page marked dirty in the old bitmap
> - * by using the remaining tlb entry is not a problem. That page will become
> - * write protected again when we flush the tlb and then be reported dirty to
> - * the user space by copying the old bitmap.
> + * Between 2 and 3, the guest may write to the page using the remaining TLB
> + * entry. This is not a problem because the page will be reported dirty at
> + * step 4 using the snapshot taken before and step 3 ensures that successive
> + * writes will be logged for the next call.
>   */
> -static void write_protect_slot(struct kvm *kvm,
> -			       struct kvm_memory_slot *memslot,
> -			       unsigned long *dirty_bitmap,
> -			       unsigned long nr_dirty_pages)
> -{
> -	spin_lock(&kvm->mmu_lock);
> -
> -	/* Not many dirty pages compared to # of shadow pages. */
> -	if (nr_dirty_pages < kvm->arch.n_used_mmu_pages) {
> -		gfn_t offset;
> -
> -		for_each_set_bit(offset, dirty_bitmap, memslot->npages)
> -			kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, 1);
> -
> -		kvm_flush_remote_tlbs(kvm);
> -	} else
> -		kvm_mmu_slot_remove_write_access(kvm, memslot->id);
> -
> -	spin_unlock(&kvm->mmu_lock);
> -}
> -
> -/*
> - * Get (and clear) the dirty memory log for a memory slot.
> - */
> -int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
> -			       struct kvm_dirty_log *log)
> +int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
>  	int r;
>  	struct kvm_memory_slot *memslot;
> -	unsigned long n, nr_dirty_pages;
> +	unsigned long n, i;
> +	unsigned long *dirty_bitmap;
> +	unsigned long *dirty_bitmap_buffer;
> +	bool is_dirty = false;
> 
>  	mutex_lock(&kvm->slots_lock);
> 
> @@ -3098,49 +3075,42 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
>  		goto out;
> 
>  	memslot = id_to_memslot(kvm->memslots, log->slot);
> +
> +	dirty_bitmap = memslot->dirty_bitmap;
>  	r = -ENOENT;
> -	if (!memslot->dirty_bitmap)
> +	if (!dirty_bitmap)
>  		goto out;
> 
>  	n = kvm_dirty_bitmap_bytes(memslot);
> -	nr_dirty_pages = memslot->nr_dirty_pages;
> 
> -	/* If nothing is dirty, don't bother messing with page tables. */
> -	if (nr_dirty_pages) {
> -		struct kvm_memslots *slots, *old_slots;
> -		unsigned long *dirty_bitmap, *dirty_bitmap_head;
> +	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> +	memset(dirty_bitmap_buffer, 0, n);
> 
> -		dirty_bitmap = memslot->dirty_bitmap;
> -		dirty_bitmap_head = memslot->dirty_bitmap_head;
> -		if (dirty_bitmap == dirty_bitmap_head)
> -			dirty_bitmap_head += n / sizeof(long);
> -		memset(dirty_bitmap_head, 0, n);
> +	spin_lock(&kvm->mmu_lock);

It is not clear why mmu_lock is needed. Dropping it across the xchg loop
would be similar to the srcu implementation, in that concurrent updates
would only be visible on the next get_dirty call?

Well, it is necessary anyway for write protecting the sptes.

A cond_resched_lock() would alleviate the potentially long hold times on
mmu_lock (can you measure it with large memslots?).

Otherwise looks nice.
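Something along these lines is what I have in mind, i.e. dropping and
retaking the lock at the end of each iteration of the bitmap loop
(untested sketch on top of this patch; the comment is only my reading of
why it would be safe):

		offset = i * BITS_PER_LONG;
		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);

		/*
		 * Writes through stale TLB entries to pages processed
		 * earlier in this loop are still covered by the snapshot
		 * we copy to userspace, so dropping mmu_lock here when
		 * it is contended should be safe.
		 */
		cond_resched_lock(&kvm->mmu_lock);
	}

The TLB flush would still be done once after the loop, so this only bounds
how long mmu_lock is held, not how often we flush.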
> 
> -		r = -ENOMEM;
> -		slots = kmemdup(kvm->memslots, sizeof(*kvm->memslots), GFP_KERNEL);
> -		if (!slots)
> -			goto out;
> +	for (i = 0; i < n / sizeof(long); i++) {
> +		unsigned long mask;
> +		gfn_t offset;
> 
> -		memslot = id_to_memslot(slots, log->slot);
> -		memslot->nr_dirty_pages = 0;
> -		memslot->dirty_bitmap = dirty_bitmap_head;
> -		update_memslots(slots, NULL);
> +		if (!dirty_bitmap[i])
> +			continue;
> 
> -		old_slots = kvm->memslots;
> -		rcu_assign_pointer(kvm->memslots, slots);
> -		synchronize_srcu_expedited(&kvm->srcu);
> -		kfree(old_slots);
> +		is_dirty = true;
> 
> -		write_protect_slot(kvm, memslot, dirty_bitmap, nr_dirty_pages);
> +		mask = xchg(&dirty_bitmap[i], 0);
> +		dirty_bitmap_buffer[i] = mask;
> 
> -		r = -EFAULT;
> -		if (copy_to_user(log->dirty_bitmap, dirty_bitmap, n))
> -			goto out;
> -	} else {
> -		r = -EFAULT;
> -		if (clear_user(log->dirty_bitmap, n))
> -			goto out;
> +		offset = i * BITS_PER_LONG;
> +		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset, mask);
>  	}
> +	if (is_dirty)
> +		kvm_flush_remote_tlbs(kvm);
> +
> +	spin_unlock(&kvm->mmu_lock);
> +
> +	r = -EFAULT;
> +	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
> +		goto out;
> 
>  	r = 0;
>  out:
> -- 
> 1.7.5.4
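As an aside on the changelog: the "__ffs() in a loop" pattern mentioned
above is, as I read it, roughly the following (sketch only; the per-gfn
write protection is elided and the variable names are illustrative):

	while (mask) {
		/* index of the lowest set bit, i.e. the next dirty page */
		gfn_t gfn = gfn_offset + __ffs(mask);

		/* ... write protect the sptes mapping 'gfn' ... */

		mask &= mask - 1;	/* clear the lowest set bit */
	}

Each iteration jumps straight to the next set bit, which avoids the call
and rescan overhead of repeated find_next_bit() invocations.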