On Thu, Dec 05, 2019 at 08:59:33PM +0100, Paolo Bonzini wrote:
> On 05/12/19 20:30, Peter Xu wrote:
> >> Try enabling kvmmmu tracepoints too, it will tell
> >> you more of the path that was taken while processing the EPT violation.
> >
> > These new tracepoints are extremely useful (which I didn't notice
> > before).
>
> Yes, they are!

(I forgot to say thanks for teaching me that! :)

> > So here's the final culprit...
> >
> > void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
> > {
> > 	...
> > 	spin_lock(&kvm->mmu_lock);
> > 	/* FIXME: we should use a single AND operation, but there is no
> > 	 * applicable atomic API.
> > 	 */
> > 	while (mask) {
> > 		clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
> > 		mask &= mask - 1;
> > 	}
> >
> > 	kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
> > 	spin_unlock(&kvm->mmu_lock);
> > }
> >
> > The mask is cleared before reaching
> > kvm_arch_mmu_enable_log_dirty_pt_masked()..
>
> I'm not sure why that results in two vmexits?  (clearing before
> kvm_arch_mmu_enable_log_dirty_pt_masked is also what
> KVM_{GET,CLEAR}_DIRTY_LOG does).

Sorry, my fault for not being clear on this.

The kvm_arch_mmu_enable_log_dirty_pt_masked() issue only explains why
the same page is not written again after the ring-full userspace exit
(which is what caused the real dirty bit to go missing): because the
write bit is not removed during KVM_RESET_DIRTY_RINGS, the next
vmenter will write to the previous page directly, without a vmexit.

The two vmexits are another story - I tracked it down to the fault
being retried because mmu_notifier_seq has changed, hence it goes
through this path:

	if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
		goto out_unlock;

That's because in try_async_pf() we do a writable page fault, which
probably triggers both the invalidate_range_end and change_pte
notifiers.  A reference trace with EPT enabled:

            kvm_mmu_notifier_change_pte+1
            __mmu_notifier_change_pte+82
            wp_page_copy+1907
            do_wp_page+478
            __handle_mm_fault+3395
            handle_mm_fault+196
            __get_user_pages+681
            get_user_pages_unlocked+172
            __gfn_to_pfn_memslot+290
            try_async_pf+141
            tdp_page_fault+326
            kvm_mmu_page_fault+115
            kvm_arch_vcpu_ioctl_run+2675
            kvm_vcpu_ioctl+536
            do_vfs_ioctl+1029
            ksys_ioctl+94
            __x64_sys_ioctl+22
            do_syscall_64+91

I'm not sure whether that's ideal, but it makes sense to me.

> > The funny thing is that I did have a few more patches to even skip
> > allocating the dirty_bitmap when dirty ring is enabled (hence in that
> > tree I removed this while loop too, so that has no such problem).
> > However I dropped those patches when I posted the RFC because I didn't
> > think they were mature, and the selftest didn't complain about that
> > either..  Though, I do plan to redo that in v2 if you don't disagree.
> > The major question would be whether the dirty_bitmap could still be
> > of any use if dirty ring is enabled.
>
> Userspace may want a dirty bitmap in addition to a list (for example:
> list for migration, bitmap for framebuffer update), but it can also do a
> pass over the dirty rings in order to update an internal bitmap.
>
> So I think it makes sense to make it either one or the other.

OK, then I'll do that.

Thanks,

-- 
Peter Xu
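
A possible fix for the culprit above (only an untested sketch, and it
assumes we keep the dirty_bitmap for now) would be to let the arch hook
see the mask before the while loop consumes it:

	void kvm_reset_dirty_gfn(struct kvm *kvm, u32 slot, u64 offset, u64 mask)
	{
		...
		spin_lock(&kvm->mmu_lock);
		/*
		 * Write-protect the pages while the mask is still intact,
		 * so the next guest write after KVM_RESET_DIRTY_RINGS
		 * faults again and gets reported as dirty.
		 */
		kvm_arch_mmu_enable_log_dirty_pt_masked(kvm, memslot, offset, mask);
		/* FIXME: we should use a single AND operation, but there is no
		 * applicable atomic API.
		 */
		while (mask) {
			clear_bit_le(offset + __ffs(mask), memslot->dirty_bitmap);
			mask &= mask - 1;
		}
		spin_unlock(&kvm->mmu_lock);
	}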
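
For reference on the retry: mmu_notifier_retry() (paraphrased here from
include/linux/kvm_host.h, so take the exact wording with a grain of
salt) only compares the sequence number sampled before the fault with
the current one, so a concurrent change_pte/invalidate_range_end that
bumps mmu_notifier_seq is enough to force the retry:

	static inline int mmu_notifier_retry(struct kvm *kvm, unsigned long mmu_seq)
	{
		if (unlikely(kvm->mmu_notifier_count))
			return 1;
		/*
		 * Either we see a non-zero mmu_notifier_count (a notifier
		 * is still in flight) or the updated mmu_notifier_seq.
		 */
		smp_rmb();
		if (kvm->mmu_notifier_seq != mmu_seq)
			return 1;
		return 0;
	}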
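
And on the last point, the internal bitmap (e.g. for framebuffer
updates) could indeed be fed by a pass over the rings.  Only to
illustrate the idea - the entry below just mirrors the (slot, offset)
pair of the RFC, and the struct/helper names are made up for the
example, not part of any ABI:

	#include <stdint.h>

	/* Illustration only: one harvested dirty-ring entry, i.e. a
	 * (slot, offset) pair where offset is the page index within
	 * the memslot. */
	struct dirty_gfn_entry {
		uint32_t slot;
		uint64_t offset;
	};

	/* Fold one harvested entry into a userspace-owned bitmap; the
	 * same entry can also be queued on a migration list. */
	static void account_dirty_gfn(unsigned long *bitmap,
				      const struct dirty_gfn_entry *e)
	{
		bitmap[e->offset / (8 * sizeof(unsigned long))] |=
			1UL << (e->offset % (8 * sizeof(unsigned long)));
	}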