Re: [PATCH v2] kvm/selftests: Fix race condition with dirty_log_test

Peter Xu <peterx@xxxxxxxxxx> · Sat, 17 Apr 2021 10:09:56 -0400

Paolo,

Please hold-on with this patch since I got another report that this patch can
still trigger test failure if even heavier workload (e.g., "taskset -c 0
./dirty_log_test" plus another one or multiple "taskset -c 0 while :; do:;
done").  So I plan to do it in another way.  I tested longer yesterday but
haven't updated this patch yet.  More below.

On Sat, Apr 17, 2021 at 02:59:48PM +0200, Paolo Bonzini wrote:
> On 13/04/21 23:36, Peter Xu wrote:
> > This patch closes this race by allowing the main thread to give the vcpu thread
> > chance to do a VMENTER to complete that write operation.  It's done by adding a
> > vcpu loop counter (must be defined as volatile as main thread will do read
> > loop), then the main thread can guarantee the vcpu got at least another VMENTER
> > by making sure the guest_vcpu_loops increases by 2.
> > 
> > Dirty ring does not need this since dirty_ring_last_page would already help
> > avoid this specific race condition.
> 
> Just a nit, the comment and commit message should mention KVM_RUN rather
> than vmentry; it's possible to be preempted many times in vcpu_enter_guest
> without making progress, but those wouldn't return to userspace and thus
> would not update guest_vcpu_loops.

But what I really wanted to emphasize is the vmentry point rather than KVM_RUN,
e.g., KVM_RUN can return without an vmentry, while the vmentry is the exactly
point that data will be flushed.

> 
> Also, volatile is considered harmful even in userspace/test code[1].
> Technically rather than volatile one should use an atomic load (even a
> relaxed one), but in practice it's okay to use volatile too *for this
> specific use* (READ_ONCE/WRITE_ONCE are volatile reads and writes as well).
> If the selftests gained 32-bit support, one should not use volatile because
> neither reads or writes to uint64_t variables would be guaranteed to be
> atomic.

Indeed!  I'll start to use atomics.

Regarding why this patch won't really solve all race conditions... The problem
is I think one guest memory write operation (of this specific test) contains a
few micro-steps when page is during kvm dirty tracking (here I'm only
considering write-protect rather than pml but pml should be similar at least
when the log buffer is full):

  (1) Guest read 'iteration' number into register, prepare to write, page fault
  (2) Set dirty bit in either dirty bitmap or dirty ring
  (3) Return to guest, data written

When we verify the data, we assumed that all these steps are "atomic", say,
when (1) happened for this page, we assume (2) & (3) must have happened.  We
had some trick to workaround "un-atomicity" of above three steps, as this patch
wanted to fix atomicity of step (2)+(3) by explicitly letting the main thread
wait for at least one vmenter of vcpu thread, which should work.  However what
I overlooked is probably that we still have race when (1) and (2) can be
interrupted.

As an example of how step (1) and (2) got interrupted, I simply tried to trace
kvm_vcpu_mark_page_dirty() and dump stack for vmexit cases, then we can see at
least a bunch of cases where vcpu can be scheduled out even before setting the
dirty bit:

@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    kmem_cache_alloc+583
    kvm_mmu_topup_memory_cache+33
    direct_page_fault+237
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    down_read+14
    get_user_pages_unlocked+90
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 23
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __alloc_pages+663
    alloc_pages_vma+128
    wp_page_copy+773
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 1406
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    hva_to_pfn+157
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 2579
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    mmu_notifier_invalidate_range_start+9
    wp_page_copy+296
    __handle_mm_fault+3155
    handle_mm_fault+151
    __get_user_pages+664
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 3309
@out[
    __schedule+1742
    __schedule+1742
    __cond_resched+52
    __get_user_pages+530
    get_user_pages_unlocked+197
    hva_to_pfn+206
    try_async_pf+132
    direct_page_fault+320
    kvm_mmu_page_fault+103
    vmx_handle_exit+288
    vcpu_enter_guest+2460
    kvm_arch_vcpu_ioctl_run+325
    kvm_vcpu_ioctl+526
    __x64_sys_ioctl+131
    do_syscall_64+51
    entry_SYSCALL_64_after_hwframe+68
]: 4499

It means... it can always happen that the vcpu reads a very old "iteration"
value in step 1 and it doesn't set dirty bit (step 2) or write it to memory
(step 3) until, say, 1 year later.. :) Then the verify won't pass since the
main thread iteration has been much newer, then main thread shouts at us.

So far I don't see an easy way to guarantee all steps 1-3 atomicity (as this
patch only achieved steps 2-3), but to sync at the GUEST_SYNC() point of guest
code when we do verification of the dirty bits.  Drew mentioned something like
this previously in the bugzilla, I wanted to give it a shot with a lighter
sync, but seems not working.

Paolo, Drew, Sean - feel free to shoot if any of you have a better idea.

As I mentioned I tested v2 of this patch and so far no issue found.  I'll post
it later today, so maybe we can continue discuss there too (btw, I also found
another signal race there; so I'll post a series with 2 patches).

Thanks,

-- 
Peter Xu