Re: [PATCH v2 2/3] KVM: selftests: Randomize which pages are written vs read.

On Thu, Sep 8, 2022 at 12:46 PM Colton Lewis <coltonlewis@xxxxxxxxxx> wrote:
>
> David Matlack <dmatlack@xxxxxxxxxx> writes:
>
> > On Tue, Aug 30, 2022 at 07:02:10PM +0000, Colton Lewis wrote:
> >> David Matlack <dmatlack@xxxxxxxxxx> writes:
>
> >> > On Wed, Aug 17, 2022 at 09:41:45PM +0000, Colton Lewis wrote:
> >> > > Randomize which pages are written vs read using the random number
> >> > > arrays. Change the variable wr_fract and associated function calls
> >> > > to write_percent, which operates as a percentage from 0 to 100,
> >> > > where X means each page has an X% chance of being written. Change
> >> > > the -f argument to -w to reflect the new variable semantics. Keep
> >> > > the same default of 100 percent writes.
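
To spell out the semantics for anyone skimming: with write_percent = W,
each page access comes down to a weighted coin flip, roughly like the
sketch below (illustrative names, not the actual patch code):

#include <stdint.h>

/*
 * Sketch of the write_percent semantics (0-100): each page access has
 * a write_percent% chance of being a write and is otherwise a read.
 */
void access_page(uint64_t *page, uint32_t rand_val, uint32_t write_percent)
{
	if (rand_val % 100 < write_percent)
		*page = rand_val;			/* write: dirties the page */
	else
		(void)*(volatile uint64_t *)page;	/* read: page stays clean */
}
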
>
> >> > Doesn't the new option cause something like a 1000x slowdown in
> >> > "Dirty memory time"? I don't think we should merge this until that
> >> > is understood and addressed (and it should at least be called out
> >> > here so that reviewers are aware).
>
>
> >> I'm guessing you got that from my internally posted tests. This option
> >> itself does not cause the slowdown. If it is set to 0% or 100% (the
> >> default), there is no slowdown at all. The slowdown I measured was at
> >> 50%, probably because an equal chance of a read or a write on each
> >> access makes branch prediction impossible. This is a good thing: it's
> >> much more realistic than predictably alternating reads and writes.
>
> > I found it hard to believe that branch prediction could affect
> > performance by 1000x (and why wouldn't random access order show the
> > same effect?), so I looked into it further.
>
> > The cause of the slowdown is actually MMU lock contention:
>
> > -   82.62%  [k] queued_spin_lock_slowpath
> >     - 82.09% queued_spin_lock_slowpath
> >        - 48.36% queued_write_lock_slowpath
> >           - _raw_write_lock
> >              - 22.18% kvm_mmu_notifier_invalidate_range_start
> >                   __mmu_notifier_invalidate_range_start
> >                   wp_page_copy
> >                   do_wp_page
> >                   __handle_mm_fault
> >                   handle_mm_fault
> >                   __get_user_pages
> >                   get_user_pages_unlocked
> >                   hva_to_pfn
> >                   __gfn_to_pfn_memslot
> >                   kvm_faultin_pfn
> >                   direct_page_fault
> >                   kvm_tdp_page_fault
> >                   kvm_mmu_page_fault
> >                   handle_ept_violation
>
> > I think the bug is due to the following:
>
> >   1. Randomized reads/writes were being applied to the Populate phase,
> >      which (when using anonymous memory) resulted in the guest memory
> >      being mapped to the Zero Page.
> >   2. The random access order changed across each iteration (Population
> >      phase included), which means that some pages were written to for
> >      the first time during each iteration. Those pages resulted in a
> >      copy-on-write in the host MM fault handler, which invokes the
> >      invalidate range notifier and acquires the MMU lock in write-mode.
> >   3. All vCPUs are doing this in parallel, which results in a ton of
> >      lock contention.
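
Point 1 is easy to reproduce in isolation with plain anonymous memory,
outside of KVM entirely; a minimal sketch (error handling omitted):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	/* Anonymous private mapping: no physical page is allocated yet. */
	char *mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* A read fault maps the shared zero page, read-only. */
	char first = mem[0];

	/*
	 * The first write then takes a copy-on-write fault (wp_page_copy
	 * in the trace above), which fires the invalidate range notifiers.
	 */
	mem[0] = 1;

	printf("before=%d after=%d\n", first, mem[0]);
	return 0;
}
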
>
> > Your internal test results also showed that performance improved with
> > each iteration. That's because more and more of the guest memory had
> > been faulted in by each successive iteration (fewer and fewer
> > copy-on-write faults that need to acquire the MMU lock in write-mode).
>
>
> Thanks for the analysis, David. I had wondered about the effects of
> randomized reads/writes during the populate phase.
>
> > The proper fix for v4 would be to set write-percent to 100 during the
> > populate phase so all memory actually gets populated. Then only use the
> > provided write-percent for testing dirty logging.
>
>
> Will do that.

Sounds good, thanks. I do think this is an interesting discovery,
though, one we wouldn't have known about without your random access
series. So I am somewhat tempted to say let's leave the behavior as
is. But doing that would end up conflating two different tests:
dirty_log_perf_test should really isolate the performance of dirty
logging, independent of memory population.
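
Concretely, I'd expect the v4 fix to be shaped roughly like the sketch
below. The helper names are hypothetical stand-ins for the selftest
plumbing, just to show the split between the two phases:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for the selftest plumbing. */
static uint32_t write_percent;

static void set_write_percent(uint32_t percent)
{
	write_percent = percent;
}

static void run_iteration(const char *phase)
{
	printf("%s iteration: write_percent=%u\n", phase, write_percent);
}

int main(void)
{
	uint32_t user_write_percent = 50;	/* e.g. from -w 50 */

	/* Populate with 100% writes so every page is CoW-resolved up front. */
	set_write_percent(100);
	run_iteration("populate");

	/* Only the dirty logging iterations use the user-provided value. */
	set_write_percent(user_write_percent);
	run_iteration("dirty log");
	return 0;
}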

It would be interesting, though, to create a generalized memory access
test that randomizes reads, writes, and even execution (to exercise NX
Huge Pages) so that we do have a test that exercises the copy-on-write
behavior. But I think that falls outside the scope of your series.
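
Per page, such a test would presumably pick an access type with
configured weights, along these lines (hypothetical sketch, not
existing selftest code):

#include <stdint.h>

enum access_type { ACCESS_READ, ACCESS_WRITE, ACCESS_EXEC };

/*
 * Pick an access type per page with configured weights; anything not
 * covered by the write/exec percentages falls through to a read.
 */
enum access_type pick_access(uint32_t rand_val, uint32_t write_pct,
			     uint32_t exec_pct)
{
	uint32_t r = rand_val % 100;

	if (r < write_pct)
		return ACCESS_WRITE;	/* dirties the page */
	if (r < write_pct + exec_pct)
		return ACCESS_EXEC;	/* exercises NX huge page recovery */
	return ACCESS_READ;
}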


