On Thu, Sep 8, 2022 at 12:46 PM Colton Lewis <coltonlewis@xxxxxxxxxx> wrote:
>
> David Matlack <dmatlack@xxxxxxxxxx> writes:
>
> > On Tue, Aug 30, 2022 at 07:02:10PM +0000, Colton Lewis wrote:
> >> David Matlack <dmatlack@xxxxxxxxxx> writes:
> >
> >> > On Wed, Aug 17, 2022 at 09:41:45PM +0000, Colton Lewis wrote:
> >> > > Randomize which tables are written vs read using the random number
> >> > > arrays. Change the variable wr_fract and associated function calls to
> >> > > write_percent that now operates as a percentage from 0 to 100 where X
> >> > > means each page has an X% chance of being written. Change the -f
> >> > > argument to -w to reflect the new variable semantics. Keep the same
> >> > > default of 100 percent writes.
> >
> >> > Doesn't the new option cause like a 1000x slowdown in "Dirty memory
> >> > time"? I don't think we should merge this until that is understood and
> >> > addressed (and it should be at least called out here so that reviewers
> >> > can be made aware).
> >
> >> I'm guessing you got that from my internally posted tests. This option
> >> itself does not cause the slowdown. If this option is set to 0% or 100%
> >> (the default), there is no slowdown at all. The slowdown I measured was
> >> at 50%, probably because that makes branch prediction impossible, since
> >> there is an equal chance of doing a read or a write each time. This is
> >> a good thing. It's much more realistic than predictably alternating
> >> reads and writes.
>
> > I found it hard to believe that branch prediction could affect
> > performance by 1000x (and why wouldn't random access order show the
> > same effect?), so I looked into it further.
>
> > The cause of the slowdown is actually MMU lock contention:
>
> >   - 82.62% [k] queued_spin_lock_slowpath
> >      - 82.09% queued_spin_lock_slowpath
> >         - 48.36% queued_write_lock_slowpath
> >            - _raw_write_lock
> >               - 22.18% kvm_mmu_notifier_invalidate_range_start
> >                    __mmu_notifier_invalidate_range_start
> >                    wp_page_copy
> >                    do_wp_page
> >                    __handle_mm_fault
> >                    handle_mm_fault
> >                    __get_user_pages
> >                    get_user_pages_unlocked
> >                    hva_to_pfn
> >                    __gfn_to_pfn_memslot
> >                    kvm_faultin_pfn
> >                    direct_page_fault
> >                    kvm_tdp_page_fault
> >                    kvm_mmu_page_fault
> >                    handle_ept_violation
>
> > I think the bug is due to the following:
>
> > 1. Randomized reads/writes were being applied to the Populate phase,
> >    which (when using anonymous memory) results in the guest memory
> >    being mapped to the Zero Page.
> > 2. The random access order changed across each iteration (Populate
> >    phase included), which means that during each iteration some pages
> >    were written to for the first time. Those pages resulted in a
> >    copy-on-write in the host MM fault handler, which invokes the
> >    invalidate range notifier and acquires the MMU lock in write-mode.
> > 3. All vCPUs are doing this in parallel, which results in a ton of
> >    lock contention.
>
> > Your internal test results also showed that performance got better
> > during each iteration. That's because more and more of the guest
> > memory has been faulted in during each iteration (fewer and fewer
> > copy-on-write faults that need to acquire the MMU lock in write-mode).
>
> Thanks for the analysis, David. I had wondered about the effects of
> randomized reads/writes during the populate phase.
>
> > The proper fix for v4 would be to set write-percent to 100 during the
> > populate phase so all memory actually gets populated. Then only use
> > the provided write-percent for testing dirty logging.
>
> Will do that.

Sounds good, thanks.
I do think this is an interesting discovery, though, that we wouldn't have
known about without your random access series. So I am somewhat tempted to
say let's leave the behavior as-is. But doing that would end up conflating
two different tests: dirty_log_perf_test should really isolate the
performance of dirty logging, independent of memory population.

It would be interesting, though, to create a generalized memory access test
that randomizes reads, writes, and even execution (to exercise NX Huge
Pages), so that we do have a test that exercises the copy-on-write
behavior. But I think that falls outside the scope of your series.