Hi,

On Sat, Jul 27, 2013 at 07:47:49AM +0000, Zhanghaoyu (A) wrote:
> >> hi all,
> >>
> >> I met a similar problem to these while performing live migration or
> >> save-restore tests on the KVM platform (qemu:1.4.0, host:suse11sp2,
> >> guest:suse11sp2), running a tele-communication software suite in the guest:
> >> https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg00098.html
> >> http://comments.gmane.org/gmane.comp.emulators.kvm.devel/102506
> >> http://thread.gmane.org/gmane.comp.emulators.kvm.devel/100592
> >> https://bugzilla.kernel.org/show_bug.cgi?id=58771
> >>
> >> After live migration or virsh restore [savefile], one process's CPU
> >> utilization went up by about 30%, which resulted in throughput degradation
> >> of this process.
> >>
> >> If EPT is disabled, this problem goes away.
> >>
> >> I suspect that the kvm hypervisor has something to do with this problem.
> >> Based on the above suspicion, I want to find the two adjacent versions of
> >> kvm-kmod of which one triggers this problem and the other does not
> >> (e.g. 2.6.39, 3.0-rc1), and analyze the differences between these two
> >> versions, or apply the patches between these two versions by bisection,
> >> and finally find the key patches.
> >>
> >> Any better ideas?
> >>
> >> Thanks,
> >> Zhang Haoyu
> >
> > I've attempted to duplicate this on a number of machines that are as
> > similar to yours as I am able to get my hands on, and so far have not
> > been able to see any performance degradation. And from what I've read
> > in the above links, huge pages do not seem to be part of the problem.
> >
> > So, if you are in a position to bisect the kernel changes, that would
> > probably be the best avenue to pursue in my opinion.
> >
> > Bruce
>
> By git-bisecting the kvm kernel changes (downloaded from
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git) I found the first bad
> commit ([612819c3c6e67bac8fceaa7cc402f13b1b63f7e4] KVM: propagate fault
> r/w information to gup(), allow read-only memory) which triggers this
> problem.
>
> And,
> git log 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4 -n 1 -p > 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4.log
> git diff 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4~1..612819c3c6e67bac8fceaa7cc402f13b1b63f7e4 > 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4.diff
>
> Then I diffed 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4.log and
> 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4.diff, and came to the conclusion
> that all of the differences between 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4~1
> and 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4 are contributed by none other
> than 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4, so this commit is the
> peace-breaker which directly or indirectly causes the degradation.

Something is generating readonly host ptes for this to make a difference.
Considering that live migration or startup actions are involved, the most
likely culprit is a fork() to start some script or similar. fork() would mark
all the ptes readonly and invalidate the sptes through the mmu notifier. So
then, with all sptes dropped and the whole guest address space mapped
readonly, depending on the app we can sometimes have one vmexit to establish
a readonly spte on the readonly pte, and then another vmexit to execute the
COW at the first write fault that follows. It won't actually run a COW unless
the child is still there (and normally the child does fork() + quick stuff +
exec(), so the child is unlikely to still be there), but it's still 2 vmexits
where before there was just 1 vmexit.
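As an illustration of the effect described above, here is a minimal userspace
sketch (the 64MB anonymous buffer and the fault counting via getrusage() are
made up purely for demonstration): after fork() plus a quick child exit, the
parent's next write to each page still takes a wp/minor fault even though no
copy is needed; inside a guest the same fault is what shows up as the second
vmexit.

/*
 * Illustrative sketch only: show that fork() write-protects the parent's
 * anonymous ptes, so the first write to each page after the fork takes a
 * wp/minor fault even though the child has already exited and no copy is
 * actually needed.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

static long minor_faults(void)
{
	struct rusage ru;
	getrusage(RUSAGE_SELF, &ru);
	return ru.ru_minflt;
}

int main(void)
{
	size_t len = 64UL << 20;		/* arbitrary 64MB test buffer */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 1, len);			/* populate all pages */

	long before = minor_faults();
	memset(buf, 2, len);			/* ptes still writable: ~0 faults */
	printf("writes before fork: %ld minor faults\n", minor_faults() - before);

	pid_t pid = fork();			/* write-protects parent ptes */
	if (pid == 0)
		_exit(0);			/* child: fork + quick exit */
	waitpid(pid, NULL, 0);

	before = minor_faults();
	memset(buf, 3, len);			/* one wp fault per (huge)page */
	printf("writes after fork:  %ld minor faults\n", minor_faults() - before);
	return 0;
}

With THP the after-fork count should be one fault per huge page rather than
per 4k page, but either way the pre-fork writes should report roughly zero
faults and the post-fork writes should not.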
The same overhead should happen for both EPT and no-EPT: there would be two
vmexits in the no-EPT case too, since there's no way the spte can be marked
writable while the host pte is still readonly.

If you get a massive overhead and a CPU loop in host kernel mode, maybe a
global TLB flush is missing that would get rid of the readonly copy of the
spte in the CPU, and all CPUs tend to exit on the same spte at the same time.
Or we may lack the TLB flush even for the current CPU, although we should
really flush them all (in the old days the current-CPU TLB flush was implicit
in the vmexit, but CPUs have gained more features since then).

I don't know exactly which kind of overhead we're talking about, but doubling
the number of vmexits would probably not be measurable. If you monitor the
number of vmexits: if it's a missing TLB flush you'll see a flood, otherwise
you'll just see double the amount before/after that commit.

If the readonly pte generator is fork() and it's just double the number of
vmexits, the only thing you need is the patch I posted a few days ago that
adds the missing madvise(MADV_DONTFORK). If instead the overhead is massive
and it's a vmexit flood, we also have a missing TLB flush. In that case let's
fix the TLB flush first, and then you can still apply the MADV_DONTFORK. This
kind of fault activity also happens after a swapin from readonly swapcache,
so if there's a vmexit flood we need to fix it before applying MADV_DONTFORK.

Thanks,
Andrea
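As an illustration of the MADV_DONTFORK approach mentioned above (only a
sketch, not the actual patch; the single anonymous region and the
alloc_guest_ram() helper name are assumptions made up for the example), a VMM
can mark its guest RAM so that fork() skips the mapping entirely:

/*
 * Illustrative sketch, not the real patch: mark guest RAM with
 * MADV_DONTFORK so a later fork() (e.g. to launch a helper script) skips
 * the VMA entirely.  The parent's ptes then stay writable and no
 * wp-fault/extra-vmexit cycle follows; the child simply has no mapping of
 * guest RAM, which is fine for fork() + exec() helpers.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

/* alloc_guest_ram() is a made-up helper name for this example. */
static void *alloc_guest_ram(size_t size)
{
	void *ram = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (ram == MAP_FAILED) {
		perror("mmap");
		return NULL;
	}

	/* VM_DONTCOPY: fork()'s copy_page_range() skips this VMA. */
	if (madvise(ram, size, MADV_DONTFORK)) {
		perror("madvise(MADV_DONTFORK)");
		munmap(ram, size);
		return NULL;
	}
	return ram;
}

int main(void)
{
	void *ram = alloc_guest_ram(64UL << 20);	/* arbitrary 64MB */
	if (!ram)
		return 1;
	/* ... hand "ram" to the guest memory slots as usual ... */
	return 0;
}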