RE: [Qemu-devel] vm performance degradation after kvm live migration or save-restore with EPT enabled

>>> >>> The QEMU command line (/var/log/libvirt/qemu/[domain name].log):
>>> >>> LC_ALL=C PATH=/bin:/sbin:/usr/bin:/usr/sbin HOME=/
>>> >>> QEMU_AUDIO_DRV=none
>>> >>> /usr/local/bin/qemu-system-x86_64 -name ATS1 -S -M pc-0.12 -cpu qemu32
>>> >>> -enable-kvm -m 12288 -smp 4,sockets=4,cores=1,threads=1
>>> >>> -uuid 0505ec91-382d-800e-2c79-e5b286eb60b5 -no-user-config -nodefaults
>>> >>> -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/ATS1.monitor,server,nowait
>>> >>> -mon chardev=charmonitor,id=monitor,mode=control
>>> >>> -rtc base=localtime -no-shutdown
>>> >>> -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2
>>> >>> -drive file=/opt/ne/vm/ATS1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none
>>> >>> -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x8,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
>>> >>> -netdev tap,fd=20,id=hostnet0,vhost=on,vhostfd=21
>>> >>> -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:e0:fc:00:0f:00,bus=pci.0,addr=0x3,bootindex=2
>>> >>> -netdev tap,fd=22,id=hostnet1,vhost=on,vhostfd=23
>>> >>> -device virtio-net-pci,netdev=hostnet1,id=net1,mac=00:e0:fc:01:0f:00,bus=pci.0,addr=0x4
>>> >>> -netdev tap,fd=24,id=hostnet2,vhost=on,vhostfd=25
>>> >>> -device virtio-net-pci,netdev=hostnet2,id=net2,mac=00:e0:fc:02:0f:00,bus=pci.0,addr=0x5
>>> >>> -netdev tap,fd=26,id=hostnet3,vhost=on,vhostfd=27
>>> >>> -device virtio-net-pci,netdev=hostnet3,id=net3,mac=00:e0:fc:03:0f:00,bus=pci.0,addr=0x6
>>> >>> -netdev tap,fd=28,id=hostnet4,vhost=on,vhostfd=29
>>> >>> -device virtio-net-pci,netdev=hostnet4,id=net4,mac=00:e0:fc:0a:0f:00,bus=pci.0,addr=0x7
>>> >>> -netdev tap,fd=30,id=hostnet5,vhost=on,vhostfd=31
>>> >>> -device virtio-net-pci,netdev=hostnet5,id=net5,mac=00:e0:fc:0b:0f:00,bus=pci.0,addr=0x9
>>> >>> -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0
>>> >>> -vnc *:0 -k en-us -vga cirrus
>>> >>> -device i6300esb,id=watchdog0,bus=pci.0,addr=0xb -watchdog-action poweroff
>>> >>> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0xa
>>> >>> 
>>> >>Which QEMU version is this? Can you try with e1000 NICs instead of virtio?
>>> >>
>>> >This QEMU version is 1.0.0, but I also tested QEMU 1.5.2; the same problem exists there, including the performance degradation and the flooding of read-only GFNs.
>>> >I tried e1000 NICs instead of virtio with QEMU 1.5.2, and both the performance degradation and the flooding of read-only GFNs still occur.
>>> >With either e1000 or virtio NICs, the GFN flooding begins in the post-restore stage (i.e. the running stage): it starts as soon as the restore completes.
>>> >
>>> >Thanks,
>>> >Zhang Haoyu
>>> >
>>> >>--
>>> >>			Gleb.
>>> 
>>> Should we focus on the first bad commit (612819c3c6e67bac8fceaa7cc402f13b1b63f7e4) and the surprising GFN flooding?
>>> 
>>Not really. There is no point in debugging a very old version compiled 
>>with kvm-kmod; there are too many variables in the environment. I cannot 
>>reproduce the GFN flooding on upstream, so the problem may be gone, may 
>>be the result of a kvm-kmod problem, or may come from something different 
>>in how I invoke qemu. So the best way to proceed is for you to reproduce 
>>with the upstream version; then at least I will be sure that we are using 
>>the same code.
>>
>Thanks, I will test combinations of the upstream KVM kernel and upstream QEMU.
>Also, the guest OS version I gave earlier was wrong; the guest OS currently running is SLES10SP4.
>
I tested the following combinations of QEMU and kernel:
+-----------------+-----------------+-----------------+
|  kvm kernel     |      QEMU       |   test result   |
+-----------------+-----------------+-----------------+
|  kvm-3.11-2     |   qemu-1.5.2    |      GOOD       |
+-----------------+-----------------+-----------------+
|  SLES11SP2      |   qemu-1.0.0    |      BAD        |
+-----------------+-----------------+-----------------+
|  SLES11SP2      |   qemu-1.4.0    |      BAD        |
+-----------------+-----------------+-----------------+
|  SLES11SP2      |   qemu-1.4.2    |      BAD        |
+-----------------+-----------------+-----------------+
|  SLES11SP2      | qemu-1.5.0-rc0  |      GOOD       |
+-----------------+-----------------+-----------------+
|  SLES11SP2      |   qemu-1.5.0    |      GOOD       |
+-----------------+-----------------+-----------------+
|  SLES11SP2      |   qemu-1.5.1    |      GOOD       |
+-----------------+-----------------+-----------------+
|  SLES11SP2      |   qemu-1.5.2    |      GOOD       |
+-----------------+-----------------+-----------------+
NOTE:
1. kvm-3.11-2 in the table above is the complete kernel at that tag, downloaded from https://git.kernel.org/pub/scm/virt/kvm/kvm.git
2. The SLES11SP2 kernel version is 3.0.13-0.27.

Then I ran git bisect on the QEMU changes between qemu-1.4.2 and qemu-1.5.0-rc0, marking the good (fixed) version as bad and the bad (degraded) version as good,
so that the "first bad commit" reported by the bisect is the patch that fixes the degradation problem.
+------------+-------------------------------------------+-----------------+-----------------+
| bisect No. |                  commit                   |  save-restore   |    migration    |
+------------+-------------------------------------------+-----------------+-----------------+
|      1     | 03e94e39ce5259efdbdeefa1f249ddb499d57321  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      2     | 99835e00849369bab726a4dc4ceed1f6f9ed967c  |      GOOD       |       GOOD      |
+------------+-------------------------------------------+-----------------+-----------------+
|      3     | 62e1aeaee4d0450222a0ea43c713b59526e3e0fe  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      4     | 9d9801cf803cdceaa4845fe27150b24d5ab083e6  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      5     | d76bb73549fcac07524aea5135280ea533a94fd6  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      6     | d913829f0fd8451abcb1fd9d6dfce5586d9d7e10  |      GOOD       |       GOOD      |
+------------+-------------------------------------------+-----------------+-----------------+
|      7     | d2f38a0acb0a1c5b7ab7621a32d603d08d513bea  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      8     | e344b8a16de429ada3d9126f26e2a96d71348356  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      9     | 56ded708ec38e4cb75a7c7357480ca34c0dc6875  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      10    | 78d07ae7ac74bcc7f79aeefbaff17fb142f44b4d  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      11    | 70c8652bf3c1fea79b7b68864e86926715c49261  |      GOOD       |       GOOD      |
+------------+-------------------------------------------+-----------------+-----------------+
|      12    | f1c72795af573b24a7da5eb52375c9aba8a37972  |      GOOD       |       GOOD      |
+------------+-------------------------------------------+-----------------+-----------------+
NOTE: The above tests were run on SLES11SP2.

So commit f1c72795af573b24a7da5eb52375c9aba8a37972 is the patch that fixes the degradation.
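
To illustrate the inverted bisect described above, here is a minimal sketch of the underlying binary search. It assumes a linear history ordered oldest to newest and a monotone "is fixed" test; real QEMU history is a DAG, and the commit names and the is_fixed callback below are hypothetical stand-ins for the manual save-restore/migration test.

```python
def first_fixed(commits, is_fixed):
    """Return the earliest commit for which is_fixed() is true.

    Assumes commits are ordered oldest to newest and that once the fix
    lands every later commit is also fixed (the same monotonicity that
    a bisect relies on).
    """
    lo, hi = 0, len(commits) - 1
    if not is_fixed(commits[hi]):
        raise ValueError("newest commit is not fixed")
    # Invariant: commits[hi] is fixed, everything before commits[lo] is not.
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid          # the fix is at mid or earlier
        else:
            lo = mid + 1      # the fix is strictly after mid
    return commits[lo]


# Hypothetical linear history; "c5" stands in for the fixing commit.
history = [f"c{i}" for i in range(10)]
print(first_fixed(history, lambda c: int(c[1:]) >= 5))
```

Searching for the first fixed commit is the same search as for the first bad commit with the two labels swapped, which is why marking the versions the other way round works.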

Then I replaced SLES11SP2's default kvm-kmod with kvm-kmod-3.6 and applied the patch below to __direct_map():
@@ -2599,6 +2599,9 @@ static int __direct_map(struct kvm_vcpu
        int emulate = 0;
        gfn_t pseudo_gfn;

+        if (!map_writable)
+                printk(KERN_ERR "%s: %s: gfn = %llu \n", __FILE__, __func__, gfn);
+
        for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
                if (iterator.level == level) {
                        unsigned pte_access = ACC_ALL;
I then rebuilt kvm-kmod, re-insmod'ed it, and tested the adjacent commits again. The results are shown below:
+------------+-------------------------------------------+-----------------+-----------------+
| bisect No. |                  commit                   |  save-restore   |    migration    |
+------------+-------------------------------------------+-----------------+-----------------+
|      10    | 78d07ae7ac74bcc7f79aeefbaff17fb142f44b4d  |      BAD        |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
|      12    | f1c72795af573b24a7da5eb52375c9aba8a37972  |      GOOD       |       BAD       |
+------------+-------------------------------------------+-----------------+-----------------+
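
The printk added to __direct_map() emits one kernel log line per read-only mapping, so the flooding can be quantified rather than eyeballed. The following is a minimal sketch that counts such lines per second. The "[seconds.micros]" timestamp prefix is an assumption (it is only present when printk timestamps are enabled on the host), and the sample lines, including the file path, are made up for illustration.

```python
import re
from collections import Counter

# Matches the message format of the printk in the patch:
#   "%s: %s: gfn = %llu \n" with __FILE__ and __func__.
# The leading timestamp is assumed, see the note above.
LINE_RE = re.compile(r"^\[\s*(\d+)\.(\d+)\]\s.*__direct_map: gfn = (\d+)")


def gfns_per_second(lines):
    """Count read-only GFN messages per whole second of kernel uptime."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[int(m.group(1))] += 1
    return dict(sorted(counts.items()))


# Hypothetical log excerpt; timestamps and path are not from a real run.
sample = [
    "[ 1200.000100] /kvm/x86/mmu.c: __direct_map: gfn = 2073462 ",
    "[ 1200.000400] /kvm/x86/mmu.c: __direct_map: gfn = 2857203 ",
    "[ 1201.500000] /kvm/x86/mmu.c: __direct_map: gfn = 2073463 ",
    "[ 1201.600000] unrelated kernel message",
]
print(gfns_per_second(sample))
```

Feeding it the kernel log captured around a restore or migration would give a per-second count that can be compared between commits.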
While testing commit 78d07ae7ac74bcc7f79aeefbaff17fb142f44b4d, the GFN flooding starts as soon as the restoration/migration completes.
Some example GFNs are shown below:
2073462
2857203
2073463
2073464
2073465
3218751
2073466
2857206
2857207
2073467
2073468
2857210
2857211
3218752
2857214
2857215
3218753
2857217
2857218
2857221
2857222
3218754
2857225
2857226
3218755
2857229
2857230
2857232
2857233
3218756
2780393
2780394
2857236
2780395
2857237
2780396
2780397
2780398
2780399
2780400
2780401
3218757
2857240
2857241
2857244
3218758
2857247
2857248
2857251
2857252
3218759
2857255
2857256
3218760
2857289
2857290
2857293
2857294
3218761
2857297
2857298
3218762
3218763
3218764
3218765
3218766
3218767
3218768
3218769
3218770
3218771
3218772

After a period of time, the flooding rate slowed down.
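
The GFN list above looks like a few interleaved ascending runs (around 2073462, 2857203, 3218751 and 2780393). A minimal sketch for separating such runs and converting a GFN to its guest physical address follows. It assumes 4 KiB pages (PAGE_SHIFT = 12, as on x86); the max_gap threshold of 8 pages is an arbitrary choice for illustration, not something taken from KVM.

```python
PAGE_SHIFT = 12  # assumed: 4 KiB pages on x86


def gfn_to_gpa(gfn):
    """Guest frame number -> guest physical address of the page start."""
    return gfn << PAGE_SHIFT


def split_streams(gfns, max_gap=8):
    """Group GFNs into ascending runs.

    A GFN joins the first existing run whose last GFN is at most
    max_gap pages below it; otherwise it starts a new run.
    """
    streams = []
    for gfn in gfns:
        for s in streams:
            if 0 < gfn - s[-1] <= max_gap:
                s.append(gfn)
                break
        else:
            streams.append([gfn])
    return streams


# The first twelve GFNs from the list above.
sample = [2073462, 2857203, 2073463, 2073464, 2073465, 3218751,
          2073466, 2857206, 2857207, 2073467, 2073468, 2857210]
for s in split_streams(sample):
    print(f"{s[0]}..{s[-1]} ({len(s)} pages), "
          f"starts at GPA {gfn_to_gpa(s[0]):#x}")
```

Grouping the output this way makes it easier to see whether the read-only faults walk linearly through a few guest memory regions.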

While testing commit f1c72795af573b24a7da5eb52375c9aba8a37972, no GFN was printed after restoration, and there was no performance degradation.
However, as soon as live migration completed, the GFN flooding started and performance degradation also occurred.

NOTE: The test results for commit f1c72795af573b24a7da5eb52375c9aba8a37972 seem to be unstable; I will verify them again.


>Thanks,
>Zhang Haoyu
>
>>> I applied the patch below to __direct_map(),
>>> @@ -2223,6 +2223,8 @@ static int __direct_map(struct kvm_vcpu
>>>         int pt_write = 0;
>>>         gfn_t pseudo_gfn;
>>> 
>>> +        map_writable = true;
>>> +
>>>         for_each_shadow_entry(vcpu, (u64)gfn << PAGE_SHIFT, iterator) {
>>>                 if (iterator.level == level) {
>>>                         unsigned pte_access = ACC_ALL;
>>> and rebuilt the kvm-kmod, then re-insmod'ed it.
>>> After I started a VM, the host behaved abnormally: many programs could not be started and reported segmentation faults.
>>> In my opinion, with the above patch applied, commit 612819c3c6e67bac8fceaa7cc402f13b1b63f7e4 should have no effect, but the test result proved me wrong.
>>> Does the way map_writable is obtained in hva_to_pfn() affect the result?
>>> 
>>If hva_to_pfn() returns map_writable == false it means that page is 
>>mapped as read only on primary MMU, so it should not be mapped writable 
>>on secondary MMU either. This should not happen usually.
>>
>>--
>>			Gleb.

