Re: lvm cache + qemu-kvm stops working after about 20GB of writes

Richard Landsman - Rimote <richard@xxxxxxxxx> · Thu, 20 Apr 2017 12:32:13 +0200



    Hello everyone,
    Has anybody had the chance to test this setup and reproduce the
      problem? I assumed this would be a commonly used setup these
      days and that a solution would benefit a lot of users. If I can be of any
      assistance, please contact me.

    
    -- 
Kind regards,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07
(Mon-Fri 9:00 to 18:00)

24/7 for outages:
+31 (0)6 - 4388 7949
@RimoteSaS (Twitter service announcements/security updates)
    On 04/10/2017 10:08 AM, Sandro Bonazzola wrote:

    
      Adding Paolo and Miroslav.
      

        On Sat, Apr 8, 2017 at 4:49 PM, Richard Landsman - Rimote <richard@xxxxxxxxx> wrote:

          
              Hello,
              I would really appreciate some help/guidance with this
                problem. First of all, sorry for the long message. I
                would file a bug, but I do not know whether the fault lies
                with me, dm-cache, qemu or (probably) a combination of these. And
                I can imagine some of you have this setup up and running
                without problems (or maybe you think it works, just like
                I did, but it does not):
              PROBLEM

                LVM cache in writeback mode stops working as expected after a
                while with a qemu-kvm VM. A 100% working setup would be
                the holy grail in my opinion... and I must say the performance of
                KVM/qemu is great in the beginning.

              
              DESCRIPTION
              When using software RAID 1 (2x HDD) + software RAID 1
                (2x SSD) and creating a cached LV out of them, the VM
                initially performs great (at least 40,000 IOPS on 4k
                random read/write)! But then after a while (and a lot of
                random IO, ca. 10 - 20 GB) it effectively turns into a
                writethrough cache, although there is plenty of space left on
                the cachedlv.
              

              When working as expected, all writes on the KVM host go to
                the SSDs:
              iostat -x -m 2

                Device:  rrqm/s   wrqm/s     r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
                sda        0.00   324.50    0.00    22.00    0.00   14.94  1390.57     1.90   86.39    0.00   86.39   5.32  11.70
                sdb        0.00   324.50    0.00    22.00    0.00   14.94  1390.57     2.03   92.45    0.00   92.45   5.48  12.05
                sdc        0.00  3932.00    0.00  2191.50    0.00  270.07   252.39    37.83   17.55    0.00   17.55   0.36  78.05
                sdd        0.00  3932.00    0.00  2197.50    0.00  271.01   252.57    38.96   18.14    0.00   18.14   0.36  78.95

              
              When not working as expected, all writes on the KVM host go
                through the SSDs on to the HDDs (effectively disabling
                writeback, so the cache behaves as writethrough):

              
                Device:  rrqm/s   wrqm/s     r/s      w/s   rMB/s   wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
                sda        0.00     7.00  234.50   173.50    0.92    1.95    14.38    29.27   71.27  111.89   16.37   2.45 100.00
                sdb        0.00     3.50  212.00   177.50    0.83    1.95    14.60    35.58   91.24  143.00   29.42   2.57 100.10
                sdc        2.50     0.00  566.00   199.00    2.69    0.78     9.28     0.08    0.11    0.13    0.04   0.10   7.70
                sdd        1.50     0.00   76.00   199.00    0.65    0.78    10.66     0.02    0.07    0.16    0.04   0.07   1.85
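
To put a number on the two iostat samples above, one can compute what share of the write throughput lands on the SSDs. This is a minimal sketch, assuming each device row is on a single line (as iostat prints it), that the header row starts with "Device", and that sdc/sdd are the SSDs and sda/sdb the HDDs as in this setup. The function name ssd_write_share is made up for illustration.

```shell
# ssd_write_share SSD_REGEX HDD_REGEX
# Reads `iostat -x -m` output on stdin and prints the percentage of the
# summed wMB/s (SSD + HDD devices only) that goes to the SSD devices.
# md/dm devices are ignored unless they match one of the two patterns.
ssd_write_share() {
  awk -v ssd="$1" -v hdd="$2" '
    $1 ~ /^Device/ {                       # header row: locate the wMB/s column
      for (i = 1; i <= NF; i++) if ($i == "wMB/s") col = i
      next
    }
    col && NF >= col && $1 ~ ssd { fast += $col }
    col && NF >= col && $1 ~ hdd { slow += $col }
    END { if (fast + slow > 0) printf "%.1f\n", 100 * fast / (fast + slow) }
  '
}
```

Usage would be something like `iostat -x -m 2 5 | ssd_write_share '^sd[cd]$' '^sd[ab]$'`; with several reports in the input the result is an average over all of them. Fed the two samples above, this gives roughly 94.8 for the healthy state and 28.6 for the degraded one.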
              

              Stuff I've checked/tried:
              - The data in the cached LV has at that point not exceeded even
                half of the space, so this should not happen. It even
                happens when only 20% of cachedata is used.

                - It seems to be triggered most of the time when the
                Cpy%Sync column of `lvs -a` is about 30%. But this is
                not always the case!

                - Changing the cachepolicy from smq to cleaner, waiting
                (check with lvs -a whether the flush is done) and then
                switching back to smq seems to help sometimes! But not always...

              
              lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
              lvs -a

              
              lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv

                
                - When mounting the LV inside the host this does not
                  seem to happen!! So it looks like a qemu-kvm /
                dm-cache combination issue. The only difference is that
                inside the host I run mkfs directly on the LV, instead of
                using LVM inside the VM (so it could also be a problem of
                LVM inside the VM on top of LVM on the KVM host? A small
                chance, probably, because for the first 10 - 20 GB
                it works great!)

              
              - Tried disabling SELinux, upgrading to the newest kernels
                (elrepo ml and lt), and played around with the dirty-page
                settings such as /proc/sys/vm/dirty_writeback_centisecs,
                /proc/sys/vm/dirty_expire_centisecs and
                /proc/sys/vm/dirty_ratio, the migration threshold of
                dmsetup, and other probably unimportant settings like
                vm.dirty_bytes.

              
              - When in the "slow state", the system's kworkers are
                excessively using IO (10 - 20 MB per kworker process).
                This seems to be the writeback process (Cpy%Sync)
                because the cache wants to flush to HDD. But the strange
                thing is that after a good sync (0% left), the disk may
                become slow again after a few MB of data. A reboot
                sometimes helps.
              - Have tried iothreads, virtio-scsi, the vcpu driver
                setting on the virtio-scsi controller, cache settings, disk
                schedulers etc. Nothing helped.
              - The new Samsung 950 PRO SSDs have HPA enabled
                (30%!!). I have an AMD FX(tm)-8350 and 16 GB of RAM.
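
The cleaner/smq switch described in the list above can be scripted so that the policy is only switched back once the cache reports no dirty blocks. This is a rough sketch, assuming the `dmsetup status` line for a cache target carries the dirty-block count in its 14th field (the layout documented for the kernel's dm-cache target; worth checking against your kernel version). The helper names dirty_blocks and flush_cache are made up for illustration, and the argument is the device-mapper name, e.g. the XXX-cachedlv placeholder used above.

```shell
# Print the dirty-block count from a `dmsetup status` line read on stdin.
# Assumed field layout: start length "cache" md_block_size md_used/total
# cache_block_size cache_used/total read_hits read_misses write_hits
# write_misses demotions promotions dirty ...
dirty_blocks() {
  awk '$3 == "cache" { print $14 }'
}

# Switch to the cleaner policy, poll until nothing is dirty, then switch
# back to smq. $1 is the device-mapper name of the cached LV.
flush_cache() {
  lvchange --cachepolicy cleaner "/dev/mapper/$1" || return 1
  while dirty=$(dmsetup status "$1" | dirty_blocks) && [ "${dirty:-0}" -gt 0 ]; do
    sleep 5                                # still flushing to the HDDs
  done
  lvchange --cachepolicy smq "/dev/mapper/$1"
}
```

Logging the value of dirty_blocks over time during a fio run would also show whether the slow state really starts at a fixed amount of dirty data.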

              
              It feels like the LVM cache has a threshold (about 20 GB
                of dirty data) and that it stops allowing the
                qemu-kvm process to use writeback caching (the root user
                inside the host does not seem to have this limitation). It
                starts flushing, but only up to a certain point. After a
                few MB of data it is right back in the slow spot
                again. The only solution is waiting for a long time
                (independent of Cpy%Sync) or sometimes changing the
                cachepolicy to force a flush. For me this prevents
                production use of this system. But it's so promising, so
                I hope somebody can help.

              
              Desired state: running the fio test (described in the
                REPRODUCE section) repeatedly should stay fast
                until cachedlv is more or less full. If syncing back to
                disk causes this degradation, it should actually flush
                fully within a reasonable time and give the opportunity
                to write fast again up to a given threshold. It now
                seems like a one-time-use cache that only uses a
                fraction of the SSD and is useless/very unstable
                afterwards.
              REPRODUCE

                1. Install newest CentOS 7 on software RAID 1 HDDs with
                LVM. Keep a lot of space for the LVM cache (no /home)!
                So make the VG as large as possible during anaconda
                partitioning. 

              
              2. Once installed and booted into the system, install
                qemu-kvm

              
              yum install -y centos-release-qemu-ev
              yum install -y qemu-kvm-ev libvirt bridge-utils net-tools

              # disable ksm (probably not important / needed)
              systemctl disable ksm
              systemctl disable ksmtuned

              
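              3. Create the LVM cache
              # set some variables and create a RAID 1 array with the
              # two SSDs

              
              VGBASE=                   # volume group name; fill in (the installer default is cl)
              ssddevice1=/dev/sdX1      # first SSD partition
              ssddevice2=/dev/sdY1      # second SSD partition
              hddraiddevice=/dev/mdXXX  # existing HDD RAID 1 array
              ssdraiddevice=/dev/mdYYY  # SSD RAID 1 array to be created
              mdadm --create --verbose ${ssdraiddevice} --level=mirror \
                --bitmap=none --raid-devices=2 ${ssddevice1} ${ssddevice2}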
              # create PV and extend VG
              pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}

              # create slow LV on HDDs (use max space left if you want)
              pvdisplay ${hddraiddevice}
              lvcreate -l XXXX -n cachedlv ${VGBASE} ${hddraiddevice}
              # create the meta and data LVs: for testing purposes I keep
              # about 20G of the SSD for an uncached LV, to rule out
              # the SSD itself as the cause.
              lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}

              # The rest can be used as cachedata/metadata.
              pvdisplay ${ssdraiddevice}

              # about 1/1000 of the space you have left on the SSD for
              # the meta (minimum of 4)
              lvcreate -l X -n cachemeta ${VGBASE} ${ssdraiddevice}

              # the rest can be used as cachedata
              lvcreate -l XXX -n cachedata ${VGBASE} ${ssdraiddevice}
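
The 1/1000 rule of thumb for the metadata LV above is simple integer arithmetic on the free extent count that pvdisplay reports. This is a small sketch, assuming the "minimum of 4" refers to extents and rounding the metadata size up. cache_split is a hypothetical helper, not an LVM command.

```shell
# cache_split FREE_EXTENTS
# Prints "<meta extents> <data extents>" for the cachemeta/cachedata LVs.
cache_split() {
  free=$1
  meta=$(( (free + 999) / 1000 ))   # about 1/1000 of the free space, rounded up
  if [ "$meta" -lt 4 ]; then        # never below 4 extents
    meta=4
  fi
  echo "$meta $(( free - meta ))"
}
```

For example, `cache_split 25000` prints `25 24975`; the two numbers would go into the `-l X` and `-l XXX` arguments of the lvcreate calls.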
              # convert/combine pools so cachedlv is actually cached
              lvconvert --type cache-pool --cachemode writeback \
                --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
              lvconvert --type cache --cachepool ${VGBASE}/cachedata \
                ${VGBASE}/cachedlv
              

              # my system now looks like this (the VG is called cl, the
              # default of the installer)

              [root@localhost ~]# lvs -a
                LV                VG Attr       LSize   Pool        Origin
                [cachedata]       cl Cwi---C---  97.66g
                [cachedata_cdata] cl Cwi-ao----  97.66g
                [cachedata_cmeta] cl ewi-ao---- 100.00m
                cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
                [cachedlv_corig]  cl owi-aoC---   1.75t
                [lvol0_pmspare]   cl ewi------- 100.00m
                root              cl -wi-ao----  46.56g
                swap              cl -wi-ao----  14.96g
                testssd           cl -wi-a-----  45.47g

                  
                [root@localhost ~]# lsblk
                NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
                sdd                        8:48   0   163G  0 disk
                └─sdd1                     8:49   0   163G  0 part
                  └─md128                  9:128  0 162.9G  0 raid1
                    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
                    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
                    ├─cl-testssd         253:2    0  45.5G  0 lvm
                    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
                      └─cl-cachedlv      253:6    0   1.8T  0 lvm
                sdb                        8:16   0   1.8T  0 disk
                ├─sdb2                     8:18   0   1.8T  0 part
                │ └─md127                  9:127  0   1.8T  0 raid1
                │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
                │   ├─cl-root            253:0    0  46.6G  0 lvm   /
                │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
                │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
                └─sdb1                     8:17   0   954M  0 part
                  └─md126                  9:126  0   954M  0 raid1 /boot
                sdc                        8:32   0   163G  0 disk
                └─sdc1                     8:33   0   163G  0 part
                  └─md128                  9:128  0 162.9G  0 raid1
                    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
                    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
                    ├─cl-testssd         253:2    0  45.5G  0 lvm
                    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
                      └─cl-cachedlv      253:6    0   1.8T  0 lvm
                sda                        8:0    0   1.8T  0 disk
                ├─sda2                     8:2    0   1.8T  0 part
                │ └─md127                  9:127  0   1.8T  0 raid1
                │   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
                │   ├─cl-root            253:0    0  46.6G  0 lvm   /
                │   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
                │     └─cl-cachedlv      253:6    0   1.8T  0 lvm
                └─sda1                     8:1    0   954M  0 part
                  └─md126                  9:126  0   954M  0 raid1 /boot

              
              # now create vm

              wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/

              DISK=/dev/mapper/XXXX-cachedlv

              
              # watch out, my network setup uses a custom bridge/network in
              # the following command. Please replace it with what you
              # normally use.

              virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 \
                --vcpus 7 --disk path=${DISK},cache=none,bus=virtio \
                --network bridge=pubbr,model=virtio \
                --cdrom /home/CentOS-6.9-x86_64-minimal.iso \
                --graphics vnc,port=5998,listen=0.0.0.0 --cpu host

              
              # now connect with client PC to qemu

              virt-viewer --connect=qemu+ssh://root@192.168.0.XXX/system --name CentOS1

              
              And install everything on the single vda disk with LVM (I
              use the defaults in anaconda, but remove the large /home to
              prevent the SSD being overused).

              
              After the install and reboot, log in to the VM and run

              
              yum install epel-release -y && yum install screen fio htop -y

              
              and then run disk test:

              
              fio --randrepeat=1 --ioengine=libaio --direct=1 \
                --gtod_reduce=1 --name=test --filename=test \
                --bs=4k --iodepth=64 --size=4G --readwrite=randrw \
                --rwmixread=75

              
              Then keep repeating the test, but change the filename
              option each time so it does not use the same blocks over and over
              again.
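
Repeating the run with a fresh file name each time can be wrapped in a small loop. This is a sketch, assuming fio is run from a directory on the VM's filesystem with enough free space for all test files. The function name run_fio_series and the FIO override variable are made up here; the variable exists only so the loop can be dry-run with FIO=echo.

```shell
# run_fio_series RUNS
# Runs the fio test RUNS times, writing to test1, test2, ... so that every
# run touches blocks that have not been used by an earlier run.
run_fio_series() {
  runs=$1
  i=1
  while [ "$i" -le "$runs" ]; do
    ${FIO:-fio} --randrepeat=1 --ioengine=libaio --direct=1 \
      --gtod_reduce=1 --name=test --filename="test$i" \
      --bs=4k --iodepth=64 --size=4G --readwrite=randrw \
      --rwmixread=75
    i=$((i + 1))
  done
}
```

`run_fio_series 5` would then create test1 to test5 of 4G each, which is about the number of runs after which the slowdown showed up here.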

              
              In the beginning the performance is great!! Wow, a very
              impressive 150 MB/s 4k random r/w (close to bare metal,
              about 20% - 30% loss). But after a few runs (usually about 4 or
              5, always changing the filename but not overfilling
              the FS), it drops to about 10 MB/s.

              
              Normal/in the beginning:

              
                read : io=3073.2MB, bw=183085KB/s, iops=45771 , runt= 17188msec
                write: io=1022.1MB, bw=60940KB/s, iops=15235 , runt= 17188msec

              
              but then:

              
                read : io=3073.2MB, bw=183085KB/s, iops=2904 , runt= 17188msec
                write: io=1022.1MB, bw=60940KB/s, iops=1751 , runt= 17188msec

              
              or even worse up to the point that it is actually the HDD
              that is written to (about 500 iops).
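
Since the test uses 4k blocks, throughput follows directly from the IOPS figures: IOPS times 4 KB. A tiny sketch for converting the numbers above (the function name iops_to_kbs is made up for illustration):

```shell
# iops_to_kbs IOPS [BLOCK_KB]
# Prints the throughput in KB/s for a given IOPS figure and block size
# in KB (default 4, matching --bs=4k).
iops_to_kbs() {
  echo $(( $1 * ${2:-4} ))
}
```

`iops_to_kbs 45771` gives 183084, in line with the bw=183085KB/s that fio reports for the fast run. For the slow run, `iops_to_kbs 2904` and `iops_to_kbs 1751` give 11616 and 7004 KB/s, so the bw and runt fields shown in the slow sample appear to have been carried over from the fast one; the IOPS values are the ones to compare.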

              
              P.S. When a test is/was slow, that means its file is on the
              HDDs. So even after fixing the problem (sometimes just by
              waiting), that specific file will stay slow when
              redoing the test until it is promoted to the LVM cache (which takes
              a lot of reads, I think). And once on the SSD it sometimes
              stays fast, although a new test file will be slow.
              So I really recommend changing the test file every time
              when trying to see whether a change in speed has occurred.

               
                
            
            _______________________________________________

            CentOS-virt mailing list

            CentOS-virt@xxxxxxxxxx

            https://lists.centos.org/mailman/listinfo/centos-virt

            
        -- 
        SANDRO BONAZZOLA
        ASSOCIATE MANAGER, SOFTWARE ENGINEERING, EMEA ENG VIRTUALIZATION R&D
        Red Hat EMEA
        TRIED. TESTED. TRUSTED.