Hello, I would really appreciate some help/guidance with this problem. First of all, sorry for the long message. I would file a bug, but I do not know whether the fault lies with me, with dm-cache, with qemu, or (probably) a combination of them. And I can imagine some of you have this setup up and running without problems (or maybe you think it works, just like I did, but it does not).

PROBLEM DESCRIPTION

When using software RAID 1 (2x HDD) + software RAID 1 (2x SSD) and creating a cached LV out of them, a VM on top initially performs great (at least 40,000 IOPS on 4k random read/write)! But after a while, and a lot of random IO (ca. 10-20 GB), the cache effectively turns into a writethrough cache, although there is plenty of space left on the cached LV.
When working as expected, iostat -x -m 2 on the KVM host shows all writes going to the SSDs.

When not working as expected, all writes go through the SSD and on to the HDDs (effectively disabling writeback, so it becomes a writethrough):

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
(iostat data rows omitted)
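For completeness, the cache occupancy and dirty state can also be checked on the host with something like the following (assuming an lvm2 recent enough to have the cache_* report fields; the dmsetup status field layout varies by kernel version):

# how full/dirty is the cache pool right now
lvs -a -o +cache_total_blocks,cache_used_blocks,cache_dirty_blocks

# lower-level view straight from device-mapper
dmsetup status XXX-cachedlv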
Stuff I've checked/tried:

- The data in the cached LV has not even exceeded half of the cache space when this happens; it even happens when only 20% of cachedata is used.

- Forcing a flush by switching the cache policy and back:

  lvchange --cachepolicy cleaner /dev/mapper/XXX-cachedlv
  lvs -a        # watch CPY%Sync drain
  lvchange --cachepolicy smq /dev/mapper/XXX-cachedlv

- Tried disabling SELinux and upgrading to the newest kernels (elrepo ml and lt).

- Played around with the dirty writeback tunables like /proc/sys/vm/dirty_writeback_centisecs, /proc/sys/vm/dirty_expire_centisecs, /proc/sys/vm/dirty_ratio and vm.dirty_bytes, and with the migration threshold of dmsetup (a sketch of what I mean follows right after this list), plus other probably unimportant stuff.

- When in the "slow state" the system's kworkers use IO excessively (10-20 MB per kworker process). This seems to be the writeback process (CPY%Sync), because the cache wants to flush to the HDDs. The strange thing is that after a good sync (0% left), the disk may become slow again after a few MB of data. A reboot sometimes helps.

- Have tried iothreads, virtio-scsi, the vcpu driver setting on the virtio-scsi controller, cache settings, disk schedulers, etc. Nothing helped.

- The new Samsung 950 PRO SSDs have HPA enabled (30%!).

Hardware: AMD FX(tm)-8350, 16 GB RAM.
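This is roughly how I have been poking at the migration threshold; I am not certain these are the right knobs, and the exact syntax/availability depends on the lvm2 and kernel version, so treat it as a sketch:

# show the current policy and its settings (needs an lvm2 with these report fields)
lvs -o +cache_policy,cache_settings /dev/mapper/XXX-cachedlv

# raise the migration threshold (in sectors, if I read the kernel docs right),
# so writeback to the HDDs may move more data at a time
lvchange --cachesettings 'migration_threshold=16384' /dev/mapper/XXX-cachedlv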
It feels like the LVM cache has a threshold (about 20 GB of dirty data) after which it stops allowing the qemu-kvm process to use writeback caching (the root filesystem used inside the host does not seem to have this limitation). It starts flushing, but only to a certain point; after a few more MB of data it is right back in the slow spot again. The only solution is waiting for a long time (independent of CPY%Sync), or sometimes changing the cache policy and forcing a flush. This prevents me from using this system in production, but it is so promising that I hope somebody can help.

DESIRED STATE

Running the fio test (described in the REPRODUCE section) repeatedly should keep being fast until the cached LV is more or less full. If writing back to disk causes this degradation, the cache should actually flush fully within a reasonable time and then give the opportunity to write fast again up to a given threshold. Right now it seems like a one-time-use cache that only uses a fraction of the SSD and is useless/very unstable afterwards.

REPRODUCE

1. Install CentOS 7 with the two HDDs in software RAID 1 (/boot on its own RAID 1, root and swap on LVM, as in the lsblk output further down).

2. Once installed and booted into the system, install qemu-kvm:

yum install -y centos-release-qemu-ev

3. Create the LVM cache.

# set some variables and create a RAID 1 array with the two SSDs
VGBASE=                     # name of the existing volume group ("cl" in the listings below)
ssddevice1=/dev/sdX1
ssddevice2=/dev/sdY1
hddraiddevice=/dev/mdXXX
ssdraiddevice=/dev/mdXXX

mdadm --create --verbose ${ssdraiddevice} --level=mirror --bitmap=none --raid-devices=2 ${ssddevice1} ${ssddevice2}

# create a PV on the SSD array and extend the existing VG with it
pvcreate ${ssdraiddevice} && vgextend ${VGBASE} ${ssdraiddevice}
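A side note that may or may not matter for the numbers below: right after mdadm --create the new mirror does its initial resync in the background, so keep an eye on that before benchmarking:

cat /proc/mdstat
# or, for the new SSD array only
mdadm --detail ${ssdraiddevice}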
# create the slow LV on the HDDs (use the max space left if you want)
pvdisplay ${hddraiddevice}

# create the cache meta and data LVs on the SSD; for testing purposes I keep
# about 20G of the SSD for an uncached LV, to rule out the SSD itself
lvcreate -l XX -n testssd ${VGBASE} ${ssdraiddevice}

# the rest can be used as cache data/metadata
pvdisplay ${ssdraiddevice}

(The lvcreate calls for cachedlv, cachemeta and cachedata go here; see the sketch right after the lvconvert commands below.)

# convert/combine the pools so cachedlv is actually cached
lvconvert --type cache-pool --cachemode writeback --poolmetadata ${VGBASE}/cachemeta ${VGBASE}/cachedata
lvconvert --type cache --cachepool ${VGBASE}/cachedata ${VGBASE}/cachedlv
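For clarity, the creation of those LVs looks roughly like this; sizes and order are reconstructed from the lvs output below, so treat it as a sketch rather than my exact commands:

# slow origin LV on the HDD PV (takes whatever is left of the HDD RAID)
lvcreate -n cachedlv -l 100%FREE ${VGBASE} ${hddraiddevice}

# cache metadata and cache data LVs on the SSD PV (after testssd has been carved out)
lvcreate -n cachemeta -L 100M ${VGBASE} ${ssdraiddevice}
lvcreate -n cachedata -l 100%FREE ${VGBASE} ${ssdraiddevice}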
[root@localhost ~]# lvs -a
  LV                VG Attr       LSize   Pool        Origin
  [cachedata]       cl Cwi---C---  97.66g
  [cachedata_cdata] cl Cwi-ao----  97.66g
  [cachedata_cmeta] cl ewi-ao---- 100.00m
  cachedlv          cl Cwi-aoC---   1.75t [cachedata] [cachedlv_corig]
  [cachedlv_corig]  cl owi-aoC---   1.75t
  [lvol0_pmspare]   cl ewi------- 100.00m
  root              cl -wi-ao----  46.56g
  swap              cl -wi-ao----  14.96g
  testssd           cl -wi-a-----  45.47g

[root@localhost ~]# lsblk
NAME                     MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sdd                        8:48   0   163G  0 disk
└─sdd1                     8:49   0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
    ├─cl-testssd         253:2    0  45.5G  0 lvm
    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
      └─cl-cachedlv      253:6    0   1.8T  0 lvm
sdb                        8:16   0   1.8T  0 disk
├─sdb2                     8:18   0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
│   ├─cl-root            253:0    0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
│     └─cl-cachedlv      253:6    0   1.8T  0 lvm
└─sdb1                     8:17   0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot
sdc                        8:32   0   163G  0 disk
└─sdc1                     8:33   0   163G  0 part
  └─md128                  9:128  0 162.9G  0 raid1
    ├─cl-cachedata_cmeta 253:4    0   100M  0 lvm
    │ └─cl-cachedlv      253:6    0   1.8T  0 lvm
    ├─cl-testssd         253:2    0  45.5G  0 lvm
    └─cl-cachedata_cdata 253:3    0  97.7G  0 lvm
      └─cl-cachedlv      253:6    0   1.8T  0 lvm
sda                        8:0    0   1.8T  0 disk
├─sda2                     8:2    0   1.8T  0 part
│ └─md127                  9:127  0   1.8T  0 raid1
│   ├─cl-swap            253:1    0    15G  0 lvm   [SWAP]
│   ├─cl-root            253:0    0  46.6G  0 lvm   /
│   └─cl-cachedlv_corig  253:5    0   1.8T  0 lvm
│     └─cl-cachedlv      253:6    0   1.8T  0 lvm
└─sda1                     8:1    0   954M  0 part
  └─md126                  9:126  0   954M  0 raid1 /boot

# now create the VM
wget http://ftp.tudelft.nl/centos.org/6/isos/x86_64/CentOS-6.9-x86_64-minimal.iso -P /home/

DISK=/dev/mapper/XXXX-cachedlv

# watch out, my network setup uses a custom bridge/network in the following
# command; please replace with what you normally use
virt-install -n CentOS1 -r 12000 --os-variant=centos6.7 --vcpus 7 \
  --disk path=${DISK},cache=none,bus=virtio \
  --network bridge=pubbr,model=virtio \
  --cdrom /home/CentOS-6.9-x86_64-minimal.iso \
  --graphics vnc,port=5998,listen=0.0.0.0 --cpu host

# now connect from a client PC to qemu
virt-viewer --connect=qemu+ssh://root@xxxxxxxxxxxxx/system --name CentOS1

Install everything in the VM on the single vda disk with LVM (I use the anaconda defaults, but remove the large /home to prevent the SSD from being overused). After install and reboot, log in to the VM and:

yum install epel-release -y && yum install screen fio htop -y

Then run the disk test:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

Then keep repeating, but change the filename attribute each time so it does not use the same blocks over and over again (see the loop sketch after the results below). In the beginning the performance is great: a very impressive ~150 MB/s of 4k random r/w (close to bare metal, about 20-30% loss). But after a few runs (usually about 4 or 5, always changing the filename and without overfilling the FS), it drops to about 10 MB/s.

Normal / in the beginning:

  read : io=3073.2MB, bw=183085KB/s, iops=45771, runt= 17188msec
  write: io=1022.1MB, bw=60940KB/s, iops=15235, runt= 17188msec

But then:

  read : io=3073.2MB, bw=183085KB/s, iops=2904, runt= 17188msec
  write: io=1022.1MB, bw=60940KB/s, iops=1751, runt= 17188msec

or even worse, up to the point that it is actually the HDD that is being written to (about 500 IOPS).

P.S. When a test is/was slow, that means that file is on the HDDs.
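To make the repeated runs less error-prone, a small loop along these lines is what I mean (the test${i} file names are just an example):

# run fio several times, each time on a fresh file so new blocks are hit;
# keep an eye on free space in the guest filesystem so it does not fill up
for i in $(seq 1 5); do
  fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
      --name=test${i} --filename=test${i} --bs=4k --iodepth=64 \
      --size=4G --readwrite=randrw --rwmixread=75
done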
One more note on the slow files: even when the problem clears (sometimes just by waiting), that specific file will keep being slow when redoing the test until it is promoted to the LVM cache (which seems to take a lot of reads). And once it is on the SSD it sometimes keeps being fast, although a new test file will be slow. So I really recommend changing the test file every time when trying to see whether a change in speed has occurred.

--
Kind regards,

Richard Landsman
http://rimote.nl

T: +31 (0)50 - 763 04 07 (Mon-Fri 9:00 to 18:00)
24/7 for outages: +31 (0)6 - 4388 7949
@RimoteSaS (Twitter service notices/security updates)