Hi All,

We are in the process of setting up a new cloud infrastructure and are deciding whether to use file-backed or LVM-volume-backed virtual machines. I would like to kindly ask the community to confirm some of our performance-related findings and/or advise whether anything can be done about them. I have heard that the performance difference between LVM volumes and files on a filesystem (when used as VM disks) is only about 1-5%, but this is not what we are seeing.

Regarding our test hypervisor: it is filled with SSDs, each capable of up to 130 000 write IOPS, and it has plenty of CPU and RAM. We test performance by running fio inside virtual machines (KVM based) hosted on this hypervisor. In order to achieve comparable and consistent benchmark results, the virtual machines are single-core VMs and CPU hyperthreading is turned off on the hypervisor. Furthermore, CPU cores are dedicated to the virtual machines using CPU pinning (so a particular VM runs only on a particular CPU core).

Here is the fio command:

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=3G --numjobs=1 --readwrite=randwrite

Kernel version: 3.16.0-60-generic

We mount the filesystem with:

mount -o noatime,data=writeback,barrier=0 /dev/md127 /var/lib/nova/instances/

We also disable journaling for testing purposes.

Here are the results when we use an LVM volume as the virtual machine disk versus a file (qcow2 or raw) stored on an EXT4 filesystem:

Test 1: maximum sustainable write IOPS achieved on a single VM:

LVM:  ~16 000 IOPS
EXT4: ~ 4 000 IOPS

Test 2: maximum sustainable write IOPS achieved on the hypervisor by running multiple test VMs:

LVM:  ~40 000 IOPS (at which point MDRAID5 hit 100% CPU utilization)
EXT4: ~20 000 IOPS

So basically LVM seems to perform much better. Note that in the second test the RAID became the bottleneck, so it is possible that the LVM layer would be capable of even more on top of a faster RAID.

In Test 1:
- on LVM we hit 100% utilization of the qemu VM process, split as usr: 50%, sys: 50%, wait: 0%
- on EXT4 we hit 100% utilization of the qemu VM process, split as usr: 30%, sys: 30%, wait: 30%

So EXT4 performance is significantly lower, and when using EXT4 we saw significant wait time. I tried to look into it a bit, using a custom SystemTap script that reports how long a thread was off the CPU and the kernel stack at the point where it went to sleep.
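A minimal sketch of that kind of off-CPU measurement (the probe points, the 1 ms reporting threshold and the PID filtering shown here are illustrative, not the exact script we used):

# offcpu.stp -- illustrative sketch, run as: stap -x <qemu_pid> offcpu.stp
global off_ts, off_bt

probe scheduler.cpu_off {
    if (pid() == target()) {
        off_ts[tid()] = gettimeofday_ns()     # when the thread left the CPU
        off_bt[tid()] = sprint_backtrace()    # kernel stack at that point
    }
}

probe scheduler.cpu_on {
    if (pid() == target() && off_ts[tid()]) {
        waited = gettimeofday_ns() - off_ts[tid()]
        if (waited > 1000000)                 # only report waits longer than 1 ms
            printf("TID: %d waited %d ns here:\n%s\n", tid(), waited, off_bt[tid()])
        delete off_ts[tid()]
        delete off_bt[tid()]
    }
}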
Here is what I observed. When checking what is going on on the CPU which is executing the KVM qemu process of the test VM, it seems to be executing 2 main threads (these 2 threads are responsible for most of the time spent on the CPU) and about 60-70 other threads, which I assume are some filesystem workers. Of the 2 main threads, one seems OK and does not appear to be waiting for a lock or anything - most of the time when I see this thread leaving the CPU it is the normal scheduler interrupt.

The other main thread is spending a lot of time waiting for a lock; when it leaves the CPU, it often does so here:

TID: 9838 waited 4916317 ns here:
 0xffffffffc1e4f12b : 0xffffffffc1e4f12b [stap_b9c4a8366b974feec4893d4b5949417_17490+0x912b/0x0]
 0xffffffffc1e5061b : 0xffffffffc1e5061b [stap_b9c4a8366b974feec4893d4b5949417_17490+0xa61b/0x0]
 0xffffffffc1e51e7a : 0xffffffffc1e51e7a [stap_b9c4a8366b974feec4893d4b5949417_17490+0xbe7a/0x0]
 0xffffffffc1e46014 : 0xffffffffc1e46014 [stap_b9c4a8366b974feec4893d4b5949417_17490+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
 0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
 0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
 0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
 0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
 0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
 0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
 0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
 0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

When looking at the 60-odd worker threads, they often leave the CPU in the following two places and spend a fairly large amount of time waiting there:

First place:

TID: 57936 waited 2092552 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014 [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
 0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
 0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
 0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
 0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

Second place:

TID: 57937 waited 1542013 ns here:
 0xffffffffc18a815b : 0xffffffffc18a815b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x915b/0x0]
 0xffffffffc18a965b : 0xffffffffc18a965b [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xa65b/0x0]
 0xffffffffc18ab09a : 0xffffffffc18ab09a [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0xc09a/0x0]
 0xffffffffc189f014 : 0xffffffffc189f014 [stap_140d3798c84d55c6fb0e060c1ecb741_57914+0x14/0x0]
 0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
 0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
 0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
 0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
 0xffffffff81250ca9 : ext4_file_write_iter+0x79/0x3a0 [kernel]
 0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
 0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
 0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
 0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]

So basically we attribute the lower EXT4 performance to these points where things need to be synchronized using locks, but this is just what we see at a high level, so I would be curious whether the dev community thinks this might be the cause.
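As a possible way to narrow this down (the volume and file paths below are only illustrative, and these are not results we are reporting), the same fio job could be run directly on the hypervisor, once against an LV and once against a file on the EXT4 mount, to see whether the EXT4 O_DIRECT write path alone reproduces the gap with the guest taken out of the picture:

fio --name=lv-direct --filename=/dev/vg_nova/fio-test --ioengine=libaio --direct=1 --bs=4k --iodepth=64 --numjobs=1 --size=3G --readwrite=randwrite

fio --name=file-direct --filename=/var/lib/nova/instances/fio-test --ioengine=libaio --direct=1 --bs=4k --iodepth=64 --numjobs=1 --size=3G --readwrite=randwrite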
All in all, I'd like to ask the following questions:

1) Are the benchmark results what you would expect?
2) Can the lower performance be attributed to the locking?
3) Is there something we could do to improve the performance of the filesystem?
4) Are there any plans for development in this area?

Regards,
Premysl Kouril