Re: EXT4 vs LVM performance for VMs

On Mon, Feb 15, 2016 at 07:56:05PM +0100, Premysl Kouril wrote:
> Hello Dave,
> 
> thanks for your suggestion. I've just recreated our tests with XFS
> and preallocated raw files, and the results seem almost the same as with
> EXT4. I checked again with my SystemTap script, and the threads
> are mostly waiting for locks in the following places:
> 
> The main KVM thread:
> 
> 
> TID: 4532 waited 5135787 ns here:
>  0xffffffffc18a815b : 0xffffffffc18a815b
> [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x915b/0x0]
>  0xffffffffc18a964b : 0xffffffffc18a964b
> [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xa64b/0x0]
>  0xffffffffc18ab07a : 0xffffffffc18ab07a
> [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0xc07a/0x0]
>  0xffffffffc189f014 : 0xffffffffc189f014
> [stap_eb6b67472fc672bbb457a915ab0fb97_10634+0x14/0x0]
>  0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
>  0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
>  0xffffffffc0857c2d : 0xffffffffc0857c2d [kvm]
>  0xffffffff810b6250 : autoremove_wake_function+0x0/0x40 [kernel]
>  0xffffffffc0870f41 : 0xffffffffc0870f41 [kvm]
>  0xffffffffc085ace2 : 0xffffffffc085ace2 [kvm]
>  0xffffffff812132d8 : fsnotify+0x228/0x2f0 [kernel]
>  0xffffffff810e7e9a : do_futex+0x10a/0x6a0 [kernel]
>  0xffffffff811e8890 : do_vfs_ioctl+0x2e0/0x4c0 [kernel]
>  0xffffffffc0864ce4 : 0xffffffffc0864ce4 [kvm]
>  0xffffffff811e8af1 : sys_ioctl+0x81/0xa0 [kernel]
>  0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]
> 
> 
> 
> Worker threads (KVM worker or kernel worker):
> 
> 
> TID: 12139 waited 7939986 ns here:
>  0xffffffffc1e4f15b : 0xffffffffc1e4f15b
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
>  0xffffffffc1e5065b : 0xffffffffc1e5065b
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
>  0xffffffffc1e5209a : 0xffffffffc1e5209a
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
>  0xffffffffc1e46014 : 0xffffffffc1e46014
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
>  0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
>  0xffffffff8176d839 : schedule+0x29/0x70 [kernel]
>  0xffffffff810e511e : futex_wait_queue_me+0xde/0x140 [kernel]
>  0xffffffff810e5c42 : futex_wait+0x182/0x290 [kernel]
>  0xffffffff810e7e6e : do_futex+0xde/0x6a0 [kernel]
>  0xffffffff810e84a1 : SyS_futex+0x71/0x150 [kernel]
>  0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]
> 
> 
> 
> TID: 12139 waited 11219902 ns here:
>  0xffffffffc1e4f15b : 0xffffffffc1e4f15b
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x915b/0x0]
>  0xffffffffc1e5065b : 0xffffffffc1e5065b
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xa65b/0x0]
>  0xffffffffc1e5209a : 0xffffffffc1e5209a
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0xc09a/0x0]
>  0xffffffffc1e46014 : 0xffffffffc1e46014
> [stap_18a0b929d6dbf8774f2b8457b465301_12541+0x14/0x0]
>  0xffffffff8176d450 : __schedule+0x3e0/0x7a0 [kernel]
>  0xffffffff8176dd49 : schedule_preempt_disabled+0x29/0x70 [kernel]
>  0xffffffff8176fb95 : __mutex_lock_slowpath+0xd5/0x1d0 [kernel]
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
>  0xffffffff811d5251 : new_sync_write+0x81/0xb0 [kernel]
>  0xffffffff811d5a07 : vfs_write+0xb7/0x1f0 [kernel]
>  0xffffffff811d6732 : sys_pwrite64+0x72/0xb0 [kernel]
>  0xffffffff817718cd : system_call_fastpath+0x1a/0x1f [kernel]
> 
> 
> 
> 
> Looking at this, does it suggest that the bottleneck is locking in the
> VFS layer? Or is my setup actually doing direct I/O at the host level?
> You and Sanidhya mentioned that XFS is good at concurrent direct I/O as
> it doesn't hold a lock on the file, but I do see this in the trace:
> 
>  0xffffffff8176fcaf : mutex_lock+0x1f/0x2f [kernel]
>  0xffffffffc073c531 : 0xffffffffc073c531 [xfs]
>  0xffffffffc07a788a : 0xffffffffc07a788a [xfs]
>  0xffffffffc073dd4c : 0xffffffffc073dd4c [xfs]
> 
> So either KVM is not doing direct I/O, or there is some lock XFS must
> hold to do the write, right?
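
One way to check on the host whether qemu really opened the image with
O_DIRECT - a rough sketch, where the binary name and the fd number NN are
only placeholders for whatever your setup uses:

  pid=$(pidof qemu-system-x86_64)
  ls -l /proc/$pid/fd                # find the fd open on the image file
  grep flags /proc/$pid/fdinfo/NN    # flags are octal; on x86 O_DIRECT is 040000

If the O_DIRECT bit is not set there, the writes are buffered and take the
inode mutex exclusively, which would match the mutex_lock in the stack above.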

Was this gathered with qemu bound to a single CPU?

fio uses iodepth=64, but blk-mq uses per-cpu or per-node queues.

Not sure if blk-mq is available on 3.16.0.
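
Something like this should answer both questions, assuming the backing
device is sda and the qemu binary is qemu-system-x86_64 (adjust the names
for your setup):

  ls /sys/block/sda/mq > /dev/null 2>&1 && echo blk-mq || echo legacy queue
  taskset -acp 0 $(pidof qemu-system-x86_64)   # pin qemu and all its threads to CPU 0

The mq directory only exists when the device is driven by blk-mq, and
pinning qemu to one CPU makes it easy to see whether the host side is
simply running out of CPU.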

Thanks,

-liubo

> 
> 
> Regards,
> Premysl Kouril
> 
> 
> 
> 
> 
> 
> 
> On Sat, Feb 13, 2016 at 3:15 AM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> >> > All of this being said, what are you trying to do?  If you are happy
> >> > using LVM, feel free to use it.  If there are specific features that
> >> > you want out of the file system, it's best that you explicitly
> >> > identify what you want, and so we can minimize the cost of the
> >> > features of what you want.
> >>
> >>
> >> We are trying to decide whether to use a filesystem or LVM for VM
> >> storage. It's not that we are happy with LVM - while it performs
> >> better, there are limitations on the LVM side, especially when it comes to
> >> manageability (for example, certain features in OpenStack only work
> >> if the VM is file-based).
> >>
> >> So, in short, if we could make the filesystem perform better we would
> >> rather use a filesystem than LVM (and we don't really have any special
> >> requirements in terms of filesystem features).
> >>
> >> And in order for us to make a good decision, I wanted to ask the community
> >> whether our observations and resulting numbers make sense.
> >
> > For ext4, this is what you are going to get.
> >
> > How about you try XFS? After all, concurrent direct IO writes are
> > something it is rather good at.
> >
> > i.e. use XFS in both your host and guest. Use raw image files on the
> > host, and to make things roughly even with LVM you'll want to
> > preallocate them. If you don't want to preallocate them (i.e. sparse
> > image files), set them up with an extent size hint of at least 1MB so
> > that it limits fragmentation of the image file.  Then configure qemu
> > to use cache=none for its IO to the image file.
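
To make the preallocated variant concrete, a minimal sketch (path and size
are purely examples, not from the original setup):

  # xfs_io -f -c "falloc 0 100g" /mnt/fast-ssd/vm-prealloc.img

and then point qemu at it with cache=none, e.g.:

  -drive file=/mnt/fast-ssd/vm-prealloc.img,if=virtio,cache=none,format=raw

falloc reserves the space as unwritten extents, so the first write pass
pays only the unwritten extent conversion cost described below.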
> >
> > On the first write pass to the image file (in either case), you
> > should see ~70-80% of the native underlying device performance
> > because there is some overhead in either allocation (sparse image
> > file) or unwritten extent conversion (preallocated image file).
> > This, of course, assumes you are not CPU limited in the QEMU
> > process by the additional CPU overhead of file block mapping in the
> > host filesystem vs raw block device IO.
> >
> > On the second write pass you should see 98-99% of the native
> > underlying device performance (again with the assumption that CPU
> > overhead of the host filesystem isn't a limiting factor).
> >
> > As an example, I have a block device that can sustain just under 36k
> > random 4k write IOPS on my host. I have an XFS filesystem (default
> > configs) on that 400GB block device. I created a sparse 500TB image
> > file using:
> >
> > # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
> >
> > And push it into a 16p/16GB RAM guest via:
> >
> > -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
> >
> > and in the guest run mkfs.xfs with defaults and mount it with
> > defaults. Then I ran your fio test on that 5 times in a row:
> >
> > write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> > write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> > write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> > write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> > write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
> >
> > The first run was 26k IOPS, the rest were at 35k IOPS as they
> > overwrite the same blocks in the image file. IOWs, first pass at 75%
> > of device capability, the rest at > 98% of the host measured device
> > capability. All tests reported the full io depth was being used in
> > the guest:
> >
> > IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >
> > The guest OS measured about 30% CPU usage for a single fio run at
> > 35k IOPS:
> >
> > real    0m22.648s
> > user    0m1.678s
> > sys     0m8.175s
> >
> > However, the QEMU process on the host required 4 entire CPUs to
> > sustain this IO load, roughly 50/50 user/system time. IOWs, a large
> > amount of the CPU overhead on such workloads is on the host side in
> > QEMU, not the guest.
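
A quick way to see that host-side split per thread, assuming pidstat from
the sysstat package is available:

  pidstat -t -u -p $(pidof qemu-system-x86_64) 1   # per-thread %usr/%system every second

That breaks the CPU time down across the vcpu and IO worker threads, so
you can see which qemu threads are burning the time.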
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


