We did a fairly extensive performance evaluation of file systems (ext4, XFS, btrfs, F2FS, and tmpfs) in terms of multi-core scalability, using micro-benchmarks and application benchmarks. Your workload, i.e., multiple tasks concurrently overwriting a single file whose blocks have already been written, is quite similar to one of our benchmarks.

Based on our analysis, none of these file systems supports concurrent updates to a file, even when each task writes to a different region of the file, because they all take a lock on the entire file. The one exception is concurrent direct I/O on XFS. I think local file systems need to support range-based locking, which is common in parallel file systems, to improve the concurrency of I/O operations, write operations in particular. If you can split a single file image into multiple files, you can increase the concurrency of write operations somewhat.

For more details, please take a look at our paper draft:

  https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf

The paper is still under review, but I think it is okay to share since the review process is single-blind. You can find our analysis of overwrite operations in Section 5.1.2; the scalability behavior of current file systems is summarized in Section 7.
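If you want to check the effect of splitting on your own setup, here is a rough fio sketch along those lines (the paths, sizes, and job names are placeholders I made up, not taken from our benchmark suite): four jobs overwrite disjoint 256 MiB regions of one shared file, then four jobs each overwrite a private file. Prewrite the files once (or run the job twice) so you are measuring overwrites rather than allocation.

  ; contended case: 4 writers share one file, so they also share its inode lock
  [global]
  ioengine=libaio
  direct=1
  rw=randwrite
  bs=4k
  iodepth=32
  runtime=30
  time_based

  [shared-file]
  filename=/mnt/test/single.img
  numjobs=4
  size=256m
  offset_increment=256m

  ; split case: each writer gets its own file, so no inode lock is shared
  ; (create /mnt/test/split first)
  [split-files]
  stonewall
  directory=/mnt/test/split
  numjobs=4
  size=256m

If our results carry over, the split case should scale with the number of jobs on ext4 while the shared-file case stays roughly flat; with direct I/O on XFS the two should be much closer.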
On Fri, Feb 12, 2016 at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
>> > All of this being said, what are you trying to do? If you are happy
>> > using LVM, feel free to use it. If there are specific features that
>> > you want out of the file system, it's best that you explicitly
>> > identify what you want, and so we can minimize the cost of the
>> > features of what you want.
>>
>> We are trying to decide whether to use a filesystem or LVM for VM
>> storage. It's not that we are happy with LVM - while it performs
>> better, there are limitations on the LVM side, especially when it
>> comes to manageability (for example, certain features in OpenStack
>> only work if the VM is file-based).
>>
>> So, in short, if we could make the filesystem perform better we would
>> rather use a filesystem than LVM (and we don't really have any special
>> requirements in terms of filesystem features).
>>
>> And in order for us to make a good decision I wanted to ask the
>> community whether our observations and resulting numbers make sense.
>
> For ext4, this is what you are going to get.
>
> How about you try XFS? After all, concurrent direct IO writes are
> something it is rather good at.
>
> i.e. use XFS in both your host and guest. Use raw image files on the
> host, and to make things roughly even with LVM you'll want to
> preallocate them. If you don't want to preallocate them (i.e. sparse
> image files), set them up with an extent size hint of at least 1MB so
> that it limits fragmentation of the image file. Then configure qemu
> to use cache=none for its IO to the image file.
>
> On the first write pass to the image file (in either case), you
> should see ~70-80% of the native underlying device performance,
> because there is some overhead in either allocation (sparse image
> file) or unwritten extent conversion (preallocated image file).
> This, of course, assumes you are not CPU limited in the QEMU
> process by the additional CPU overhead of file block mapping in the
> host filesystem vs raw block device IO.
>
> On the second write pass you should see 98-99% of the native
> underlying device performance (again with the assumption that CPU
> overhead of the host filesystem isn't a limiting factor).
>
> As an example, I have a block device that can sustain just under 36k
> random 4k write IOPS on my host. I have an XFS filesystem (default
> configs) on that 400GB block device. I created a sparse 500TB image
> file using:
>
> # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
>
> And push it into a 16p/16GB RAM guest via:
>
> -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
>
> and in the guest run mkfs.xfs with defaults and mount it with
> defaults. Then I ran your fio test on that 5 times in a row:
>
> write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
>
> The first run was 26k IOPS, the rest were at 35k IOPS as they
> overwrite the same blocks in the image file. IOWs, first pass at 75%
> of device capability, the rest at >98% of the host measured device
> capability. All tests reported the full io depth was being used in
> the guest:
>
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>
> The guest OS measured about 30% CPU usage for a single fio run at
> 35k IOPS:
>
> real    0m22.648s
> user    0m1.678s
> sys     0m8.175s
>
> However, the QEMU process on the host required 4 entire CPUs to
> sustain this IO load, roughly 50/50 user/system time. IOWs, a large
> amount of the CPU overhead on such workloads is on the host side in
> QEMU, not the guest.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
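In case it is useful, here is a small sketch of the preallocated (non-sparse) variant Dave mentions but does not show commands for. The size and path are placeholders; note that on XFS the preallocated range consists of unwritten extents, so the first write pass still pays the unwritten-extent conversion cost he describes:

  # preallocate a 100G raw image instead of the sparse truncate/extsize setup
  # xfs_io -f -c "falloc 0 100g" /mnt/fast-ssd/vm-100g.img

  # same qemu drive line as above, raw format with cache=none
  # -drive file=/mnt/fast-ssd/vm-100g.img,if=virtio,cache=none,format=raw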