Hi Sanidhya,

This is a very interesting paper to me. Thank you for sharing it.

Taking a quick look, I have a question about F2FS: the paper concludes
that F2FS serializes every write, but I don't agree with that.
cp_rwsem, an rw_semaphore, is taken for writing only to stop all
filesystem operations while a checkpoint is performed. Outside of that
case, every operation, including writes, just takes the semaphore for
reading, so there should be no serialization.
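To make that concrete, here is a minimal userspace sketch of the
pattern. A pthread rwlock stands in for the kernel rw_semaphore, and
the names are only meant to mirror the f2fs helpers, not the actual
code:

  #include <pthread.h>
  #include <stdio.h>

  /* Stand-in for sbi->cp_rwsem: taken shared by normal operations and
   * exclusively only while a checkpoint is being written. */
  static pthread_rwlock_t cp_rwsem = PTHREAD_RWLOCK_INITIALIZER;

  /* Roughly the f2fs_lock_op()/f2fs_unlock_op() pattern: every write
   * path takes the semaphore for reading, so writers do not serialize
   * one another. */
  static void lock_op(void)   { pthread_rwlock_rdlock(&cp_rwsem); }
  static void unlock_op(void) { pthread_rwlock_unlock(&cp_rwsem); }

  /* The checkpoint path is the only exclusive holder. */
  static void write_checkpoint(void)
  {
          pthread_rwlock_wrlock(&cp_rwsem);  /* no operations run in here */
          /* ... flush dirty node/meta pages, commit the checkpoint ... */
          pthread_rwlock_unlock(&cp_rwsem);
  }

  static void overwrite_block(int blk)
  {
          lock_op();
          printf("overwriting block %d (other writers may hold it too)\n", blk);
          unlock_op();
  }

  int main(void)
  {
          overwrite_block(1);
          overwrite_block(2);
          write_checkpoint();
          return 0;
  }

Only write_checkpoint() takes the lock exclusively, so two tasks
overwriting different blocks never serialize on cp_rwsem itself.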
It also seems there is no sync/fsync contention in these workloads.

Thanks,

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did a quite extensive performance evaluation of file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability, using micro-benchmarks and application benchmarks.
>
> Your workload, i.e., multiple tasks concurrently overwriting a single
> file whose file system blocks were previously written, is quite
> similar to one of our benchmarks.
>
> Based on our analysis, none of the file systems supports concurrent
> updates of a file, even when each task accesses a different region of
> the file. That is because all of them hold a lock on the entire file.
> The only exception is concurrent direct I/O in XFS.
>
> I think that local file systems need to support range-based locking,
> which is common in parallel file systems, to improve the concurrency
> of I/O operations, specifically write operations.
>
> If you can split a single file image into multiple files, you can
> increase the concurrency of write operations a little bit.
>
> For more details, please take a look at our paper draft:
> https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
>
> Though our paper is under review, I think it is okay to share it since
> the review process is single-blind. You can find our analysis of
> overwrite operations in Section 5.1.2. The scalability behavior of
> current file systems is summarized in Section 7.
>
> On Fri, Feb 12, 2016 at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> >> > All of this being said, what are you trying to do? If you are happy
> >> > using LVM, feel free to use it. If there are specific features that
> >> > you want out of the file system, it's best that you explicitly
> >> > identify what you want, so we can minimize the cost of the
> >> > features you want.
> >>
> >> We are trying to decide whether to use a filesystem or LVM for VM
> >> storage. It's not that we are happy with LVM; while it performs
> >> better, there are limitations on the LVM side, especially when it
> >> comes to manageability (for example, certain features in OpenStack
> >> only work if the VM is file-based).
> >>
> >> So, in short, if we could make the filesystem perform better, we
> >> would rather use a filesystem than LVM (and we don't really have any
> >> special requirements in terms of filesystem features).
> >>
> >> And in order for us to make a good decision, I wanted to ask the
> >> community whether our observations and resulting numbers make sense.
> >
> > For ext4, this is what you are going to get.
> >
> > How about you try XFS? After all, concurrent direct IO writes are
> > something it is rather good at.
> >
> > i.e. use XFS in both your host and guest. Use raw image files on the
> > host, and to make things roughly even with LVM you'll want to
> > preallocate them. If you don't want to preallocate them (i.e. sparse
> > image files), set them up with an extent size hint of at least 1MB so
> > that it limits fragmentation of the image file. Then configure qemu
> > to use cache=none for its IO to the image file.
> >
> > On the first write pass to the image file (in either case), you
> > should see ~70-80% of the native underlying device performance,
> > because there is some overhead in either allocation (sparse image
> > file) or unwritten extent conversion (preallocated image file).
> > This, of course, assumes you are not CPU limited in the QEMU
> > process by the additional CPU overhead of file block mapping in the
> > host filesystem vs raw block device IO.
> >
> > On the second write pass you should see 98-99% of the native
> > underlying device performance (again with the assumption that CPU
> > overhead of the host filesystem isn't a limiting factor).
> >
> > As an example, I have a block device that can sustain just under 36k
> > random 4k write IOPS on my host. I have an XFS filesystem (default
> > configs) on that 400GB block device. I created a sparse 500TB image
> > file using:
> >
> > # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
> >
> > and pushed it into a 16p/16GB RAM guest via:
> >
> > -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
> >
> > In the guest I ran mkfs.xfs with defaults and mounted it with
> > defaults. Then I ran your fio test on that 5 times in a row:
> >
> > write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> > write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> > write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> > write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> > write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
> >
> > The first run was 26k IOPS; the rest were at 35k IOPS as they
> > overwrote the same blocks in the image file. IOWs, the first pass ran
> > at 75% of device capability, the rest at >98% of the host-measured
> > device capability. All tests reported that the full IO depth was
> > being used in the guest:
> >
> > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >
> > The guest OS measured about 30% CPU usage for a single fio run at
> > 35k IOPS:
> >
> > real 0m22.648s
> > user 0m1.678s
> > sys 0m8.175s
> >
> > However, the QEMU process on the host required 4 entire CPUs to
> > sustain this IO load, roughly 50/50 user/system time. IOWs, a large
> > amount of the CPU overhead on such workloads is on the host side in
> > QEMU, not in the guest.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
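One more note on the range-based locking Sanidhya suggests above, in
case a sketch helps show the idea. This is a toy, list-based userspace
illustration I wrote; a real file system would use an interval tree and
integrate with the VFS and inode locking. The point is only that two
writers touching disjoint byte ranges of the same file can proceed in
parallel, which a single per-file lock forbids:

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* One held byte range [start, end). */
  struct range {
          long start, end;
          struct range *next;
  };

  struct range_lock {
          pthread_mutex_t mu;
          pthread_cond_t  cv;
          struct range   *held;
  };

  #define RANGE_LOCK_INIT \
          { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

  static bool overlaps(const struct range *r, long start, long end)
  {
          return r->start < end && start < r->end;
  }

  /* Block until no held range overlaps [start, end), then record it. */
  static void range_lock(struct range_lock *rl, struct range *r,
                         long start, long end)
  {
          pthread_mutex_lock(&rl->mu);
          for (;;) {
                  const struct range *it;
                  bool conflict = false;

                  for (it = rl->held; it; it = it->next)
                          if (overlaps(it, start, end)) {
                                  conflict = true;
                                  break;
                          }
                  if (!conflict)
                          break;
                  pthread_cond_wait(&rl->cv, &rl->mu);
          }
          r->start = start;
          r->end = end;
          r->next = rl->held;
          rl->held = r;
          pthread_mutex_unlock(&rl->mu);
  }

  static void range_unlock(struct range_lock *rl, struct range *r)
  {
          struct range **pp;

          pthread_mutex_lock(&rl->mu);
          for (pp = &rl->held; *pp; pp = &(*pp)->next)
                  if (*pp == r) {
                          *pp = r->next;
                          break;
                  }
          pthread_cond_broadcast(&rl->cv);
          pthread_mutex_unlock(&rl->mu);
  }

  int main(void)
  {
          static struct range_lock rl = RANGE_LOCK_INIT;
          struct range a, b;

          /* Two disjoint 4k writes hold their ranges at the same time. */
          range_lock(&rl, &a, 0, 4096);
          range_lock(&rl, &b, 4096, 8192);   /* does not block */
          printf("both ranges held concurrently\n");
          range_unlock(&rl, &b);
          range_unlock(&rl, &a);
          return 0;
  }

range_lock() only blocks when the requested range overlaps one that is
already held, so the whole-file serialization described above goes away
for writes to disjoint regions.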