We did a fairly extensive performance evaluation of file systems (ext4, XFS, btrfs, F2FS, and tmpfs) in terms of multi-core scalability, using micro-benchmarks and application benchmarks. Your workload, i.e., multiple tasks concurrently overwriting a single file whose blocks have already been written, is quite similar to one of our benchmarks.

Based on our analysis, none of these file systems supports concurrent updates to a file, even when each task writes to a different region of the file, because they all take a lock on the entire file. The one exception is concurrent direct I/O on XFS. I think local file systems need to support range-based locking, which is common in parallel file systems, to improve the concurrency of I/O operations, write operations in particular. If you can split a single file image into multiple files, you can increase the concurrency of write operations somewhat.

For more details, please take a look at our paper draft:

  https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf

The paper is still under review, but I think it is okay to share since the review process is single-blind. You can find our analysis of overwrite operations in Section 5.1.2; the scalability behavior of current file systems is summarized in Section 7.
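If you want to check the effect of splitting on your own setup, here is a rough fio sketch along those lines (the paths, sizes, and job names are placeholders I made up, not taken from our benchmark suite): four jobs overwrite disjoint 256 MiB regions of one shared file, then four jobs each overwrite a private file. Prewrite the files once (or run the job twice) so you are measuring overwrites rather than allocation.

  ; contended case: 4 writers share one file, so they also share its inode lock
  [global]
  ioengine=libaio
  direct=1
  rw=randwrite
  bs=4k
  iodepth=32
  runtime=30
  time_based

  [shared-file]
  filename=/mnt/test/single.img
  numjobs=4
  size=256m
  offset_increment=256m

  ; split case: each writer gets its own file, so no inode lock is shared
  ; (create /mnt/test/split first)
  [split-files]
  stonewall
  directory=/mnt/test/split
  numjobs=4
  size=256m

If our results carry over, the split case should scale with the number of jobs on ext4 while the shared-file case stays roughly flat; with direct I/O on XFS the two should be much closer.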
On Fri, Feb 12, 2016 at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
>> > All of this being said, what are you trying to do? If you are happy
>> > using LVM, feel free to use it. If there are specific features that
>> > you want out of the file system, it's best that you explicitly
>> > identify what you want, and so we can minimize the cost of the
>> > features of what you want.
>>
>> We are trying to decide whether to use a filesystem or LVM for VM
>> storage. It's not that we are happy with LVM - while it performs
>> better, there are limitations on the LVM side, especially when it
>> comes to manageability (for example, certain features in OpenStack
>> only work if the VM is file-based).
>>
>> So, in short, if we could make the filesystem perform better we would
>> rather use a filesystem than LVM (and we don't really have any special
>> requirements in terms of filesystem features).
>>
>> And in order for us to make a good decision I wanted to ask the
>> community whether our observations and resulting numbers make sense.
>
> For ext4, this is what you are going to get.
>
> How about you try XFS? After all, concurrent direct IO writes are
> something it is rather good at.
>
> i.e. use XFS in both your host and guest. Use raw image files on the
> host, and to make things roughly even with LVM you'll want to
> preallocate them. If you don't want to preallocate them (i.e. sparse
> image files), set them up with an extent size hint of at least 1MB so
> that it limits fragmentation of the image file. Then configure qemu
> to use cache=none for its IO to the image file.
>
> On the first write pass to the image file (in either case), you
> should see ~70-80% of the native underlying device performance,
> because there is some overhead in either allocation (sparse image
> file) or unwritten extent conversion (preallocated image file).
> This, of course, assumes you are not CPU limited in the QEMU
> process by the additional CPU overhead of file block mapping in the
> host filesystem vs raw block device IO.
>
> On the second write pass you should see 98-99% of the native
> underlying device performance (again with the assumption that CPU
> overhead of the host filesystem isn't a limiting factor).
>
> As an example, I have a block device that can sustain just under 36k
> random 4k write IOPS on my host. I have an XFS filesystem (default
> configs) on that 400GB block device. I created a sparse 500TB image
> file using:
>
> # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
>
> And push it into a 16p/16GB RAM guest via:
>
> -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
>
> and in the guest run mkfs.xfs with defaults and mount it with
> defaults. Then I ran your fio test on that 5 times in a row:
>
> write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
>
> The first run was 26k IOPS, the rest were at 35k IOPS as they
> overwrite the same blocks in the image file. IOWs, first pass at 75%
> of device capability, the rest at >98% of the host measured device
> capability. All tests reported the full io depth was being used in
> the guest:
>
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>
> The guest OS measured about 30% CPU usage for a single fio run at
> 35k IOPS:
>
> real    0m22.648s
> user    0m1.678s
> sys     0m8.175s
>
> However, the QEMU process on the host required 4 entire CPUs to
> sustain this IO load, roughly 50/50 user/system time. IOWs, a large
> amount of the CPU overhead on such workloads is on the host side in
> QEMU, not the guest.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
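In case it is useful, here is a small sketch of the preallocated (non-sparse) variant Dave mentions but does not show commands for. The size and path are placeholders; note that on XFS the preallocated range consists of unwritten extents, so the first write pass still pays the unwritten-extent conversion cost he describes:

  # preallocate a 100G raw image instead of the sparse truncate/extsize setup
  # xfs_io -f -c "falloc 0 100g" /mnt/fast-ssd/vm-100g.img

  # same qemu drive line as above, raw format with cache=none
  # -drive file=/mnt/fast-ssd/vm-100g.img,if=virtio,cache=none,format=raw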