Hi Sanidhya,

This is a very interesting paper to me. Thank you for sharing it.

Taking a quick look, I have a question about F2FS: the paper concludes
that F2FS serializes every write, but I don't agree with that.
cp_rwsem, an rw_semaphore, is taken for writing only to stop all
filesystem operations while a checkpoint is performed. Outside of that
case, every operation, including writes, just takes the semaphore for
reading, so there should be no serialization.
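To make that concrete, here is a minimal userspace sketch of the
pattern. A pthread rwlock stands in for the kernel rw_semaphore, and
the names are only meant to mirror the f2fs helpers, not the actual
code:

  #include <pthread.h>
  #include <stdio.h>

  /* Stand-in for sbi->cp_rwsem: taken shared by normal operations and
   * exclusively only while a checkpoint is being written. */
  static pthread_rwlock_t cp_rwsem = PTHREAD_RWLOCK_INITIALIZER;

  /* Roughly the f2fs_lock_op()/f2fs_unlock_op() pattern: every write
   * path takes the semaphore for reading, so writers do not serialize
   * one another. */
  static void lock_op(void)   { pthread_rwlock_rdlock(&cp_rwsem); }
  static void unlock_op(void) { pthread_rwlock_unlock(&cp_rwsem); }

  /* The checkpoint path is the only exclusive holder. */
  static void write_checkpoint(void)
  {
          pthread_rwlock_wrlock(&cp_rwsem);  /* no operations run in here */
          /* ... flush dirty node/meta pages, commit the checkpoint ... */
          pthread_rwlock_unlock(&cp_rwsem);
  }

  static void overwrite_block(int blk)
  {
          lock_op();
          printf("overwriting block %d (other writers may hold it too)\n", blk);
          unlock_op();
  }

  int main(void)
  {
          overwrite_block(1);
          overwrite_block(2);
          write_checkpoint();
          return 0;
  }

Only write_checkpoint() takes the lock exclusively, so two tasks
overwriting different blocks never serialize on cp_rwsem itself.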
It also seems there is no sync/fsync contention in these workloads.

Thanks,

On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did a quite extensive performance evaluation of file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability, using micro-benchmarks and application benchmarks.
>
> Your workload, i.e., multiple tasks concurrently overwriting a single
> file whose file system blocks were previously written, is quite
> similar to one of our benchmarks.
>
> Based on our analysis, none of the file systems supports concurrent
> updates of a file, even when each task accesses a different region of
> the file. That is because all of them hold a lock on the entire file.
> The only exception is concurrent direct I/O in XFS.
>
> I think that local file systems need to support range-based locking,
> which is common in parallel file systems, to improve the concurrency
> of I/O operations, specifically write operations.
>
> If you can split a single file image into multiple files, you can
> increase the concurrency of write operations a little bit.
>
> For more details, please take a look at our paper draft:
> https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
>
> Though our paper is under review, I think it is okay to share it since
> the review process is single-blind. You can find our analysis of
> overwrite operations in Section 5.1.2. The scalability behavior of
> current file systems is summarized in Section 7.
>
> On Fri, Feb 12, 2016 at 9:15 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Fri, Feb 12, 2016 at 06:38:47PM +0100, Premysl Kouril wrote:
> >> > All of this being said, what are you trying to do? If you are happy
> >> > using LVM, feel free to use it. If there are specific features that
> >> > you want out of the file system, it's best that you explicitly
> >> > identify what you want, so we can minimize the cost of the
> >> > features you want.
> >>
> >> We are trying to decide whether to use a filesystem or LVM for VM
> >> storage. It's not that we are happy with LVM; while it performs
> >> better, there are limitations on the LVM side, especially when it
> >> comes to manageability (for example, certain features in OpenStack
> >> only work if the VM is file-based).
> >>
> >> So, in short, if we could make the filesystem perform better, we
> >> would rather use a filesystem than LVM (and we don't really have any
> >> special requirements in terms of filesystem features).
> >>
> >> And in order for us to make a good decision, I wanted to ask the
> >> community whether our observations and resulting numbers make sense.
> >
> > For ext4, this is what you are going to get.
> >
> > How about you try XFS? After all, concurrent direct IO writes are
> > something it is rather good at.
> >
> > i.e. use XFS in both your host and guest. Use raw image files on the
> > host, and to make things roughly even with LVM you'll want to
> > preallocate them. If you don't want to preallocate them (i.e. sparse
> > image files), set them up with an extent size hint of at least 1MB so
> > that it limits fragmentation of the image file. Then configure qemu
> > to use cache=none for its IO to the image file.
> >
> > On the first write pass to the image file (in either case), you
> > should see ~70-80% of the native underlying device performance,
> > because there is some overhead in either allocation (sparse image
> > file) or unwritten extent conversion (preallocated image file).
> > This, of course, assumes you are not CPU limited in the QEMU
> > process by the additional CPU overhead of file block mapping in the
> > host filesystem vs raw block device IO.
> >
> > On the second write pass you should see 98-99% of the native
> > underlying device performance (again with the assumption that CPU
> > overhead of the host filesystem isn't a limiting factor).
> >
> > As an example, I have a block device that can sustain just under 36k
> > random 4k write IOPS on my host. I have an XFS filesystem (default
> > configs) on that 400GB block device. I created a sparse 500TB image
> > file using:
> >
> > # xfs_io -f -c "extsize 1m" -c "truncate 500t" vm-500t.img
> >
> > and pushed it into a 16p/16GB RAM guest via:
> >
> > -drive file=/mnt/fast-ssd/vm-500t.img,if=virtio,cache=none,format=raw
> >
> > In the guest I ran mkfs.xfs with defaults and mounted it with
> > defaults. Then I ran your fio test on that 5 times in a row:
> >
> > write: io=3072.0MB, bw=106393KB/s, iops=26598, runt= 29567msec
> > write: io=3072.0MB, bw=141508KB/s, iops=35377, runt= 22230msec
> > write: io=3072.0MB, bw=141254KB/s, iops=35313, runt= 22270msec
> > write: io=3072.0MB, bw=141115KB/s, iops=35278, runt= 22292msec
> > write: io=3072.0MB, bw=141534KB/s, iops=35383, runt= 22226msec
> >
> > The first run was 26k IOPS; the rest were at 35k IOPS as they
> > overwrote the same blocks in the image file. IOWs, the first pass ran
> > at 75% of device capability, the rest at >98% of the host-measured
> > device capability. All tests reported that the full IO depth was
> > being used in the guest:
> >
> > IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
> >
> > The guest OS measured about 30% CPU usage for a single fio run at
> > 35k IOPS:
> >
> > real 0m22.648s
> > user 0m1.678s
> > sys 0m8.175s
> >
> > However, the QEMU process on the host required 4 entire CPUs to
> > sustain this IO load, roughly 50/50 user/system time. IOWs, a large
> > amount of the CPU overhead on such workloads is on the host side in
> > QEMU, not in the guest.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
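One more note on the range-based locking Sanidhya suggests above, in
case a sketch helps show the idea. This is a toy, list-based userspace
illustration I wrote; a real file system would use an interval tree and
integrate with the VFS and inode locking. The point is only that two
writers touching disjoint byte ranges of the same file can proceed in
parallel, which a single per-file lock forbids:

  #include <pthread.h>
  #include <stdbool.h>
  #include <stdio.h>

  /* One held byte range [start, end). */
  struct range {
          long start, end;
          struct range *next;
  };

  struct range_lock {
          pthread_mutex_t mu;
          pthread_cond_t  cv;
          struct range   *held;
  };

  #define RANGE_LOCK_INIT \
          { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

  static bool overlaps(const struct range *r, long start, long end)
  {
          return r->start < end && start < r->end;
  }

  /* Block until no held range overlaps [start, end), then record it. */
  static void range_lock(struct range_lock *rl, struct range *r,
                         long start, long end)
  {
          pthread_mutex_lock(&rl->mu);
          for (;;) {
                  const struct range *it;
                  bool conflict = false;

                  for (it = rl->held; it; it = it->next)
                          if (overlaps(it, start, end)) {
                                  conflict = true;
                                  break;
                          }
                  if (!conflict)
                          break;
                  pthread_cond_wait(&rl->cv, &rl->mu);
          }
          r->start = start;
          r->end = end;
          r->next = rl->held;
          rl->held = r;
          pthread_mutex_unlock(&rl->mu);
  }

  static void range_unlock(struct range_lock *rl, struct range *r)
  {
          struct range **pp;

          pthread_mutex_lock(&rl->mu);
          for (pp = &rl->held; *pp; pp = &(*pp)->next)
                  if (*pp == r) {
                          *pp = r->next;
                          break;
                  }
          pthread_cond_broadcast(&rl->cv);
          pthread_mutex_unlock(&rl->mu);
  }

  int main(void)
  {
          static struct range_lock rl = RANGE_LOCK_INIT;
          struct range a, b;

          /* Two disjoint 4k writes hold their ranges at the same time. */
          range_lock(&rl, &a, 0, 4096);
          range_lock(&rl, &b, 4096, 8192);   /* does not block */
          printf("both ranges held concurrently\n");
          range_unlock(&rl, &b);
          range_unlock(&rl, &a);
          return 0;
  }

range_lock() only blocks when the requested range overlaps one that is
already held, so the whole-file serialization described above goes away
for writes to disjoint regions.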