On Sat, Feb 13, 2016 at 04:56:18PM -0500, Sanidhya Kashyap wrote:
> We did quite extensive performance evaluation on file systems,
> including ext4, XFS, btrfs, F2FS, and tmpfs, in terms of multi-core
> scalability using micro-benchmarks and application benchmarks.
>
> Your workload, i.e., multiple tasks are concurrently overwriting a
> single file, whose file system blocks are previously written, is quite
> similar to one of our benchmark.
>
> Based on our analysis, none of the file systems supports concurrent
> update of a file even when each task accesses different region of
> a file. That is because all file systems hold a lock for an entire
> file. Only one exception is the concurrent direct I/O of XFS.
>
> I think that local file systems need to support the range-based
> locking, which is common in parallel file systems, to improve
> concurrency level of I/O operations, specifically write operations.

Yes, we've spent a fair bit of time talking about that (pretty sure
it was a topic of discussion at last year's LSFMM developer
conference), but it really isn't a simple thing to add to the VFS or
most filesystems.

> If you can split a single file image into multiple files, you can
> increase the concurrency level of write operations a little bit.

At the cost of increased storage stack complexity. Most people don't
need extreme performance in their VMs, so a single file is generally
adequate on XFS.

> For more details, please take a look at our paper draft:
> https://sslab.gtisc.gatech.edu/assets/papers/2016/min:fxmark-draft.pdf
>
> Though our paper is in review, I think it is okay to share since
> the review process is single-blinded. You can find our analysis on
> overwrite operations at Section 5.1.2. Scalability behavior of current
> file systems are summarized at Section 7.

It's a nice summary of the issues, but there are no surprises in the
paper. i.e. it's all things we already know about and, in some cases,
are already looking at solutions for (e.g. per-node/per-cpu lists to
address inode_sb_list_lock contention, potential for converting
i_mutex to an rwsem to allow shared read-only access to directories,
etc). The only thing that surprised me is how badly rwsems degrade
when contended on large machines.

I've done local benchmarks on 16p machines with single file direct
IO, and when pushed to being CPU bound I've measured over 2 million
single sector random read IOPS, 1.5 million random overwrite IOPS,
and ~800k random write w/ allocate IOPS. IOWs, the IO scalability is
there when the lock doesn't degrade (which really is a core OS issue,
not so much a fs issue).

A couple of things I noticed in the summary:

"High locality can cause performance collapse"

You imply filesystems try to maintain high locality to improve cache
hit rates. Filesystems try to maintain locality in disk allocation to
minimise seek time for physical IO on related structures, to maintain
good performance when /cache misses occur/. IOWs, the scalability of
the in-memory caches is completely unrelated to the "high locality"
optimisations that filesystems make...

"because XFS holds a per-device lock instead of a per-file lock in an
O_DIRECT mode"

That's a new one - I've never heard anyone say that about XFS (and
I've heard a lot of wacky things about XFS!). It's much simpler than
that - we don't use the i_mutex in O_DIRECT mode, and instead use
shared read locking on the per-inode IO lock for all IO operations.
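FWIW, to make that concrete, here's a minimal userspace sketch of the
overwrite workload being discussed - several threads issuing
block-aligned O_DIRECT pwrite()s to disjoint regions of a single,
fully written file. As described above, on XFS those writes take the
per-inode IO lock shared and can run concurrently, while buffered
writes serialise on the per-file lock. This is not from the benchmark
suite; the file name, thread count and sizes are arbitrary
illustration values.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS        8
#define REGION_SIZE     (4 * 1024 * 1024)  /* bytes per thread */
#define BLOCK_SIZE      4096               /* must match the device's logical block size */

static int fd;

/* Each thread overwrites its own disjoint region of the shared file. */
static void *overwrite_region(void *arg)
{
        long id = (long)arg;
        off_t start = (off_t)id * REGION_SIZE;
        void *buf;

        /* O_DIRECT requires block-aligned buffers, offsets and lengths. */
        if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE))
                return NULL;
        memset(buf, 'x', BLOCK_SIZE);

        for (off_t off = start; off < start + REGION_SIZE; off += BLOCK_SIZE) {
                if (pwrite(fd, buf, BLOCK_SIZE, off) != BLOCK_SIZE) {
                        perror("pwrite");
                        break;
                }
        }
        free(buf);
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];

        /*
         * The file must already exist with all of its blocks written,
         * so every IO below is a pure overwrite of allocated blocks.
         */
        fd = open("testfile", O_WRONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        for (long i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, overwrite_region, (void *)i);
        for (long i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);

        close(fd);
        return 0;
}

Build it with something like "gcc -O2 overwrite.c -o overwrite
-lpthread", and write the test file out in full first (e.g. with dd)
so that all of the IO really is overwrite of previously written
blocks.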
"Overwriting is as expensive as appending" You shouldn't make generalisations that don't apply generally to the the filesystems you tested. :P FWIW, log->l_icloglock contention in XFS implies the application has an excessive fsync problem - that's the only way that lock can see any sort of significant concurrent access. It's probably just the case that the old-school algorithm the code uses to wait for journal IO completion was never expected to scale to operations on storage that can sustain millions of IOPS. I'll add it to the list of known journalling scalabiity bottlenecks in XFS - there's a lot more issues than your testing has told you about.... :/ Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html