On Thu, Jun 13, 2019 at 05:21:12PM -0400, Kent Overstreet wrote:
> On Thu, Jun 13, 2019 at 03:13:40PM -0600, Andreas Dilger wrote:
> > There are definitely workloads that require multiple threads doing non-overlapping
> > writes to a single file in HPC. This is becoming an increasingly common problem
> > as the number of cores on a single client increase, since there is typically one
> > thread per core trying to write to a shared file. Using multiple files (one per
> > core) is possible, but that has file management issues for users when there are a
> > million cores running on the same job/file (obviously not on the same client node)
> > dumping data every hour.
>
> Mixed buffered and O_DIRECT though? That profile looks like just buffered IO to
> me.
>
> > We were just looking at this exact problem last week, and most of the threads are
> > spinning in grab_cache_page_nowait->add_to_page_cache_lru() and set_page_dirty()
> > when writing at 1.9GB/s when they could be writing at 5.8GB/s (when threads are
> > writing O_DIRECT instead of buffered). Flame graph is attached for 16-thread case,
> > but high-end systems today easily have 2-4x that many cores.
>
> Yeah I've been spending some time on buffered IO performance too - 4k page
> overhead is a killer.
>
> bcachefs has a buffered write path that looks up multiple pages at a time and
> locks them, and then copies the data to all the pages at once (I stole the idea
> from btrfs). It was a very significant performance increase.

Careful with that - locking multiple pages is also a deadlock vector
that triggers unexpectedly when something conspires to lock pages in
non-ascending order. e.g.

64081362e8ff mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock

The fs/iomap.c code avoids this problem by mapping the IO first,
then iterating pages one at a time until the mapping is consumed,
then it gets another mapping. It also avoids needing to put a page
array on stack....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
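
For illustration only, here is a rough userspace sketch of the "map first, then one page at a time" pattern described above. It is not the fs/iomap.c code and does not use any kernel API: every name in it (sketch_page, sketch_file, sketch_map_extent, sketch_buffered_write, and the fixed 2-page extent size) is a hypothetical stand-in chosen for the example. The point it tries to show is that the writer holds at most one page lock at any moment, so there is no lock ordering between pages to get wrong, and no page array is needed on the stack.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All sizes and names below are assumptions made up for this sketch. */
#define SKETCH_PAGE_SIZE   4096u
#define SKETCH_NR_PAGES    8u
#define SKETCH_EXTENT_SIZE (2u * SKETCH_PAGE_SIZE)
#define SKETCH_FILE_SIZE   (SKETCH_NR_PAGES * SKETCH_PAGE_SIZE)

/* Toy stand-in for a page cache page; not a kernel structure. */
struct sketch_page {
	int locked;
	unsigned char data[SKETCH_PAGE_SIZE];
};

struct sketch_file {
	struct sketch_page pages[SKETCH_NR_PAGES];
	size_t cur_locked;	/* pages locked right now */
	size_t max_locked;	/* high-water mark of cur_locked */
};

static void sketch_lock_page(struct sketch_file *f, size_t idx)
{
	assert(idx < SKETCH_NR_PAGES);
	assert(!f->pages[idx].locked);
	f->pages[idx].locked = 1;
	if (++f->cur_locked > f->max_locked)
		f->max_locked = f->cur_locked;
}

static void sketch_unlock_page(struct sketch_file *f, size_t idx)
{
	assert(f->pages[idx].locked);
	f->pages[idx].locked = 0;
	f->cur_locked--;
}

/*
 * Hypothetical mapping step: report how many bytes starting at pos fall
 * in one contiguous extent, capped at len. Returns 0 past end of file.
 * The file size is a whole number of extents, so no extra clamp is needed.
 */
static size_t sketch_map_extent(size_t pos, size_t len)
{
	size_t end;

	if (pos >= SKETCH_FILE_SIZE)
		return 0;
	end = (pos / SKETCH_EXTENT_SIZE + 1) * SKETCH_EXTENT_SIZE;
	return len < end - pos ? len : end - pos;
}

/* Map a range, consume it one page at a time, then map the next range. */
static size_t sketch_buffered_write(struct sketch_file *f, size_t pos,
				    const unsigned char *buf, size_t len)
{
	size_t written = 0;

	while (written < len) {
		size_t mapped = sketch_map_extent(pos + written, len - written);
		size_t done = 0;

		if (mapped == 0)
			break;	/* nothing left to map: short write */

		while (done < mapped) {
			size_t off = pos + written + done;
			size_t idx = off / SKETCH_PAGE_SIZE;
			size_t in_page = off % SKETCH_PAGE_SIZE;
			size_t chunk = SKETCH_PAGE_SIZE - in_page;

			if (chunk > mapped - done)
				chunk = mapped - done;

			/* Exactly one page is locked across the copy. */
			sketch_lock_page(f, idx);
			memcpy(f->pages[idx].data + in_page,
			       buf + written + done, chunk);
			sketch_unlock_page(f, idx);

			done += chunk;
		}
		written += mapped;
	}
	return written;
}
```

Calling sketch_buffered_write() on a zeroed struct sketch_file with a multi-page buffer leaves max_locked at 1. The trade-off is the one raised earlier in the thread: a batched path that locks several pages at once can copy more per lock round trip, but then has to guarantee ascending lock order everywhere those pages can be locked.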