On Fri 17-06-22 17:48:08, Amir Goldstein wrote: > On Tue, Apr 9, 2019 at 11:26 AM Jan Kara <jack@xxxxxxx> wrote: > > > > On Mon 08-04-19 20:41:09, Amir Goldstein wrote: > > > On Mon, Apr 8, 2019 at 5:11 PM Jan Kara <jack@xxxxxxx> wrote: > > > > > > > > On Mon 08-04-19 12:02:34, Amir Goldstein wrote: > > > > > On Mon, Apr 8, 2019 at 2:27 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > > > > > > > On Fri, Apr 05, 2019 at 05:02:33PM +0300, Amir Goldstein wrote: > > > > > > > On Fri, Apr 5, 2019 at 12:17 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > > > > > > > > > > > On Thu, Apr 04, 2019 at 07:57:37PM +0300, Amir Goldstein wrote: > > > > > > > > > This patch improves performance of mixed random rw workload > > > > > > > > > on xfs without relaxing the atomic buffered read/write guaranty > > > > > > > > > that xfs has always provided. > > > > > > > > > > > > > > > > > > We achieve that by calling generic_file_read_iter() twice. > > > > > > > > > Once with a discard iterator to warm up page cache before taking > > > > > > > > > the shared ilock and once again under shared ilock. > > > > > > > > > > > > > > > > This will race with thing like truncate, hole punching, etc that > > > > > > > > serialise IO and invalidate the page cache for data integrity > > > > > > > > reasons under the IOLOCK. These rely on there being no IO to the > > > > > > > > inode in progress at all to work correctly, which this patch > > > > > > > > violates. IOWs, while this is fast, it is not safe and so not a > > > > > > > > viable approach to solving the problem. > > > > > > > > > > > > > > > > > > > > > > This statement leaves me wondering, if ext4 does not takes > > > > > > > i_rwsem on generic_file_read_iter(), how does ext4 (or any other > > > > > > > fs for that matter) guaranty buffered read synchronization with > > > > > > > truncate, hole punching etc? > > > > > > > The answer in ext4 case is i_mmap_sem, which is read locked > > > > > > > in the page fault handler. > > > > > > > > > > > > Nope, the i_mmap_sem is for serialisation of /page faults/ against > > > > > > truncate, holepunching, etc. Completely irrelevant to the read() > > > > > > path. > > > > > > > > > > > > > > > > I'm at lost here. Why are page faults completely irrelevant to read() > > > > > path? Aren't full pages supposed to be faulted in on read() after > > > > > truncate_pagecache_range()? > > > > > > > > During read(2), pages are not "faulted in". Just look at > > > > what generic_file_buffered_read() does. It uses completely separate code to > > > > add page to page cache, trigger readahead, and possibly call ->readpage() to > > > > fill the page with data. "fault" path (handled by filemap_fault()) applies > > > > only to accesses from userspace to mmaps. > > > > > > > > > > Oh! thanks for fixing my blind spot. > > > So if you agree with Dave that ext4, and who knows what other fs, > > > are vulnerable to populating page cache with stale "uptodate" data, > > > > Not that many filesystems support punching holes but you're right. > > > > > then it seems to me that also xfs is vulnerable via readahead(2) and > > > posix_fadvise(). > > > > Yes, this is correct AFAICT. > > > > > Mind you, I recently added an fadvise f_op, so it could be used by > > > xfs to synchronize with IOLOCK. > > > > And yes, this should work. > > > > > Perhaps a better solution would be for truncate_pagecache_range() > > > to leave zeroed or Unwritten (i.e. lazy zeroed by read) pages in page > > > cache. When we have shared pages for files, these pages could be > > > deduped. > > > > No, I wouldn't really mess with sharing pages due to this. It would be hard > > to make that scale resonably and would be rather complex. We really need a > > proper and reasonably simple synchronization mechanism between operations > > removing blocks from inode and operations filling in page cache of the > > inode. Page lock was supposed to provide this but doesn't quite work > > because hole punching first remove pagecache pages and then go removing all > > blocks. > > > > So I agree with Dave that going for range lock is really the cleanest way > > forward here without causing big regressions for mixed rw workloads. I'm > > just thinking how to best do that without introducing lot of boilerplate > > code into each filesystem. > > Hi Jan, Dave, > > Trying to circle back to this after 3 years! > Seeing that there is no progress with range locks and > that the mixed rw workloads performance issue still very much exists. > > Is the situation now different than 3 years ago with invalidate_lock? Yes, I've implemented invalidate_lock exactly to fix the issues you've pointed out without regressing the mixed rw workloads (because invalidate_lock is taken in shared mode only for reads and usually not at all for writes). > Would my approach of pre-warm page cache before taking IOLOCK > be safe if page cache is pre-warmed with invalidate_lock held? Why would it be needed? But yes, with invalidate_lock you could presumably make that idea safe... Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR