On Tue, Apr 13, 2021 at 01:28:46PM +0200, Jan Kara wrote: > Currently, serializing operations such as page fault, read, or readahead > against hole punching is rather difficult. The basic race scheme is > like: > > fallocate(FALLOC_FL_PUNCH_HOLE) read / fault / .. > truncate_inode_pages_range() > <create pages in page > cache here> > <update fs block mapping and free blocks> > > Now the problem is in this way read / page fault / readahead can > instantiate pages in page cache with potentially stale data (if blocks > get quickly reused). Avoiding this race is not simple - page locks do > not work because we want to make sure there are *no* pages in given > range. inode->i_rwsem does not work because page fault happens under > mmap_sem which ranks below inode->i_rwsem. Also using it for reads makes > the performance for mixed read-write workloads suffer. > > So create a new rw_semaphore in the inode - i_mapping_sem - that > protects adding of pages to page cache for page faults / reads / > readahead. > > Signed-off-by: Jan Kara <jack@xxxxxxx> .... > diff --git a/mm/filemap.c b/mm/filemap.c > index bd7c50e060a9..bc82a7856d3e 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -77,7 +77,8 @@ > * ->i_pages lock > * > * ->i_rwsem > - * ->i_mmap_rwsem (truncate->unmap_mapping_range) > + * ->i_mapping_sem (acquired by fs in truncate path) > + * ->i_mmap_rwsem (truncate->unmap_mapping_range) This is now officially a confusing mess. We have inode->i_mapping that points to an address space, which has an internal i_mmap variable and an i_mmap_rwsem that serialises access to the i_mmap tree. Now we have a inode->i_mapping_sem (which is actually a rwsem, not a sem) that sorta serialises additions to the mapping tree (inode->i_mapping->i_pages) against truncate, but it does not serialise accesses to the rest of the inode->i_mapping structure itself despite the similarlities in naming. Then we have the inode_lock() API and the i_mmap_lock() API that wrap around the i_rwsem and i_mmap_rwsem, but there's no API for inode_mapping_lock()... THen we have the mmap_lock in the page fault path as well, which is also an rwsem despite the name, and that protects something completely different to the i_mmap and the i_mapping. IOWs, we have 4 layers of entwined structures and locking here that pretty much all have the same name but protect different things and not always the obvious thing the name suggests. This makes it really difficult to actually read the code and understand that the correct lock is being used in the correct place... > * > * ->mmap_lock > * ->i_mmap_rwsem > @@ -85,7 +86,8 @@ > * ->i_pages lock (arch-dependent flush_dcache_mmap_lock) > * > * ->mmap_lock > - * ->lock_page (access_process_vm) > + * ->i_mapping_sem (filemap_fault) > + * ->lock_page (filemap_fault, access_process_vm) > * > * ->i_rwsem (generic_perform_write) > * ->mmap_lock (fault_in_pages_readable->do_page_fault) > @@ -2276,16 +2278,28 @@ static int filemap_update_page(struct kiocb *iocb, > { > int error; > > + if (iocb->ki_flags & IOCB_NOWAIT) { > + if (!down_read_trylock(&mapping->host->i_mapping_sem)) > + return -EAGAIN; > + } else { > + down_read(&mapping->host->i_mapping_sem); > + } We really need a lock primitive for this. The number of times this exact lock pattern is being replicated all through the IO path is getting out of hand. static inline bool down_read_try_or_lock(struct rwsem *sem, bool try) { if (try) { if (!down_read_trylock(sem)) return false; } else { down_read(&mapping->host->i_mapping_sem); } return true; } and the callers become: if (!down_read_try_or_lock(sem, (iocb->ki_flags & IOCB_NOWAIT))) return -EAGAIN; We can do the same with mutex_try_or_lock(), down_try_or_lock(), etc and we don't need to rely on cargo cult knowledge to propagate this pattern anymore. Because I'm betting relatively few people actually know why the code is written this way because the only place it is documented is in an XFS commit message.... Doing this is a separate cleanup, though, and not something that needs to be done in this patchset. > index c5b0457415be..ac5bb50b3a4c 100644 > --- a/mm/readahead.c > +++ b/mm/readahead.c > @@ -192,6 +192,7 @@ void page_cache_ra_unbounded(struct readahead_control *ractl, > */ > unsigned int nofs = memalloc_nofs_save(); > > + down_read(&mapping->host->i_mapping_sem); > /* > * Preallocate as many pages as we will need. > */ I can't say I'm a great fan of having the mapping reach back up to the host to lock the host. THis seems the wrong way around to me given that most of the locking in the IO path is in "host locks mapping" and "mapping locks internal mapping structures" order... I also come back to the naming confusion here, in that when we look at this in long hand from the inode perspective, this chain actually looks like: lock(inode->i_mapping->inode->i_mapping_sem) i.e. the mapping is reaching back up outside it's scope to lock itself against other inode->i_mapping operations. Smells of layering violations to me. So, next question: should this truncate semanphore actually be part of the address space, not the inode? This patch is actually moving the page fault serialisation from the inode into the address space operations when page faults and page cache operations are done, so maybe the lock should also make that move? That would help clear up the naming problem, because now we can name it based around what it serialises in the address space, not the address space as a whole... Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx