On Thu, Sep 17, 2020 at 10:51:41AM -0700, Linus Torvalds wrote: > It does strike me that if the main source of contention comes from > that "we need to check that the mapping is still valid as we insert > the page into the page tables", then the page lock really isn't the > obvious lock to use. > > It would be much more natural to use the mapping->i_mmap_rwsem, I feel. > > Willy? Your "just check for uptodate without any lock" patch itself > feels wrong. That's what we do for plain reads, but the difference is > that a read is a one-time event and a race is fine: we get valid data, > it's just that it's only valid *concurrently* with the truncate or > hole-punching event (ie either all zeroes or old data is fine). > > The reason faulting a page in is different from a read is that if you > then map in a stale page, it might have had the correct contents at > the time of the fault, but it will not have the correct contents going > forward. > > So a page-in requires fundamentally stronger locking than a read() > does, because of how the page-in causes that "future lifetime" of the > page, in ways a read() event does not. Yes, I agree, mmap is granting future access to a page in a way that read is not. So we need to be sure that any concurrent truncate/hole-punch/collapse-range/invalidate-for-directio/migration (henceforth page-killer) doesn't allow a page that's about to be recycled to be added to the page tables. What I was going for was that every page-killer currently does something like this: if (page_mapped(page)) unmap_mapping_pages(mapping, page->index, nr, false); so as long as we mark the page as being no-new-maps-allowed before the page-killer checks page_mapped(), and the map-page path checks that the page isn't no-new-maps-allowed after incrementing page_mapped(), then we'll never see something we shouldn't in the page tables -- either it will show up in the page tables right before the page-killer calls unmap_mapping_pages() (in which case you'll get the old contents), or it won't show up in the page tables. I'd actually want to wrap all that into: static inline void page_kill_mappings(struct page) { ClearPageUptodate(page); smb_mb__before_atomic(); if (!page_mapped(page)) return; unmap_mapping_pages(mapping, page->index, compound_nr(page), false); } but that's just syntax. I'm pretty sure that patch I sent out doesn't handle page faults on disappearing pages correctly; it needs to retry so it can instantiate a new page in the page cache. And as Jan pointed out, it didn't handle the page migration case. But that wasn't really the point of the patch. > But truncation that does page cache removal already requires that > i_mmap_rwsem, and in fact the VM already very much uses that (ie when > walking the page mapping). > > The other alternative might be just the mapping->private_lock. It's > not a reader-writer lock, but if we don't need to sleep (and I don't > think the final "check ->mapping" can sleep anyway since it has to be > done together with the page table lock), a spinlock would be fine. I'm not a huge fan of taking file-wide locks for something that has a naturally finer granularity. It tends to bite us when weirdos with giant databases mmap the whole thing and then whine about contention on the rwsem during page faults. But you're right that unmap_mapping_range() already takes this lock for removing pages from the page table, so it would be reasonable to take it for read when adding pages to the page table. Something like taking the i_mmap_lock_read(file->f_mapping) in filemap_fault, then adding a new VM_FAULT_I_MMAP_LOCKED bit so that do_read_fault() and friends add: if (ret & VM_FAULT_I_MMAP_LOCKED) i_mmap_unlock_read(vmf->vma->vm_file->f_mapping); else unlock_page(page); ... want me to turn that into a real patch?