On Sat, Sep 12, 2020 at 07:39:31PM -0700, Linus Torvalds wrote: > On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > Hmmmm. So this is a typically a truncate race check, but this isn't > > sufficient to protect the fault against all page invalidation races > > as the page can be re-inserted into the same mapping at a different > > page->index now within EOF. > > Using some "move" ioctl or similar and using a "invalidate page > mapping, then move it to a different point" model? Right, that's the sort of optimisation we could do inside a FALLOC_FL_{COLLAPSE,INSERT}_RANGE operation if we wanted to preserve the page cache contents instead of invalidating it. > Yeah. I think that ends up being basically an extended special case of > the truncate thing (for the invalidate), and would require the > filesystem to serialize externally to the page anyway. *nod* > Which they would presumably already do with the MMAPLOCK or similar, > so I guess that's not a huge deal. > > The real worry with (d) is that we are using the page lock for other > things too, not *just* the truncate check. Things where the inode lock > wouldn't be helping, like locking against throwing pages out of the > page cache entirely, or the hugepage splitting/merging etc. It's not > being truncated, it's just the VM shrinking the cache or modifying > things in other ways. Yes, that is a problem, and us FS people don't know/see all the places this can occur. We generally find out about them when one of our regression stress tests trips over a data corruption. :( I have my doubts that complex page cache manipulation operations like ->migrate_page that rely exclusively on page and internal mm serialisation are really safe against ->fallocate based invalidation races. I think they probably also need to be wrapped in the MMAPLOCK, but I don't understand all the locking and constraints that ->migrate_page has and there's been no evidence yet that it's a problem so I've kinda left that alone. I suspect that "no evidence" thing comes from "filesystem people are largely unable to induce page migrations in regression testing" so it has pretty much zero test coverage.... Stuff like THP splitting hasn't been an issue for us because the file-backed page cache does not support THP (yet!). That's something I'll be looking closely at in Willy's upcoming patchset. > So I do worry a bit about trying to make things per-inode (or even > some per-range thing with a smarter lock) for those reasons. We use > the page lock not just for synchronizing with filesystem operations, > but for other page state synchronization too. Right, I'm not suggesting the page lock goes away, just saying that we actually need two levels of locking for file-backed pages - one filesystem, one page level - and that carefully selecting where we "aggregate" the locking for complex multi-object operations might make the overall locking simpler. > In many ways I think keeping it as a page-lock, and making the > filesystem operations just act on the range of pages would be safer. Possibly, but that "range of pages" lock still doesn't really solve the filesystem level serialisation problem. We have to prevent page faults from running over a range even when there aren't pages in the page cache over that range (i.e. after we invalidate the range). Hence we cannot rely on anything struct page related - the serialisation mechanism has to be external to the cached pages themselves, but it also has to integrate cleanly into the existing locking and transaction ordering constraints we have. > But the page locking code does have some extreme downsides, exactly > because there are so _many_ pages and we end up having to play some > extreme size games due to that (ie the whole external hashing, but > also just not being able to use any debug locks etc, because we just > don't have the resources to do debugging locks at that kind of > granularity). *nod* The other issue here is that serialisation via individual cache object locking just doesn't scale in any way to the sizes of operations that fallocate() can run. fallocate() has 64 bit operands, so a user could ask us to lock down a full 8EB range of file. Locking that page by page, even using 1GB huge page Xarray slot entries, is just not practical... :/ Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx