On Sat, Sep 12, 2020 at 5:41 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> Hmmmm. So this is typically a truncate race check, but this isn't
> sufficient to protect the fault against all page invalidation races
> as the page can be re-inserted into the same mapping at a different
> page->index now within EOF.

Using some "move" ioctl or similar, with an "invalidate the page
mapping, then move it to a different point" model?

Yeah. I think that ends up being basically an extended special case
of the truncate thing (for the invalidate), and it would require the
filesystem to serialize externally to the page anyway. Which it would
presumably already do with the MMAPLOCK or similar, so I guess that's
not a huge deal.

The real worry with (d) is that we use the page lock for other things
too, not *just* the truncate check: things where the inode lock
wouldn't help, like locking against throwing pages out of the page
cache entirely, or the hugepage splitting/merging etc. The page isn't
being truncated there; it's just the VM shrinking the cache or
modifying things in other ways.

So I do worry a bit about trying to make things per-inode (or even
some per-range thing with a smarter lock) for those reasons. We use
the page lock not just for synchronizing with filesystem operations,
but for other page state synchronization too. In many ways I think
keeping it as a page lock, and just making the filesystem operations
act on the range of pages, would be safer.

But the page locking code does have some extreme downsides, exactly
because there are so _many_ pages that we end up having to play
extreme size games (ie the whole external hashing, but also not being
able to use any debug locks etc, because we just don't have the
resources to do debugging locks at that kind of granularity).

That's a somewhat longer-term issue. I'll try to do another version
of the "hybrid fairness" page lock (and/or just try some limited
optimistic spinning) to see if I can at least avoid the nasty
regression. Admittedly it probably only happens for these kinds of
microbenchmarks that just hammer on one page over and over again, but
it's a big enough regression for a "real enough" load that I really
don't like it.

              Linus
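
For reference, the "truncate race check" discussed above is the usual
pattern of re-checking page->mapping (and the page's offset against
i_size) after taking the page lock, since truncate clears
page->mapping under that lock. Below is a minimal sketch of that
pattern, assuming a generic fault path and a hypothetical helper
name; it is illustrative only, not any particular filesystem's actual
code (the real checks live in filemap_fault() and the individual
filesystems, and differ in detail):

	#include <linux/fs.h>
	#include <linux/mm.h>
	#include <linux/pagemap.h>

	/* Illustrative sketch: check a faulted-in page against truncation. */
	static vm_fault_t example_fault_check(struct vm_fault *vmf,
					      struct page *page)
	{
		struct inode *inode = file_inode(vmf->vma->vm_file);

		lock_page(page);

		/* Truncate/invalidation clears page->mapping under the page lock. */
		if (page->mapping != inode->i_mapping) {
			unlock_page(page);
			return VM_FAULT_NOPAGE;		/* retry the fault */
		}

		/* Page now entirely beyond EOF? */
		if (page_offset(page) >= i_size_read(inode)) {
			unlock_page(page);
			return VM_FAULT_SIGBUS;
		}

		/*
		 * As noted in the quote above, this alone does not catch a
		 * page that was invalidated and re-inserted into the same
		 * mapping at a different page->index still within EOF; that
		 * case needs serialization outside the page (e.g. the
		 * MMAPLOCK).
		 */
		return VM_FAULT_LOCKED;
	}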
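
The "limited optimistic spinning" idea at the end boils down to:
before a page-lock waiter goes to sleep, let it poll the lock for a
short, bounded time in case the holder releases it quickly, recovering
most of an unfair lock's throughput on one-page microbenchmarks
without reintroducing unbounded starvation. A rough standalone
illustration of bounded spin-then-sleep follows (userspace, with
made-up names and an arbitrary spin budget; not the actual page-lock
patch):

	#include <pthread.h>

	#define SPIN_BUDGET 100		/* arbitrary bound on the optimistic spin */

	/*
	 * Poll the lock optimistically for a bounded number of attempts; if
	 * the holder doesn't drop it quickly, fall back to the normal
	 * blocking (sleeping) acquisition.
	 */
	static void hybrid_lock(pthread_mutex_t *m)
	{
		for (int i = 0; i < SPIN_BUDGET; i++) {
			if (pthread_mutex_trylock(m) == 0)
				return;		/* got it without sleeping */
			/* stand-in for a cpu_relax()/pause hint in real code */
			__asm__ __volatile__("" ::: "memory");
		}
		pthread_mutex_lock(m);		/* give up and sleep */
	}

The real page lock obviously can't use a pthread mutex; the point is
just the shape: a bounded trylock poll in front of the existing fair
sleeping path.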