On Mon, Sep 14, 2020 at 09:45:03AM +1000, Dave Chinner wrote:
> I have my doubts that complex page cache manipulation operations
> like ->migrate_page that rely exclusively on page and internal mm
> serialisation are really safe against ->fallocate based invalidation
> races.  I think they probably also need to be wrapped in the
> MMAPLOCK, but I don't understand all the locking and constraints
> that ->migrate_page has and there's been no evidence yet that it's a
> problem so I've kinda left that alone.  I suspect that "no evidence"
> thing comes from "filesystem people are largely unable to induce
> page migrations in regression testing" so it has pretty much zero
> test coverage....

Maybe we can get someone who knows the page migration code to give us
a hack to induce pretty much constant migration?

> Stuff like THP splitting hasn't been an issue for us because the
> file-backed page cache does not support THP (yet!).  That's
> something I'll be looking closely at in Willy's upcoming patchset.

One of the things I did was fail every tenth I/O to a THP.  That
causes us to split the THP when we come to try to make use of it.
It's far more effective than using dm-flakey, because I know that
failing a readahead I/O should not cause any test to fail, so any
newly-failing test is caused by the THP code.

I've probably spent more time looking at the page splitting and
truncate/hole-punch/invalidate/invalidate2 paths than anything else.
It's definitely an area where more eyes are welcome, and just having
more people understand it would be good.  split_huge_page_to_list()
and its various helper functions are about 400 lines of code and,
IMO, a little too complex.

> The other issue here is that serialisation via individual cache
> object locking just doesn't scale in any way to the sizes of
> operations that fallocate() can run.  fallocate() has 64 bit
> operands, so a user could ask us to lock down a full 8EB range of
> file.  Locking that page by page, even using 1GB huge page Xarray
> slot entries, is just not practical... :/

FWIW, there's not currently a "lock down this range" mechanism in the
page cache.  If there were, it wouldn't be restricted to 4k/2M/1G
sizes -- with the XArray today, it's fairly straightforward to lock
ranges which are m * 64^n entries in size (for 1 <= m <= 63, n >= 0).

In the next year or two, I hope to be able to offer a "lock arbitrary
page range" feature which is as cheap for locking 8EiB as it is for
128KiB.  It would still be page ranges, not byte ranges, so I don't
know how well that fits your needs.  It doesn't solve the DIO vs page
cache problems at all, since we want DIO operations to byte ranges
that happen to fall within the same pages not to conflict with each
other.