Dan Williams <dan.j.williams@xxxxxxxxx> writes: > Alistair Popple wrote: >> Hi, >> >> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I >> have been looking at fixing the 1-based refcounts that are currently used for >> FS DAX pages (and p2pdma pages, but that's trival). >> >> This started with the simple idea of "just subtract one from the >> refcounts everywhere and that will fix the off by one". Unfortunately >> it's not that simple. For starters doing a simple conversion like that >> requires allowing pages to be mapped with zero refcounts. That seems >> wrong. It also leads to problems detecting idle IO vs. page map pages. >> >> So instead I'm thinking of doing something along the lines of the following: >> >> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and >> increment the refcount inline with mapcount and decrement it when pages are >> unmapped. > > It has been a while but the sticking point last time was how to plumb > the "allocation" mechanism that elevated the page from 0 to 1. However, > that seems solvable. So as a proof-of-concept I was just doing it as part of the actual page mapping (ie. by replacing vm_insert_mixed() with vm_insert_page()) done in dax_fault_iter(). That did have the weirdness of passing a zero-refcount page in to be mapped, but I think that's solvable by taking a reference when converting the pfn to a page in eg. dax_iomap_direct_access(). The issue I have just run into is the initialisation of these struct pages is tricky. It would be ok if we didn't have compound pages. However because we do it means we could get an access request to a subpage that has already been mapped as a compound page and we obviously can't just switch the struct page back and forth on every dax_iomap_direct_access() call. But I've been reading the DAX pagecache implementation and it looks like doing the page initialisation there is actually the correct spot as it already deals with some of this. I've also just discovered the page->share overloading which wasn't a highlight of my day :-) Thankfully I think that can also just get rolled into the refcounting if we do that properly. [...] >> 4. PMD sized FS DAX pages get treated the same as normal compound pages. > > Here potentially be dragons. There are pud_devmap() checks in places > where mm code needs to be careful not to treat a dax page as a typical > transhuge page that can be split. Interesting. Thanks for pointing that out - I'd overlooked the fact pXd_trans_huge() implies !pXd_devmap(). Most callers end up checking both though, and I don't quite understand the issue with splitting. Specifically things like __split_huge_pud() do pretty much the same thing for pud_trans_huge() and pud_devmap() which is to clear the pud/pmd (although treating FS DAX pages as normal compound pages removes some special cases). And folio splitting already seems like it would have dragons given these are not LRU pages and hence can't be split to lists, etc. Most of the code I looked at there checked that though, and if we have the folio we can easily look up the zone anyway. I also noticed folio_anon() is not safe to call on a FS DAX page due to sharing PAGE_MAPPING_DAX_SHARED. [...] >> I have made good progress implementing the above, and am reasonably confident I >> can make it work (I have some tests that exercise these code paths working). > > Wow, that's great! Really appreciate and will be paying you back with > review cycles. Thanks! Do you have a preferred test suite that you use to stress FS DAX? I wrote a couple of directed "tests" to get an understanding of and to exercise some of the tricker code paths (eg. GUP). Have also tried the DAX xfstests but they were passing which made me suspicious. >> However my knowledge of the filesystem layer is a bit thin, so before going too >> much further down this path I was hoping to get some feedback on the overall >> direction to see if there are any corner cases or other potential problems I >> have missed that may prevent the above being practical. > > If you want to send me draft patches for that on or offlist feel free. Great, once I figure out the compound page initialisation I should have something reasonable to send. >> If not I will clean my series up and post it as an RFC. Thanks. > > Thanks, Alistair!