Alistair Popple wrote:
> Hi,
>
> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
> have been looking at fixing the 1-based refcounts that are currently used for
> FS DAX pages (and p2pdma pages, but that's trivial).
>
> This started with the simple idea of "just subtract one from the
> refcounts everywhere and that will fix the off by one". Unfortunately
> it's not that simple. For starters doing a simple conversion like that
> requires allowing pages to be mapped with zero refcounts. That seems
> wrong. It also leads to problems detecting idle IO vs. page map pages.
>
> So instead I'm thinking of doing something along the lines of the following:
>
> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
> increment the refcount inline with mapcount and decrement it when pages are
> unmapped.

It has been a while, but the sticking point last time was how to plumb the
"allocation" mechanism that elevates the page from 0 to 1. However, that
seems solvable.

> 2. As per normal pages the pages are considered free when the refcount drops
> to zero.

That is the dream, yes.

> 3. Because these are treated as normal pages for refcounting we no longer map
> them as pte_devmap() (possibly freeing up a PTE bit).

Yeah, pte_devmap() dies once mapcount behaves normally.

> 4. PMD sized FS DAX pages get treated the same as normal compound pages.

Here potentially be dragons. There are pud_devmap() checks in places where
mm code needs to be careful not to treat a dax page as a typical transhuge
page that can be split.

> 5. This means we need to allow compound ZONE DEVICE pages. Tail pages share
> the page->pgmap field with page->compound_head, but this isn't a problem
> because the LSB of page->pgmap is free and we can still get pgmap from
> compound_head(page)->pgmap.

Sounds plausible.

> 6. When FS DAX pages are freed they notify filesystem drivers. This can be done
> from the pgmap->ops->page_free() callback.

Yes, necessary for DAX-GUP interactions.

> 7. We could probably get rid of the pgmap refcounting because we can just scan
> pages and look for any pages with non-zero references and wait for them to be
> freed whilst ensuring no new mappings can be created (some drivers do a
> similar thing for private pages today). This might be a follow-up change.

This sounds reasonable.

> I have made good progress implementing the above, and am reasonably confident I
> can make it work (I have some tests that exercise these code paths working).

Wow, that's great! I really appreciate it and will be paying you back with
review cycles.

> However my knowledge of the filesystem layer is a bit thin, so before going too
> much further down this path I was hoping to get some feedback on the overall
> direction to see if there are any corner cases or other potential problems I
> have missed that may prevent the above being practical.

If you want to send me draft patches for that, on or off list, feel free.

> If not I will clean my series up and post it as an RFC. Thanks.

Thanks, Alistair!
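
P.S. For anyone following along, the refcounting scheme in items 1, 2, 5, 6
and 7 can be sketched as a toy userspace model. This is only an illustration
under stated assumptions, not kernel code and not the actual series: every
toy_* name is hypothetical, the refcount is a plain int rather than an
atomic, and the tail encoding (head address with bit 0 set) merely mirrors
the observation that the LSB of page->pgmap is free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy userspace model. All toy_* names are hypothetical, not kernel API. */
struct toy_page;

struct toy_pgmap {
	/* Stands in for pgmap->ops->page_free(). */
	void (*page_free)(struct toy_page *page);
	int freed;	/* number of times page_free() has run */
};

struct toy_page {
	int refcount;	/* 0 means free, as for normal pages (item 2) */
	union {		/* anonymous union, needs C11 or GNU C */
		struct toy_pgmap *pgmap;	/* head or order-0 page */
		uintptr_t compound_head;	/* tail: head address | 1 */
	};
};

/* Mark @tail as a tail page of @head by setting bit 0 (item 5). */
static void toy_set_tail(struct toy_page *tail, struct toy_page *head)
{
	tail->compound_head = (uintptr_t)head | (uintptr_t)1;
}

/* Aligned pointers have bit 0 clear, so a set bit 0 identifies a tail. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	uintptr_t v = page->compound_head;

	return (v & 1) ? (struct toy_page *)(v - 1) : page;
}

/* pgmap is always reachable through the head page. */
static struct toy_pgmap *toy_page_pgmap(struct toy_page *page)
{
	return toy_compound_head(page)->pgmap;
}

/* Called when the page is freed; a filesystem would be notified here. */
static void toy_page_free(struct toy_page *page)
{
	toy_page_pgmap(page)->freed++;
}

/* Mapping takes a reference on the head page (item 1). */
static void toy_map(struct toy_page *page)
{
	toy_compound_head(page)->refcount++;
}

/* Unmapping drops it; the final drop invokes page_free() (items 2, 6). */
static void toy_unmap(struct toy_page *page)
{
	struct toy_page *head = toy_compound_head(page);

	assert(head->refcount > 0);
	if (--head->refcount == 0 && head->pgmap->page_free)
		head->pgmap->page_free(head);
}

/* Scan for head pages that still hold references (item 7). */
static size_t toy_count_busy(struct toy_page *pages, size_t n)
{
	size_t busy = 0;

	for (size_t i = 0; i < n; i++)
		if (toy_compound_head(&pages[i]) == &pages[i] &&
		    pages[i].refcount > 0)
			busy++;
	return busy;
}
```

In this model a teardown path would block new toy_map() calls and then wait
until toy_count_busy() returns zero, which is the scan-and-wait idea from
item 7. The real thing needs atomics and locking that the sketch leaves out.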