On Fri, Sep 23, 2022 at 12:03:53PM -0700, Dan Williams wrote:

> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA sets up a VMA of PAGE_SIZE and races processB to fault in the
> same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
> PTE mapping and processB gets a PMD mapping, but the refcounting is all
> handled in small pages. I need to investigate more what is needed for
> fsdax to support folio_size() > mapping entry size.

This is fine actually. The PMD/PTE can hold a tail page. So the page
cache will hold a PMD sized folio, processA will have a PTE pointing to
a tail page and processB will have a PMD pointing at the head page.

For the immediate instant you can keep accounting for each tail page
as you do now, just with folio wrappers. Once you have proper folios
you shift the accounting responsibility to the core code and the core
will be faster with one ref per PMD/PTE.

The trick with folios is probably going to be breaking up a
folio. THP has some nasty stuff for that, but I think a FS would be
better to just revoke the entire folio, bring the refcount to 0,
change the underlying physical mapping, and then fault will naturally
restore a properly sized folio to accommodate the new physical layout.

ie you never break up a folio once it is created from the pgmap.

What you want is to have the largest possible folios because it
optimizes all the handling logic.

.. and then you are well positioned to do some kind of trick where the
FS asserts at mount time that it never needs a folio less than order X
and you can then trigger the devdax optimization of folding struct
page memory and significantly reducing the wastage for struct page..

Jason
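
To make the accounting above concrete, here is a toy model in Python. It is
not kernel code: Folio, map_pte, map_pmd, unmap and replace_folio are all
hypothetical names invented for illustration, and the 4 KiB / 2 MiB sizes
are an assumption (typical x86-64 values). It only shows one ref per
PMD/PTE landing on the folio, a PTE that may point at a tail page, and a
folio that is replaced only after its refcount reaches 0, never split.

```python
from dataclasses import dataclass

# Assumed sizes for illustration (typical x86-64 values).
PAGE_SIZE = 4096
PMD_SIZE = 2 * 1024 * 1024
PAGES_PER_PMD = PMD_SIZE // PAGE_SIZE


@dataclass
class Folio:
    """Toy stand-in for a folio: one refcount shared by all of its pages."""
    nr_pages: int = PAGES_PER_PMD
    refcount: int = 0

    def map_pte(self, index):
        """Map one small page. index > 0 means the PTE points at a tail page."""
        if not 0 <= index < self.nr_pages:
            raise IndexError("page index outside folio")
        self.refcount += 1  # one ref per PTE, held on the folio
        return ("pte", index)

    def map_pmd(self):
        """Map the whole folio with one PMD pointing at the head page."""
        self.refcount += 1  # one ref per PMD, held on the folio
        return ("pmd", 0)

    def unmap(self, mapping):
        """Drop the single ref that a PTE or PMD mapping took."""
        if self.refcount == 0:
            raise RuntimeError("unmap without a matching map")
        self.refcount -= 1


def replace_folio(old, new_nr_pages):
    """Model 'revoke, then refault': the old folio must be idle first.

    The folio is never split; a new one is created for the new layout.
    """
    if old.refcount != 0:
        raise RuntimeError("folio still mapped; revoke all mappings first")
    return Folio(nr_pages=new_nr_pages)
```

In this model processA would call map_pte(3) and processB would call
map_pmd() on the same Folio, leaving its refcount at 2. Only after both
call unmap() does replace_folio() succeed.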