Re: [PATCH v2 10/18] fsdax: Manage pgmap references at entry insertion and deletion

Jason Gunthorpe <jgg@xxxxxxxxxx> · Fri, 23 Sep 2022 16:23:26 -0300

On Fri, Sep 23, 2022 at 12:03:53PM -0700, Dan Williams wrote:

> Perhaps, I'll take a look. The scenario I am more concerned about is
> processA sets up a VMA of PAGE_SIZE and races processB to fault in the
> same filesystem block with a VMA of PMD_SIZE. Right now processA gets a
> PTE mapping and processB gets a PMD mapping, but the refcounting is all
> handled in small pages. I need to investigate more what is needed for
> fsdax to support folio_size() > mapping entry size.

This is fine actually.

The PMD/PTE can hold a tail page. So the page cache will hold a PMD
sized folio, procesA will have a PTE pointing to a tail page and
processB will have a PMD pointing at the head page.

For the immediate instant you can keep accounting for each tail page
as you do now, just with folio wrappers. Once you have proper folios
you shift the accounting responsibility to the core code and the core
will faster with one ref per PMD/PTE.

The trick with folios is probably going to be breaking up a folio. THP
has some nasty stuff for that, but I think a FS would be better to
just revoke the entire folio, bring the refcount to 0, change the
underling physical mapping, and then fault will naturally restore a
properly sized folio to accomodate the new physical layout.

ie you never break up a folio once it is created from the pgmap.

What you want is to have largest possibile folios because it optimizes
all the handling logic.

.. and then you are well positioned to do some kind of trick where the
FS asserts at mount time that it never needs a folio less than order X
and you can then trigger the devdax optimization of folding struct
page memory and significantly reducing the wastage for struct page..

Jason