Re: ZONE_DEVICE refcounting

Alistair Popple wrote:
> Hi,
> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
> have been looking at fixing the 1-based refcounts that are currently used for
> FS DAX pages (and p2pdma pages, but that's trivial).
> This started with the simple idea of "just subtract one from the
> refcounts everywhere and that will fix the off by one". Unfortunately
> it's not that simple. For starters doing a simple conversion like that
> requires allowing pages to be mapped with zero refcounts. That seems
> wrong. It also leads to problems detecting idle IO vs. page map pages.
> So instead I'm thinking of doing something along the lines of the following:
> 1. Refcount FS DAX pages normally, i.e. map them with vm_insert_page() and
>    increment the refcount inline with mapcount and decrement it when pages are
>    unmapped.

It has been a while but the sticking point last time was how to plumb
the "allocation" mechanism that elevated the page from 0 to 1. However,
that seems solvable.

> 2. As per normal pages the pages are considered free when the refcount drops
>    to zero.

That is the dream, yes.

> 3. Because these are treated as normal pages for refcounting we no longer map
>    them as pte_devmap() (possibly freeing up a PTE bit).

Yeah, pte_devmap() dies once mapcount behaves normally.

> 4. PMD sized FS DAX pages get treated the same as normal compound pages.

Here potentially be dragons. There are pud_devmap() checks in places
where mm code needs to be careful not to treat a dax page as a typical
transhuge page that can be split.

> 5. This means we need to allow compound ZONE_DEVICE pages. Tail pages share
>    the page->pgmap field with page->compound_head, but this isn't a problem
>    because the LSB of page->pgmap is free and we can still get pgmap from
>    compound_head(page)->pgmap.

Sounds plausible.
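
To make the pointer-sharing argument in point 5 concrete, here is a
minimal userspace sketch. It is not kernel code: struct tagged_page,
struct tagged_pgmap and all tagged_* helpers are hypothetical names
invented for illustration. It assumes a single word holds either the
pgmap pointer (head and order-0 pages) or the head pointer with the
LSB set (tail pages), which works only because both pointers are at
least 2-byte aligned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a dev_pagemap; not the kernel structure. */
struct tagged_pgmap {
	int id;
};

/* Hypothetical stand-in for struct page. */
struct tagged_page {
	/* pgmap pointer for head/order-0 pages, (head | 1) for tail pages */
	uintptr_t pgmap_or_head;
};

static void tagged_set_pgmap(struct tagged_page *page,
			     struct tagged_pgmap *pgmap)
{
	/* The scheme relies on the LSB of the pgmap pointer being free. */
	assert(((uintptr_t)pgmap & 1) == 0);
	page->pgmap_or_head = (uintptr_t)pgmap;
}

static void tagged_set_compound_head(struct tagged_page *tail,
				     struct tagged_page *head)
{
	/* Setting the LSB marks this page as a tail page. */
	tail->pgmap_or_head = (uintptr_t)head | 1;
}

static struct tagged_page *tagged_compound_head(struct tagged_page *page)
{
	if (page->pgmap_or_head & 1)
		return (struct tagged_page *)(page->pgmap_or_head - 1);
	return page;
}

/* The pgmap is always reached through the head page. */
static struct tagged_pgmap *tagged_page_pgmap(struct tagged_page *page)
{
	return (struct tagged_pgmap *)tagged_compound_head(page)->pgmap_or_head;
}
```

Any lookup on a tail page first resolves to the head, so the tail's
copy of the word never needs to hold the pgmap itself.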

> 6. When FS DAX pages are freed they notify filesystem drivers. This can be done
>    from the pgmap->ops->page_free() callback.

Yes, necessary for DAX-GUP interactions.
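
For reference, the lifecycle described in points 1, 2 and 6 can be
modelled in userspace roughly as follows. This is only a sketch under
stated assumptions: struct model_page, struct model_pgmap_ops and the
model_* helpers are hypothetical and are not kernel APIs, there is no
locking or atomicity, and the "allocation" step that takes the page
from 0 to 1 is assumed to exist in some form.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; not the kernel structure. */
struct model_page {
	int refcount;	/* 0 means the page is free, as for normal pages */
	int mapcount;	/* number of mappings of this page */
};

/* Hypothetical stand-in for the pgmap ops with a page_free() hook. */
struct model_pgmap_ops {
	void (*page_free)(struct model_page *page);
};

static int model_freed_count;

/* Stand-in for a filesystem being told that a page went idle. */
static void model_fs_page_free(struct model_page *page)
{
	(void)page;
	model_freed_count++;
}

/* The "allocation" step: elevate a free page from 0 to 1. */
static void model_page_alloc(struct model_page *page)
{
	assert(page->refcount == 0);
	page->refcount = 1;
}

/* Drop one reference; the final put notifies the owner. */
static void model_page_put(struct model_page *page,
			   const struct model_pgmap_ops *ops)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0 && ops->page_free)
		ops->page_free(page);
}

/* Mapping takes a reference in step with mapcount. */
static void model_page_map(struct model_page *page)
{
	assert(page->refcount > 0);	/* never map a free page */
	page->refcount++;
	page->mapcount++;
}

/* Unmapping drops the reference taken at map time. */
static void model_page_unmap(struct model_page *page,
			     const struct model_pgmap_ops *ops)
{
	assert(page->mapcount > 0);
	page->mapcount--;
	model_page_put(page, ops);
}
```

In this model a page is never mapped with a zero refcount, and the
callback fires exactly once, when the last reference goes away.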

> 7. We could probably get rid of the pgmap refcounting because we can just scan
>    pages and look for any pages with non-zero references and wait for them to be
>    freed whilst ensuring no new mappings can be created (some drivers do a
>    similar thing for private pages today). This might be a follow-up change.

This sounds reasonable.
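
A rough userspace sketch of the teardown idea in point 7, assuming a
"dying" flag is what stops new references from being taken. The names
struct scan_page, struct scan_pgmap and the scan_* helpers are
hypothetical, the model is single-threaded, and a real implementation
would need atomics and would sleep until the busy count reaches zero
rather than just report it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct scan_page {
	int refcount;
};

/* Hypothetical stand-in for a pgmap that owns a range of pages. */
struct scan_pgmap {
	struct scan_page *pages;
	size_t nr_pages;
	bool dying;	/* once set, no new references may be taken */
};

/* Take a reference unless teardown has started. */
static bool scan_page_try_get(struct scan_pgmap *pgmap, size_t idx)
{
	if (pgmap->dying)
		return false;
	pgmap->pages[idx].refcount++;
	return true;
}

static void scan_page_put(struct scan_pgmap *pgmap, size_t idx)
{
	assert(pgmap->pages[idx].refcount > 0);
	pgmap->pages[idx].refcount--;
}

/* Start teardown: from here on only puts can happen. */
static void scan_pgmap_kill(struct scan_pgmap *pgmap)
{
	pgmap->dying = true;
}

/* Scan every page and count those that still hold references. */
static size_t scan_pgmap_count_busy(const struct scan_pgmap *pgmap)
{
	size_t busy = 0;

	for (size_t i = 0; i < pgmap->nr_pages; i++)
		if (pgmap->pages[i].refcount != 0)
			busy++;
	return busy;
}
```

Teardown would repeat the scan until it returns zero, which replaces
the separate pgmap reference count in this model.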

> I have made good progress implementing the above, and am reasonably confident I
> can make it work (I have some tests that exercise these code paths working).

Wow, that's great! I really appreciate it and will be paying you back
with review cycles.

> However my knowledge of the filesystem layer is a bit thin, so before going too
> much further down this path I was hoping to get some feedback on the overall
> direction to see if there are any corner cases or other potential problems I
> have missed that may prevent the above being practical.

If you want to send me draft patches for that, on-list or off-list, feel free.

> If not I will clean my series up and post it as an RFC. Thanks.

Thanks, Alistair!
