Re: ZONE_DEVICE refcounting

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]


Dan Williams <dan.j.williams@xxxxxxxxx> writes:

> Alistair Popple wrote:
>> Hi,
>> I have been looking at fixing up ZONE_DEVICE refcounting again. Specifically I
>> have been looking  at fixing the 1-based refcounts that are currently used for
>> FS DAX pages (and p2pdma pages, but that's trival).
>> This started with the simple idea of "just subtract one from the
>> refcounts everywhere and that will fix the off by one". Unfortunately
>> it's not that simple. For starters doing a simple conversion like that
>> requires allowing pages to be mapped with zero refcounts. That seems
>> wrong. It also leads to problems detecting idle IO vs. page map pages.
>> So instead I'm thinking of doing something along the lines of the following:
>> 1. Refcount FS DAX pages normally. Ie. map them with vm_insert_page() and
>>    increment the refcount inline with mapcount and decrement it when pages are
>>    unmapped.
> It has been a while but the sticking point last time was how to plumb
> the "allocation" mechanism that elevated the page from 0 to 1. However,
> that seems solvable.

So as a proof-of-concept I was just doing it as part of the actual page
mapping (ie. by replacing vm_insert_mixed() with vm_insert_page()) done
in dax_fault_iter().

That did have the weirdness of passing a zero-refcount page in to be
mapped, but I think that's solvable by taking a reference when
converting the pfn to a page in eg. dax_iomap_direct_access().

The issue I have just run into is the initialisation of these struct
pages is tricky. It would be ok if we didn't have compound pages.
However because we do it means we could get an access request to a
subpage that has already been mapped as a compound page and we obviously
can't just switch the struct page back and forth on every
dax_iomap_direct_access() call.

But I've been reading the DAX pagecache implementation and it looks like
doing the page initialisation there is actually the correct spot as it
already deals with some of this. I've also just discovered the
page->share overloading which wasn't a highlight of my day :-)

Thankfully I think that can also just get rolled into the refcounting if
we do that properly.


>> 4. PMD sized FS DAX pages get treated the same as normal compound pages.
> Here potentially be dragons. There are pud_devmap() checks in places
> where mm code needs to be careful not to treat a dax page as a typical
> transhuge page that can be split.

Interesting. Thanks for pointing that out - I'd overlooked the fact
pXd_trans_huge() implies !pXd_devmap(). Most callers end up checking
both though, and I don't quite understand the issue with splitting.

Specifically things like __split_huge_pud() do pretty much the same
thing for pud_trans_huge() and pud_devmap() which is to clear the
pud/pmd (although treating FS DAX pages as normal compound pages removes
some special cases).

And folio splitting already seems like it would have dragons given these
are not LRU pages and hence can't be split to lists, etc. Most of the
code I looked at there checked that though, and if we have the folio we
can easily look up the zone anyway.

I also noticed folio_anon() is not safe to call on a FS DAX page due to


>> I have made good progress implementing the above, and am reasonably confident I
>> can make it work (I have some tests that exercise these code paths working).
> Wow, that's great! Really appreciate and will be paying you back with
> review cycles.

Thanks! Do you have a preferred test suite that you use to stress FS
DAX?  I wrote a couple of directed "tests" to get an understanding of
and to exercise some of the tricker code paths (eg. GUP). Have also
tried the DAX xfstests but they were passing which made me suspicious.

>> However my knowledge of the filesystem layer is a bit thin, so before going too
>> much further down this path I was hoping to get some feedback on the overall
>> direction to see if there are any corner cases or other potential problems I
>> have missed that may prevent the above being practical.
> If you want to send me draft patches for that on or offlist feel free.

Great, once I figure out the compound page initialisation I should have
something reasonable to send.

>> If not I will clean my series up and post it as an RFC. Thanks.
> Thanks, Alistair!

[Index of Archives]     [Linux Ext4 Filesystem]     [Union Filesystem]     [Filesystem Testing]     [Ceph Users]     [Ecryptfs]     [NTFS 3]     [AutoFS]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux Cachefs]     [Reiser Filesystem]     [Linux RAID]     [NTFS 3]     [Samba]     [Device Mapper]     [CEPH Development]

  Powered by Linux