Jason Gunthorpe wrote: > On Wed, Sep 07, 2022 at 01:45:35PM -0700, Dan Williams wrote: > > Jason Gunthorpe wrote: > > > On Wed, Sep 07, 2022 at 11:43:52AM -0700, Dan Williams wrote: > > > > > > > It is still the case that while waiting for the page to go idle it is > > > > associated with its given file / inode. It is possible that > > > > memory-failure, or some other event that requires looking up the page's > > > > association, fires in that time span. > > > > > > Can't the page->mapping can remain set to the address space even if it is > > > not installed into any PTEs? Zap should only remove the PTEs, not > > > clear the page->mapping. > > > > > > Or, said another way, page->mapping should only change while the page > > > refcount is 0 and thus the filesystem is completely in control of when > > > it changes, and can do so under its own locks > > > > > > If the refcount is 0 then memory failure should not happen - it would > > > require someone accessed the page without referencing it. The only > > > thing that could do that is the kernel, and if the kernel is > > > referencing a 0 refcount page (eg it got converted to meta-data or > > > something), it is probably not linked to an address space anymore > > > anyhow? > > > > First, thank you for helping me think through this, I am going to need > > this thread in 6 months when I revisit this code. > > > > I agree with the observation that page->mapping should only change while > > the reference count is zero, but my problem is catching the 1 -> 0 in > > its natural location in free_zone_device_page(). That and the fact that > > the entry needs to be maintained until the page is actually disconnected > > from the file to me means that break layouts holds off truncate until it > > can observe the 0 refcount condition while holding filesystem locks, and > > then the final truncate deletes the mapping entry which is already at 0. > > Okay, that makes sense to me.. but what is "entry need to be > maintained" mean? I am talking about keeping an entry in the address_space until that page is truncated out of the file, and I think the "zapped but not truncated" state needs a new Xarray-value flag (DAX_ZAP) to track it. So the life-cycle of a DAX page that is truncated becomes: 0/ devm_memremap_pages() to instantiate the page with _refcount==0 1/ dax_insert_entry() and vmf_insert_mixed() add an entry to the address_space and install a pte for the page. _refcount==1. 2/ gup elevates _refcount to 2 3/ truncate or punch hole attempts to free the DAX page 4/ break layouts zaps the ptes, drops the reference from 1/, and waits for _refcount to drop to zero while holding fs locks. 5/ at the 1 -> 0 transition the address_space entry is tagged with a new flag DAX_ZAP to track that this page is unreferenced, but still associated with the mapping until the final truncate. I.e. the DAX_ZAP flag lets the fsdax core track when it has already dropped a page reference, but still has use for things like memory-failure to opportunistically use page->mapping on a 0-reference page. I think this could be done without the DAX_ZAP flag, but I want to have some safety to catch programming errors where the truncate path finds entries already at a zero reference count without having first been zapped. > > I.e. break layouts waits until _refcount reaches 0, but entry removal > > still needs one more dax_delete_mapping_entry() event to transitition to > > the _refcount == 0 plus no address_space entry condition. Effectively > > simulating _mapcount with address_space tracking until DAX pages can > > become vm_normal_page(). > > This I don't follow.. Who will do the one more > dax_delete_mapping_entry()? The core of dax_delete_mapping_entry() is __dax_invalidate_entry(). I am thinking something like __dax_invalidate_entry(mapping, index, ZAP) is called for break layouts and __dax_invalidate_entry(mapping, index, TRUNC) is called for finally disconnecting that page from its mapping. > > I'm not sure what it has to do with normal_page? > This thread is mainly about DAX slowly reinventing _mapcount that gets managed in all the right places for a normal_page. Longer term I think we either need to get back to the page-less DAX experiment and find a way to unwind some of this page usage creep, or cut over to finally making DAX-pages be normal pages and delete most of the special case handling in fs/dax.c. I am open to discussing the former, but I think the latter is more likely.