Dan Williams <dan.j.williams@xxxxxxxxx> writes: > Alistair Popple wrote: > [..] > >> >> > It follows that that the DMA-idle condition still needs to look for the >> >> > case where the refcount is > 1 rather than 0 since refcount == 1 is the >> >> > page-mapped-but-DMA-idle condition. >> >> Because if the DAX page-cache holds a reference the refcount won't go to >> zero until dax_delete_mapping_entry() is called. However this interface >> seems really strange to me - filesystems call >> dax_layout_busy_page()/dax_wait_page_idle() to make sure both user-space >> and DMA[1] have finished with the page, but not the DAX code which still >> has references in it's page-cache. > > First, I appreciate the clarification that I was mixing up "mapped" > (elevated map count) with, for lack of a better term, "tracked" (mapping > entry valid). > > So, to repeat back to you what I understand now, the proposal is to > attempt to allow _count==0 as the DMA idle condition, but still have the > final return of the block to the allocator (fs allocator) occur after > dax_delete_mapping_entry(). Right, that is what I would like to achieve if possible. The outstanding question I think is "should the DAX page-cache have a reference on the page?". Or to use your terminology below "if a pfn is tracked should pfn_to_page(pfn)->_refcount == 0 or 1?" This version implements it as being zero because altering that requires re-ordering all the existing filesystem and mm users of dax_layout_busy_range() and dax_delete_mapping_entry(). Based on this discussion though I'm beginning to think it probably should be one, but I haven't been able to make that work yet. >> Is there some reason for this? In order words why can't the interface to >> the filesystem be something like calling dax_break_layouts() which >> ensures everything, including core FS DAX code, has finished with the >> page(s) in question? I don't see why that wouldn't work for at least >> EXT4 and XFS (FUSE seemed a bit different but I haven't dug too deeply). >> >> If we could do that dax_break_layouts() would essentially: >> 1. unmap userspace via eg. unmap_mapping_pages() to drive the refcount >> down. > > Am I missing where unmap_mapping_pages() drops the _count? I can see > where it drops _mapcount. I don't think that matters for the proposal, > but that's my last gap in tracking the proposed refcount model. It is suitably obtuse due to MMU_GATHER. unmap_mapping_pages() drops the folio/page reference after flushing the TLB. Ie: => tlb_finish_mmu => tlb_flush_mmu => __tlb_batch_free_encoded_pages => free_pages_and_swap_cache => folios_put_refs >> 2. delete the DAX page-cache entry to remove its refcount. >> 3. wait for DMA to complete by waiting for the refcount to hit zero. >> >> The problem with the filesystem truncate code at the moment is steps 2 >> and 3 are reversed so step 3 has to wait for a refcount of 1 as you >> pointed out previously. But does that matter? Are there ever cases when >> a filesystem needs to wait for the page to be idle but maintain it's DAX >> page-cache entry? > > No, not that I can think of. The filesystem just cares that the page was > seen as part of the file at some point and that it is holding locks to > keep the block associated with that page allocated to the file until it > can complete its operation. > > I think what we are talking about is a pfn-state not a page state. If > the block-pfn-page lifecycle from allocation to free is deconstructed as: > > block free > block allocated > pfn untracked > pfn tracked > page free > page busy > page free > pfn untracked > block free > > ...then I can indeed see cases where there is pfn metadata live even > though the page is free. > > So I think I was playing victim to the current implementation that > assumes that "pfn tracked" means the page is allocated and that > pfn_to_folio(pfn)->mapping is valid and not NULL. > > All this to say I am at least on the same page as you that _count == 0 > can be used as the page free state even if the pfn tracking goes through > delayed cleanup. Great, and I like this terminology of pfn tracked, etc. > However, if vmf_insert_XXX is increasing _count then, per my > unmap_mapping_pages() question above, I think dax_wait_page_idle() needs > to call try_to_unmap() to drop that _count, right? At the moment filesystems open-code their own version of XXXX_break_layouts() which typically calls dax_layout_busy_page() followed by dax_wait_page_idle(). The former will call unmap_mapping_range(), which for shared mappings I thought should be sufficient to find and unmap all page table references (and therefore folio/page _refcounts) based on the address space / index. I think try_to_unmap() would only be neccessary if we only had the folio and not the address space / index and therefore needed to find them from the mm (not fs!) rmap. > Similar observation for the memory_failure_dev_pagemap() path, I think > that path only calls unmap_mapping_range() not try_to_unmap() and > leaves _count elevated. As noted above unmap_mapping_range() will drop the refcount whenever it clears a pte/pmd mapping the folio and I think it should find all the pte's mapping it. > Lastly walking through the code again I think this fix is valid today: > > diff --git a/fs/dax.c b/fs/dax.c > index fcbe62bde685..48f2c85690e1 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -660,7 +660,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping, > pgoff_t end_idx; > XA_STATE(xas, &mapping->i_pages, start_idx); > > - if (!dax_mapping(mapping) || !mapping_mapped(mapping)) > + if (!dax_mapping(mapping)) > return NULL; > > /* If end == LLONG_MAX, all pages from start to till end of file */ > > > ...because unmap_mapping_pages() will mark the mapping as unmapped even > though there are "pfn tracked + page busy" entries to clean up. Yep, I noticed this today when I was trying to figure out why my re-ordering of the unmap/wait/untrack pfn wasn't working as expected. It still isn't for some other reason, and I'm still figuring out if the above is correct/valid, but it is on my list of things to look more closely at. > Appreciate you grappling this with me! Not at all! And thank you as well ... I feel like this has helped me a lot in getting a slightly better understanding of the problems. Also unless you react violently to anything I've said here I think I have enough material to post (and perhaps even explain!) the next version of this series. - Alistair