Alistair Popple wrote: [..] >> I'm not really following this scenario, or at least how it relates to > >> the comment above. If the page is pinned for DMA it will have taken a > >> refcount on it and so the page won't be considered free/idle per > >> dax_wait_page_idle() or any of the other mm code. > > > > [ tl;dr: I think we're ok, analysis below, but I did talk myself into > > the proposed dax_busy_page() changes indeed being broken and needing to > > remain checking for refcount > 1, not > 0 ] > > > > It's not the mm code I am worried about. It's the filesystem block > > allocator staying in-sync with the allocation state of the page. > > > > fs/dax.c is charged with converting idle storage blocks to pfns to > > mapped folios. Once they are mapped, DMA can pin the folio, but nothing > > in fs/dax.c pins the mapping. In the pagecache case the page reference > > is sufficient to keep the DMA-busy page from being reused. In the dax > > case something needs to arrange for DMA to be idle before > > dax_delete_mapping_entry(). > > Ok. How does that work today? My current mental model is that something > has to call dax_layout_busy_page() whilst holding the correct locks to > prevent a new mapping being established prior to calling > dax_delete_mapping_entry(). Is that correct? Correct. dax_delete_mapping_entry() is invoked by the filesystem with inode locks held. See xfs_file_fallocate() where it takes the lock, calls xfs_break_layouts() and if that succeeds performs xfs_file_free_space() with the lock held. xfs_file_free_space() triggers dax_delete_mapping_entry() with knowledge that the mapping cannot be re-established until the lock is dropped. > > However, looking at XFS it indeed makes that guarantee. First it does > > xfs_break_dax_layouts() then it does truncate_inode_pages() => > > dax_delete_mapping_entry(). > > > > It follows that that the DMA-idle condition still needs to look for the > > case where the refcount is > 1 rather than 0 since refcount == 1 is the > > page-mapped-but-DMA-idle condition. > > Sorry, but I'm still not following this line of reasoning. If the > refcount == 1 the page is either mapped xor DMA-busy. No, my expectation is the refcount is 1 while the page has a mapping entry, analagous to an idle / allocated page cache page, and the refcount is 2 or more for DMA, get_user_pages(), or any page walker that takes a transient page pin. > is enough to conclude that the page cannot be reused because it is > either being accessed from userspace via a CPU mapping or from some > device DMA or some other in kernel user. Userspace access is not a problem, that access can always be safely revoked by unmapping the page, and that's what dax_layout_busy_page() does to force a fault and re-taking the inode + mmap locks so that the truncate path knows it has temporary exclusive access to the page, pfn, and storage-block association. > The current proposal is that dax_busy_page() returns true if refcount >= > 1, and dax_wait_page_idle() will wait until the refcount == > 0. dax_busy_page() will try and force the refcount == 0 by unmapping it, > but obviously can't force other pinners to release their reference hence > the need to wait. Callers should already be holding locks to ensure new > mappings can't be established and hence can't become DMA-busy after the > unmap. Am I missing a page_ref_dec() somewhere? Are you saying that dax_layout_busy_page() will find entries with ->mapping non-NULL and refcount == 0? [..] > >> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf, > >> >> if (dax_fault_is_synchronous(iter, vmf->vma)) > >> >> return dax_fault_synchronous_pfnp(pfnp, pfn); > >> >> > >> >> - /* insert PMD pfn */ > >> >> + page = pfn_t_to_page(pfn); > >> > > >> > I think this is clearer if dax_insert_entry() returns folios with an > >> > elevated refrence count that is dropped when the folio is invalidated > >> > out of the mapping. > >> > >> I presume this comment is for the next line: > >> > >> + page_ref_inc(page); > >> > >> I can move that into dax_insert_entry(), but we would still need to > >> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0 > >> transition when the page is unmapped and therefore > >> freed. Alternatively we can make it so vmf_insert_*() don't take > >> references on the page, and instead ownership of the reference is > >> transfered to the mapping. Personally I prefered having those > >> functions take their own reference but let me know what you think. > > > > Oh, the model I was thinking was that until vmf_insert_XXX() succeeds > > then the page was never allocated because it was never mapped. What > > happens with the code as proposed is that put_page() triggers page-free > > semantics on vmf_insert_XXX() failures, right? > > Right. And actually that means I can't move the page_ref_inc(page) into > what will be called dax_create_folio(), because an entry may have been > created previously that had a failed vmf_insert_XXX() which will > therefore have a zero refcount folio associated with it. I would expect a full cleanup on on vmf_insert_XXX() failure, not leaving a zero-referenced entry. > But I think that model is wrong. I think the model needs to be the page > gets allocated when the entry is first created (ie. when > dax_create_folio() is called). A subsequent free (ether due to > vmf_insert_XXX() failing or the page being unmapped or becoming > DMA-idle) should then delete the entry. > > I think that makes the semantics around dax_busy_page() nicer as well - > no need for the truncate to have a special path to call > dax_delete_mapping_entry(). I agree it would be lovely if the final put could clean up the mapping entry and not depend on truncate_inode_pages_range() to do that. ...but I do not immediately see how to get there when block, pfn, and page are so tightly coupled with dax. That's a whole new project to introduce that paradigm, no? The page cache case gets away with it by safely disconnecting the pfn+page from the block and then letting DMA final put_page() take its time. > > There is no need to invoke the page-free / final-put path on > > vmf_insert_XXX() error because the storage-block / pfn never actually > > transitioned into a page / folio. > > It's not mapping a page/folio that transitions a pfn into a page/folio > it is the allocation of the folio that happens in dax_create_folio() > (aka. dax_associate_new_entry()). So we need to delete the entry (as > noted above I don't do that currently) if the insertion fails. Yeah, deletion on insert failure makes sense. [..] > >> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page) > >> >> lock_page(page); > >> >> } > >> >> EXPORT_SYMBOL_GPL(zone_device_page_init); > >> >> - > >> >> -#ifdef CONFIG_FS_DAX > >> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs) > >> >> -{ > >> >> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX) > >> >> - return false; > >> >> - > >> >> - /* > >> >> - * fsdax page refcounts are 1-based, rather than 0-based: if > >> >> - * refcount is 1, then the page is free and the refcount is > >> >> - * stable because nobody holds a reference on the page. > >> >> - */ > >> >> - if (folio_ref_sub_return(folio, refs) == 1) > >> >> - wake_up_var(&folio->_refcount); > >> >> - return true; > >> > > >> > It follow from the refcount disvussion above that I think there is an > >> > argument to still keep this wakeup based on the 2->1 transitition. > >> > pagecache pages are refcount==1 when they are dma-idle but still > >> > allocated. To keep the same semantics for dax a dax_folio would have an > >> > elevated refcount whenever it is referenced by mapping entry. > >> > >> I'm not sold on keeping it as it doesn't seem to offer any benefit > >> IMHO. I know both Jason and Christoph were keen to see it go so it be > >> good to get their feedback too. Also one of the primary goals of this > >> series was to refcount the page normally so we could remove the whole > >> "page is free with a refcount of 1" semantics. > > > > The page is still free at refcount 0, no argument there. But, by > > introducing a new "page refcount is elevated while mapped" (as it > > should), it follows that "page is DMA idle at refcount == 1", right? > > No. The page is either mapped xor DMA-busy - ie. not free. If we want > (need?) to tell the difference we can use folio_maybe_dma_pinned(), > assuming the driver doing DMA has called pin_user_pages() as it should. > > That said I'm not sure why we care about the distinction between > DMA-idle and mapped? If the page is not free from the mm perspective the > block can't be reallocated by the filesystem. "can't be reallocated", what enforces that in your view? I am hoping it is something I am overlooking. In my view the filesystem has no idea of this page-to-block relationship. All it knows is that when it wants to destroy the page-to-block association, dax notices and says "uh, oh, this is my last chance to make sure the block can go back into the fs allocation pool so I need to wait for the mm to say that the page is exclusive to me (dax core) before dax_delete_mapping_entry() destroys the page-to-block association and the fs reclaims the allocation". > > Otherwise, the current assumption that fileystems can have > > dax_layout_busy_page_range() poll on the state of the pfn in the mapping > > is broken because page refcount == 0 also means no page to mapping > > association. > > And also means nothing from the mm (userspace mapping, DMA-busy, etc.) > is using the page so the page isn't busy and is free to be reallocated > right? Lets take the 'map => start dma => truncate => end dma' scenario. At the 'end dma' step, how does the filesystem learn that the block that it truncated, potentially hours ago, is now a free block? The filesystem thought it reclaimed the block when truncate completed. I.e. dax says, thou shalt 'end dma' => 'truncate' in all cases. Note "dma" can be replaced with "any non dax core page_ref".