Alistair Popple wrote:
[..]
> >
> > Was there a discussion I missed about why the conversion to typical
> > folios allows the page->share accounting to be dropped?
>
> The problem with keeping it is we now treat DAX pages as "normal"
> pages according to vm_normal_page(). As such we use the normal paths
> for unmapping pages.
>
> Specifically page->share accounting relies on PAGE_MAPPING_DAX_SHARED
> aka PAGE_MAPPING_ANON, which causes folio_test_anon(), PageAnon(),
> etc. to return true, leading to all sorts of issues in at least the
> unmap paths.

Oh, I missed that PAGE_MAPPING_DAX_SHARED aliases with
PAGE_MAPPING_ANON.

> There hasn't been a previous discussion on this, but given this is
> only used to print warnings it seemed easier to get rid of it. I
> probably should have called that out more clearly in the commit
> message though.
>
> > I assume this is because the page->mapping validation was dropped, which
> > I think might be useful to keep at least for one development cycle to
> > make sure this conversion is not triggering any of the old warnings.
> >
> > Otherwise, the ->share field of 'struct page' can also be cleaned up.
>
> Yes, we should also clean up the ->share field, unless you have an
> alternate suggestion to solve the above issue.

kmalloc minimum alignment is 8, so there is a spare low bit in
->mapping to do this?

---
diff --git a/fs/dax.c b/fs/dax.c
index c62acd2812f8..a70f081c32cb 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -322,7 +322,7 @@ static unsigned long dax_end_pfn(void *entry)
 
 static inline bool dax_page_is_shared(struct page *page)
 {
-        return page->mapping == PAGE_MAPPING_DAX_SHARED;
+        return folio_test_dax_shared(page_folio(page));
 }
 
 /*
@@ -331,14 +331,14 @@ static inline bool dax_page_is_shared(struct page *page)
  */
 static inline void dax_page_share_get(struct page *page)
 {
-        if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
+        if (!dax_page_is_shared(page)) {
                 /*
                  * Reset the index if the page was already mapped
                  * regularly before.
                  */
                 if (page->mapping)
                         page->share = 1;
-                page->mapping = PAGE_MAPPING_DAX_SHARED;
+                page->mapping = (void *)PAGE_MAPPING_DAX_SHARED;
         }
         page->share++;
 }
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1b3a76710487..21b355999ce0 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -666,13 +666,14 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
 #define PAGE_MAPPING_ANON       0x1
 #define PAGE_MAPPING_MOVABLE    0x2
 #define PAGE_MAPPING_KSM        (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-#define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+/* to be removed once typical page refcounting for dax proves stable */
+#define PAGE_MAPPING_DAX_SHARED 0x4
+#define PAGE_MAPPING_FLAGS      (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE | PAGE_MAPPING_DAX_SHARED)
 
 /*
  * Different with flags above, this flag is used only for fsdax mode.  It
  * indicates that this page->mapping is now under reflink case.
  */
-#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
 
 static __always_inline bool folio_mapping_flags(const struct folio *folio)
 {
@@ -689,6 +690,11 @@ static __always_inline bool folio_test_anon(const struct folio *folio)
         return ((unsigned long)folio->mapping & PAGE_MAPPING_ANON) != 0;
 }
 
+static __always_inline bool folio_test_dax_shared(const struct folio *folio)
+{
+        return ((unsigned long)folio->mapping & PAGE_MAPPING_DAX_SHARED) != 0;
+}
+
 static __always_inline bool PageAnon(const struct page *page)
 {
         return folio_test_anon(page_folio(page));
---

...and keep the validation around at least for one post-conversion
development cycle?
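If we do keep it, the paranoia could live behind CONFIG_DEBUG_VM so it
is trivial to delete later. Something like this completely untested
sketch (dax_debug_check_folio() is a name I just made up, and the
refcount check assumes the new "free at refcount 0" semantics):

static void dax_debug_check_folio(struct folio *folio,
                                  struct address_space *mapping, bool trunc)
{
        struct page *page = folio_page(folio, 0);

        if (!IS_ENABLED(CONFIG_DEBUG_VM))
                return;

        /* nothing should hold a reference once the entry is deleted */
        WARN_ON_ONCE(trunc && folio_ref_count(folio));

        /* shared pages aside, ->mapping should only ever point at us */
        if (!dax_page_is_shared(page))
                WARN_ON_ONCE(page->mapping && page->mapping != mapping);
}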
> > It does have implications for the dax dma-idle tracking though, see
> > below.
> >
> >> {
> >> -        unsigned long pfn;
> >> +        unsigned long order = dax_entry_order(entry);
> >> +        struct folio *folio = dax_to_folio(entry);
> >>
> >> -        if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >> +        if (!dax_entry_size(entry))
> >>                  return;
> >>
> >> -        for_each_mapped_pfn(entry, pfn) {
> >> -                struct page *page = pfn_to_page(pfn);
> >> -
> >> -                WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> >> -                if (dax_page_is_shared(page)) {
> >> -                        /* keep the shared flag if this page is still shared */
> >> -                        if (dax_page_share_put(page) > 0)
> >> -                                continue;
> >> -                } else
> >> -                        WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> >> -                page->mapping = NULL;
> >> -                page->index = 0;
> >> -        }
> >> +        /*
> >> +         * We don't hold a reference for the DAX pagecache entry for the
> >> +         * page. But we need to initialise the folio so we can hand it
> >> +         * out. Nothing else should have a reference either.
> >> +         */
> >> +        WARN_ON_ONCE(folio_ref_count(folio));
> >
> > Per above I would feel more comfortable if we kept the paranoia around
> > to ensure that all the pages in this folio have dropped all references
> > and cleared ->mapping and ->index.
> >
> > That paranoia can be placed behind a CONFIG_DEBUG_VM check, and we can
> > delete it in a follow-on development cycle, but in the meantime it
> > helps to prove the correctness of the conversion.
>
> I'm ok with paranoia, but as noted above the issue is that at a minimum
> page->mapping (and probably index) now needs to be valid for any code
> that might walk the page tables.

A quick look seems to say the confusion is limited to aliasing
PAGE_MAPPING_ANON.

> > [..]
> >> @@ -1189,11 +1165,14 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> >>          struct inode *inode = iter->inode;
> >>          unsigned long vaddr = vmf->address;
> >>          pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
> >> +        struct page *page = pfn_t_to_page(pfn);
> >>          vm_fault_t ret;
> >>
> >>          *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
> >>
> >> -        ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> >> +        page_ref_inc(page);
> >> +        ret = dax_insert_pfn(vmf, pfn, false);
> >> +        put_page(page);
> >
> > Per above I think it is problematic to have pages live in the system
> > without a refcount.
>
> I'm a bit confused by this - the pages have a reference taken on them
> when they are mapped. They only live in the system without a refcount
> when the mm considers them free (except for the bit between getting
> created in dax_associate_entry() and actually getting mapped, but as
> noted I will fix that).
>
> > One scenario where this might be needed is invalidate_inode_pages() vs
> > DMA. The invalidation should pause and wait for DMA pins to be dropped
> > before the mapping xarray is cleaned up and the dax folio is marked
> > free.
>
> I'm not really following this scenario, or at least how it relates to
> the comment above. If the page is pinned for DMA it will have taken a
> refcount on it and so the page won't be considered free/idle per
> dax_wait_page_idle() or any of the other mm code.

[ tl;dr: I think we're ok, analysis below, but I did talk myself into
the proposed dax_busy_page() changes indeed being broken and needing to
remain checking for refcount > 1, not > 0 ]

It's not the mm code I am worried about. It's the filesystem block
allocator staying in sync with the allocation state of the page.

fs/dax.c is charged with converting idle storage blocks to pfns to
mapped folios. Once they are mapped, DMA can pin the folio, but nothing
in fs/dax.c pins the mapping. In the pagecache case the page reference
is sufficient to keep the DMA-busy page from being reused. In the dax
case something needs to arrange for DMA to be idle before
dax_delete_mapping_entry().

However, looking at XFS, it indeed makes that guarantee. First it does
xfs_break_dax_layouts() and then it does truncate_inode_pages() =>
dax_delete_mapping_entry().

It follows that the DMA-idle condition still needs to look for the case
where the refcount is > 1 rather than 0, since refcount == 1 is the
page-mapped-but-DMA-idle condition.
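For reference, the current XFS side of that guarantee looks something
like this (quoting fs/xfs/xfs_file.c from memory and slightly abridged,
so treat it as a sketch, details may have drifted across versions).
Note that the wait condition is refcount == 1, i.e.
mapped-but-DMA-idle, not 0:

static int
xfs_break_dax_layouts(
        struct inode            *inode,
        bool                    *retry)
{
        struct page             *page;

        page = dax_layout_busy_page(inode->i_mapping);
        if (!page)
                return 0;

        *retry = true;
        return ___wait_var_event(&page->_refcount,
                        atomic_read(&page->_refcount) == 1,
                        TASK_INTERRUPTIBLE, 0, 0, xfs_wait_dax_page(inode));
}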
[..]
> >> @@ -1649,9 +1627,10 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >>          loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> >>          bool write = iter->flags & IOMAP_WRITE;
> >>          unsigned long entry_flags = pmd ? DAX_PMD : 0;
> >> -        int err = 0;
> >> +        int ret, err = 0;
> >>          pfn_t pfn;
> >>          void *kaddr;
> >> +        struct page *page;
> >>
> >>          if (!pmd && vmf->cow_page)
> >>                  return dax_fault_cow_page(vmf, iter);
> >> @@ -1684,14 +1663,21 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> >>          if (dax_fault_is_synchronous(iter, vmf->vma))
> >>                  return dax_fault_synchronous_pfnp(pfnp, pfn);
> >>
> >> -        /* insert PMD pfn */
> >> +        page = pfn_t_to_page(pfn);
> >
> > I think this is clearer if dax_insert_entry() returns folios with an
> > elevated reference count that is dropped when the folio is invalidated
> > out of the mapping.
>
> I presume this comment is for the next line:
>
> +        page_ref_inc(page);
>
> I can move that into dax_insert_entry(), but we would still need to
> drop it after calling vmf_insert_*() to ensure we get the 1 -> 0
> transition when the page is unmapped and therefore freed.
> Alternatively we can make it so vmf_insert_*() don't take references
> on the page, and instead ownership of the reference is transferred to
> the mapping. Personally I preferred having those functions take their
> own reference, but let me know what you think.

Oh, the model I was thinking of was that until vmf_insert_XXX()
succeeds the page was never allocated, because it was never mapped.
What happens with the code as proposed is that put_page() triggers
page-free semantics on vmf_insert_XXX() failures, right?

There is no need to invoke the page-free / final-put path on
vmf_insert_XXX() error because the storage-block / pfn never actually
transitioned into a page / folio.

> > [..]
> >> @@ -519,21 +529,3 @@ void zone_device_page_init(struct page *page)
> >>          lock_page(page);
> >>  }
> >>  EXPORT_SYMBOL_GPL(zone_device_page_init);
> >> -
> >> -#ifdef CONFIG_FS_DAX
> >> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> >> -{
> >> -        if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> >> -                return false;
> >> -
> >> -        /*
> >> -         * fsdax page refcounts are 1-based, rather than 0-based: if
> >> -         * refcount is 1, then the page is free and the refcount is
> >> -         * stable because nobody holds a reference on the page.
> >> -         */
> >> -        if (folio_ref_sub_return(folio, refs) == 1)
> >> -                wake_up_var(&folio->_refcount);
> >> -        return true;
> >
> > It follows from the refcount discussion above that there is an
> > argument to still keep this wakeup based on the 2->1 transition.
> > pagecache pages are refcount==1 when they are dma-idle but still
> > allocated. To keep the same semantics for dax, a dax_folio would have
> > an elevated refcount whenever it is referenced by a mapping entry.
>
> I'm not sold on keeping it as it doesn't seem to offer any benefit
> IMHO. I know both Jason and Christoph were keen to see it go so it
> would be good to get their feedback too. Also one of the primary goals
> of this series was to refcount the page normally so we could remove
> the whole "page is free with a refcount of 1" semantics.

The page is still free at refcount 0, no argument there. But, by
introducing a new rule that the page refcount is elevated while mapped
(as it should be), it follows that "page is DMA idle at refcount == 1",
right?

Otherwise, the current assumption that filesystems can have
dax_layout_busy_page_range() poll on the state of the pfn in the
mapping is broken, because page refcount == 0 also means no
page-to-mapping association.
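Or in code form, a sketch of the invariant I am arguing for
(dax_folio_is_dma_busy() is a made-up name, not something in this
series):

static bool dax_folio_is_dma_busy(struct folio *folio)
{
        /* 0 == free, 1 == mapped but DMA-idle, > 1 == DMA-busy */
        return folio_ref_count(folio) > 1;
}

i.e. dax_layout_busy_page_range() keeps looking for refcount > 1, and
the DMA-idle wakeup stays on the 2->1 transition.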