On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote: > tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to > map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages() > for fsdax and devdax inodes, and use page_maybe_dma_pinned() to > determine when filesystems can safely truncate DAX mappings vs DMA. > > The longer story is that DAX has caused friction with folio development > and other device-memory use cases due to its hack of using a > page-reference count of 1 to indicate that the page is DMA idle. That > situation arose from the mistake of not managing DAX page reference > counts at map time. The lack of page reference counting at map time grew > organically from the original DAX experiment of attempting to manage DAX > mappings without page structures. The page lock, dirty tracking and > other entry management was supported sans pages. However, the page > support was then bolted on incrementally so solve problems with gup, > memory-failure, and all the other kernel services that are missing when > a pfn does not have an associated page structure. > > Since then John has led an effort to account for when a page is pinned > for DMA vs other sources that elevate the reference count. The > page_maybe_dma_pinned() helper slots in seamlessly to replace the need > to track transitions to page->_refount == 1. > > The larger change in this set comes from Jason's observation that > inserting DAX mappings without any reference taken is a bug. So > dax_insert_entry(), that fsdax uses, is updated to take 'struct > dev_pagemap' references, and devdax is updated to reuse the same. It wasn't pagemap references that were the problem, it was struct page references. pagemap is just something that should be ref'd in the background, as long as a struct page has a positive reference the pagemap should be considered referenced, IMHO free_zone_device_page() should be dealing with this - put the pagemap after calling page_free(). Pagemap is protecting page->pgmap from UAF so we must ensure we hold it when we do pgmap->ops That should be the only put, and it should pair with the only get which happens when the driver takes a 0 refcount page out of its free list and makes it have a refcount of 1. > page mapping helpers. One of the immediate hurdles is the usage of > pmd_devmap() to distinguish large page mappings that are not transparent > huge pages. And this is because the struct page refcounting is not right :| I had thought the progression would be to make fsdax use compound folios, install compound folios in the PMD, remove all the special case refcounting for DAX from the pagetable code, then address the pgmap issue from the basis of working page->refcount, eg by putting a pgmap put in right after the op->page_free call. Can we continue to have the weird page->refcount behavior and still change the other things? Jason