Jason Gunthorpe wrote: > On Sat, Sep 03, 2022 at 07:16:00PM -0700, Dan Williams wrote: > > tl;dr: Move the pin of 'struct dev_pagemap' instances from gup-time to > > map time, move the unpin of 'struct dev_pagemap' to truncate_inode_pages() > > for fsdax and devdax inodes, and use page_maybe_dma_pinned() to > > determine when filesystems can safely truncate DAX mappings vs DMA. > > > > The longer story is that DAX has caused friction with folio development > > and other device-memory use cases due to its hack of using a > > page-reference count of 1 to indicate that the page is DMA idle. That > > situation arose from the mistake of not managing DAX page reference > > counts at map time. The lack of page reference counting at map time grew > > organically from the original DAX experiment of attempting to manage DAX > > mappings without page structures. The page lock, dirty tracking and > > other entry management was supported sans pages. However, the page > > support was then bolted on incrementally so solve problems with gup, > > memory-failure, and all the other kernel services that are missing when > > a pfn does not have an associated page structure. > > > > Since then John has led an effort to account for when a page is pinned > > for DMA vs other sources that elevate the reference count. The > > page_maybe_dma_pinned() helper slots in seamlessly to replace the need > > to track transitions to page->_refount == 1. > > > > The larger change in this set comes from Jason's observation that > > inserting DAX mappings without any reference taken is a bug. So > > dax_insert_entry(), that fsdax uses, is updated to take 'struct > > dev_pagemap' references, and devdax is updated to reuse the same. > > It wasn't pagemap references that were the problem, it was struct page > references. > > pagemap is just something that should be ref'd in the background, as > long as a struct page has a positive reference the pagemap should be > considered referenced, IMHO free_zone_device_page() should be dealing > with this - put the pagemap after calling page_free(). Yes. I think I got caught admiring the solution of the page_maybe_dma_pinned() conversion for replacing the ugly observation of the 2 -> 1 refcount transition, and then introduced this breakage. I will rework this to catch the 0 to 1 transition of the refcount for incrementing and use free_zone_device_page() to drop the pgmap reference. > Pagemap is protecting page->pgmap from UAF so we must ensure we hold > it when we do pgmap->ops > > That should be the only put, and it should pair with the only get > which happens when the driver takes a 0 refcount page out of its free > list and makes it have a refcount of 1. > > > page mapping helpers. One of the immediate hurdles is the usage of > > pmd_devmap() to distinguish large page mappings that are not transparent > > huge pages. > > And this is because the struct page refcounting is not right :| > > I had thought the progression would be to make fsdax use compound > folios, install compound folios in the PMD, remove all the special > case refcounting for DAX from the pagetable code, then address the > pgmap issue from the basis of working page->refcount, eg by putting a > pgmap put in right after the op->page_free call. As far as I can see as long as the pgmap is managed at map and free_zone_device_page() time then the large folio conversion can come later. > Can we continue to have the weird page->refcount behavior and still > change the other things? No at a minimum the pgmap vs page->refcount problem needs to be solved first.