On Wed, Sep 29, 2021 at 09:36:52PM -0300, Jason Gunthorpe wrote:

> Why would DAX want to do this in the first place?? This means the
> address space zap is much more important than just speeding up
> destruction, it is essential for correctness since the PTEs are not
> holding refcounts naturally...

It is not really for this series to fix, but I think the whole thing is
probably racy once you start allowing pte_special pages to be accessed
by GUP.

If we look at unmapping the PTE relative to GUP fast, the important
sequence is how the TLB flushing doesn't decrement the page refcount
until after it knows any concurrent GUP fast has completed. This is arch
specific, e.g. it could be done async through a call_rcu handler.

This ensures that pages can't cross back into the free pool and be
reallocated until we know for certain that nobody is walking the PTEs
and could potentially take an additional reference on them. The scheme
cannot rely on the page refcount being 0 because once a page goes into
the free pool it could be immediately reallocated back to a non-zero
refcount.

A DAX user that simply does an address space invalidation doesn't
sequence itself with any of this mechanism. So we can race with a
thread doing GUP fast and another thread recycling the page into
another use, creating a leakage of the page from one security context
to another.

This seems to be made worse for the pgmap stuff due to the wonky
refcount usage: at least if the refcount had dropped to zero, GUP fast
would be blocked for a time, but even that doesn't happen.

In short, I think using pte special for anything that can be returned
by GUP fast (and maybe even GUP!) is racy/wrong. We must have the normal
refcount mechanism work for correctness of the recycling flow.

I don't know why DAX did this. I think we should be talking about
undoing all of it, not just the wonky refcounting Alistair and Felix
are working on, but also the use of MIXEDMAP and pte special for struct
page backed memory.

Jason
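
To make the "refcount being 0 is not enough" point concrete, here is a
toy userspace model of one interleaving. It is only a sketch: it is not
kernel code, every toy_* name is made up for illustration, and the race
is replayed by hand in a single thread rather than with real
concurrency. It only shows that a speculative "get unless zero" on a
stale page pointer is refused while the page sits in the free pool, but
succeeds again as soon as the page has been handed to a new owner.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page (hypothetical, not the kernel layout). */
struct toy_page {
	int refcount;	/* 0 models "sitting in the free pool" */
	int owner;	/* security context that currently owns the page */
};

/* Models a speculative get: refused only while the refcount is zero. */
static bool toy_try_get(struct toy_page *p)
{
	if (p->refcount == 0)
		return false;
	p->refcount++;
	return true;
}

/* Drops one reference; the caller must actually hold one. */
static void toy_put(struct toy_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/* Models the allocator handing a free page to a new owner. */
static void toy_realloc(struct toy_page *p, int new_owner)
{
	assert(p->refcount == 0);
	p->refcount = 1;
	p->owner = new_owner;
}

/*
 * Replays one interleaving by hand. Returns the owner seen by the
 * lockless walker, or -1 if its speculative get was refused.
 */
static int toy_race(bool realloc_before_get)
{
	struct toy_page page = { .refcount = 1, .owner = 1 };
	struct toy_page *stale = &page;	/* walker sampled the PTE here */

	toy_put(&page);			/* unmap path drops the last ref */
	if (realloc_before_get)
		toy_realloc(&page, 2);	/* page recycled to context 2 */

	if (!toy_try_get(stale))
		return -1;
	return stale->owner;
}
```

In this model toy_race(false) returns -1 (the walker is turned away
while the refcount is zero), while toy_race(true) returns 2 (the walker
ends up holding a reference on a page that now belongs to context 2).
Nothing in the model stands in for the TLB flush sequencing described
above, which is the piece that is meant to keep the second case from
happening.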