On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:

> In this patch series, we add a pte_refcount field to the struct page of page
> table to track how many users of PTE page table. Similar to the mechanism of
> page refcount, the user of PTE page table should hold a refcount to it before
> accessing. The PTE page table page will be freed when the last refcount is
> dropped.

So, this approach basically adds two atomics on every PTE map.

If I have it right, the reason that zap cannot clean the PTEs today is
because zap cannot obtain the mmap lock, due to a lock ordering issue
with the inode lock vs the mmap lock.

If it could obtain the mmap lock, then it could do the zap using the
write side, as unmapping a vma does.

Rather than adding a new "lock" to every PTE, I wonder if it would be
more efficient to break up the mmap lock and introduce a specific
rwsem for the page table itself, in addition to the PTL. Currently the
mmap lock is protecting both the vma list and the page table.

I think that would allow the lock ordering issue to be resolved, and
zap could obtain a page table rwsem.

Compared to two atomics per PTE, this would be just two atomics per
page table walk operation, it is conceptually a lot simpler, and it
would allow freeing all the page table levels, not just PTEs.

?

Jason