Am 2022-02-15 um 21:01 schrieb Jason Gunthorpe:
On Tue, Feb 15, 2022 at 05:49:07PM -0500, Felix Kuehling wrote:
Userspace does
1) mmap(MAP_PRIVATE) to allocate anon memory
2) something to trigger migration to install a ZONE_DEVICE page
3) munmap()
Who decrements the refcout on the munmap?
When a ZONE_DEVICE page is installed in the PTE is supposed to be
marked as pte_devmap and that disables all the normal page refcounting
during munmap().
fsdax makes this work by working the refcounts backwards, the page is
refcounted while it exists in the driver, when the driver decides to
remove it then unmap_mapping_range() is called to purge it from all
PTEs and then refcount is decrd. munmap/fork/etc don't change the
refcount.
Hmm, that just means, whether or not there are PTEs doesn't really
matter.
Yes, that is the FSDAX model
It should work the same as it does for DEVICE_PRIVATE pages. I'm not sure
where DEVICE_PRIVATE page's refcounts are decremented on unmap, TBH. But I
can't find it in our driver, or in the test_hmm driver for that matter.
It is not the same as DEVICE_PRIVATE because DEVICE_PRIVATE uses swap
entries. The put_page for that case is here:
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
struct zap_details *details)
{
[..]
if (is_device_private_entry(entry) ||
is_device_exclusive_entry(entry)) {
struct page *page = pfn_swap_entry_to_page(entry);
if (unlikely(zap_skip_check_mapping(details, page)))
continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
if (is_device_private_entry(entry))
page_remove_rmap(page, false);
put_page(page);
However the devmap case will return NULL from vm_normal_page() and won't
do the put_page() embedded inside the __tlb_remove_page() in the
pte_present() block in the same function.
After reflecting for awhile, I think Christoph's idea is quite
good. Just make it so you don't set pte_devmap() on your pages and
then lets avoid pte_devmap for all refcount correct ZONE_DEVICE pages.
I'm not sure if pte_devmap is actually set for our DEVICE_COHERENT
pages. As far as I can tell, this comes from a bit in the pfn:
#define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
#define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
...
static inline bool pfn_t_devmap(pfn_t pfn)
{
const u64 flags = PFN_DEV|PFN_MAP;
return (pfn.val & flags) == flags;
}
In the case of DEVICE_COHERENT memory, the pfns correspond to real
physical memory addresses. I don't think they have those PFN_DEV|PFN_MAP
bits set.
Regards,
Felix
Jason