RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

"Kasireddy, Vivek" <vivek.kasireddy@xxxxxxxxx> · Tue, 22 Aug 2023 06:14:43 +0000

Hi Alistair,

> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the
> >> > > > issue. Although, I do not have THP enabled (or built-in), shmem
> >> > > > does not evict the pages after hole punch as noted in the comment
> >> > > > in shmem_fallocate():
> >> > >
> >> > > This is the source of all your problems.
> >> > >
> >> > > Things that are mm-centric are supposed to track the VMAs and
> >> > > changes to the PTEs. If you do something in userspace and it
> >> > > doesn't cause the CPU page tables to change then it certainly
> >> > > shouldn't cause any mmu notifiers or hmm_range_fault changes.
> >> > I am not doing anything out of the blue in the userspace. I think the
> >> > behavior I am seeing with shmem (where an invalidation event
> >> > (MMU_NOTIFY_CLEAR) does occur because of a hole punch but the PTEs
> >> > don't really get updated) can arguably be considered an optimization.
> >>
> >> Your explanations don't make sense.
> >>
> >> If MMU_NOTIFY_CLEAR was sent but the PTEs were left present then:
> >>
> >> > > There should still be an invalidation notifier at some point when the
> >> > > CPU tables do eventually change, whenever that is. Missing that
> >> > > notification would be a bug.
> >> > I clearly do not see any notification getting triggered (from both
> >> > shmem_fault() and hugetlb_fault()) when the PTEs do get updated as
> >> > the hole is refilled due to writes. Are you saying that there needs
> >> > to be an invalidation event (MMU_NOTIFY_CLEAR?) dispatched at this
> >> > point?
> >>
> >> You don't get to get shmem_fault in the first place.
> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch)
> > is sent, hmm_range_fault() finds that the PTEs associated with the hole
> > are still pte_present(). I think it remains this way as long as there
> > are reads on the hole. Once there are writes, it triggers shmem_fault()
> > which results in PTEs getting updated but without any notification.
> 
> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> missing PTE. 
When running one of the udmabuf subtests (introduced in the third patch of
this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
In response, hmm_range_fault() is called from the udmabuf invalidate callback
to walk the PTEs associated with the hole. When this happens, the function
below returns HMM_PFN_VALID | HMM_PFN_WRITE for all of the PTEs associated
with the hole:
static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
                                                 pte_t pte)
{
        if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
                return 0;
        return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
}

As a result, hmm_pte_need_fault() always returns 0, and shmem_fault() never
gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
Also, the set of PFNs returned by hmm_range_fault() is the same set that
existed before the hole was punched.
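
To make the flag check concrete, here is a minimal userspace sketch of the
decision described above. It is a model, not kernel code: the MODEL_* bit
values, struct model_pte, and both function names are hypothetical, and the
decision rule in model_need_fault() is an assumption about how
hmm_pte_need_fault() behaves based on the observations in this mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit values for this model only; the kernel's real HMM_PFN_*
 * and HMM_NEED_* constants are defined differently. */
#define MODEL_PFN_VALID        (1UL << 0)
#define MODEL_PFN_WRITE        (1UL << 1)
#define MODEL_REQ_FAULT        (1UL << 2)
#define MODEL_REQ_WRITE        (1UL << 3)
#define MODEL_NEED_FAULT       (1U << 0)
#define MODEL_NEED_WRITE_FAULT (1U << 1)

/* Hypothetical stand-in for a PTE: only the two properties that matter here. */
struct model_pte {
	bool present;
	bool writable;
};

/* Mirrors the shape of pte_to_hmm_pfn_flags() quoted earlier in this mail. */
static unsigned long model_pte_flags(struct model_pte pte)
{
	if (!pte.present)
		return 0;
	return pte.writable ? (MODEL_PFN_VALID | MODEL_PFN_WRITE)
			    : MODEL_PFN_VALID;
}

/* Assumed rule: fault only when the CPU flags fall short of the request. */
static unsigned int model_need_fault(unsigned long req, unsigned long cpu)
{
	if (!(req & MODEL_REQ_FAULT))
		return 0;
	if ((req & MODEL_REQ_WRITE) && !(cpu & MODEL_PFN_WRITE))
		return MODEL_NEED_FAULT | MODEL_NEED_WRITE_FAULT;
	if (!(cpu & MODEL_PFN_VALID))
		return MODEL_NEED_FAULT;
	return 0;
}
```

Under this model, a present and writable entry yields 0 even when both
request flags are set, which matches the observation that no fault is
requested while the old PTEs are still in place.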

> Subsequent writes will just upgrade PTE permissions
> assuming the read didn't map them RW to begin with. If you want to
> actually see the hole with hmm_range_fault() don't specify
> HMM_PFN_REQ_FAULT (or _WRITE).
> 
> >>
> >> If they were marked non-present during the CLEAR then the shadow side
> >> remains non-present until it gets its own fault.
> >>
> >> If they were made non-present without an invalidation then that is a
> >> bug.
> >>
> >> > > hmm_range_fault() is the correct API to use if you are working with
> >> > > notifiers. Do not hack something together using pin_user_pages.
> >>
> >> > I noticed that hmm_range_fault() does not seem to be working as
> >> > expected given that it gets stuck (hangs) while walking hugetlb pages.
> >>
> >> You are the first to report that, it sounds like a serious bug. Please
> >> try to fix it.
> >>
> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> >> > do get updated due to writes is the crux of the issue
> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> >> > other kind of notification would help in fixing this issue.
> >>
> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> >> serious, and it almost certainly is a missing invalidation.
> >>
> >> Basically, anything that changes a PTE must eventually trigger an
> >> invalidation. It is illegal to change a PTE from one present value to
> >> another present value without invalidation notification.
> >>
> >> It is not surprising something would be missed here.
> > As you suggest, it looks like the root-cause of this issue is the missing
> > invalidation notification when the PTEs are changed from one present
> 
> I don't think there's a missing invalidation here. You say you're seeing
> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
> cleared. When else do you expect a notification?
Given that we are finding PTEs that are still pte_present() even after
MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
needs to be sent after the PTEs are updated when new pages are faulted in.

However, it just occurred to me that maybe the behavior I am seeing is not
unexpected as it might be a timing issue that has to do with when the PTEs
are walked. Let me explain. Here is what shmem does when a hole is punched:
                if ((u64)unmap_end > (u64)unmap_start)
                        unmap_mapping_range(mapping, unmap_start,
                                            1 + unmap_end - unmap_start, 0);
                shmem_truncate_range(inode, offset, offset + len - 1);

IIUC, the invalidate callback is called from unmap_mapping_range(), but
the page removal does not happen until shmem_truncate_range() gets
called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
I might see different results, as the PTEs would probably no longer be present.
To test this theory, I would have to schedule a workqueue work function from the
invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
I get a chance, after addressing some of the locking concerns associated with
pairing static/dynamic dmabuf exporters and importers.
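
The timing theory above can be illustrated with a small userspace model. It
only encodes the hypothesis as stated (callback fires first, removal
completes afterwards) and does not claim to reflect what the kernel does.
Every name here (struct model_state, model_punch_hole(), the one-slot
"workqueue") is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state for the hypothesis; not a kernel structure. */
struct model_state {
	bool page_present;           /* stands in for "PTE still present" */
	bool seen_in_callback;       /* result of a walk inside the callback */
	bool seen_in_deferred_walk;  /* result of a walk run after the punch */
	void (*pending)(struct model_state *); /* one-slot stand-in for a wq */
};

/* Deferred work: sample the state once the hole punch has completed. */
static void model_deferred_walk(struct model_state *s)
{
	s->seen_in_deferred_walk = s->page_present;
}

/* Invalidate callback: walk immediately, and also queue a later walk. */
static void model_invalidate_cb(struct model_state *s)
{
	s->seen_in_callback = s->page_present;
	s->pending = model_deferred_walk;
}

/* Hypothesised ordering: notify from the unmap step, remove afterwards. */
static void model_punch_hole(struct model_state *s)
{
	model_invalidate_cb(s);   /* assumed: fired from the unmap step */
	s->page_present = false;  /* assumed: done by the truncate step */
}

/* Run the queued work, if any, after the hole punch has returned. */
static void model_run_pending(struct model_state *s)
{
	void (*fn)(struct model_state *) = s->pending;

	s->pending = NULL;
	if (fn)
		fn(s);
}
```

In this model, the walk done inside the callback still sees the page, while
the deferred walk does not. Whether the real code paths behave this way is
exactly what the workqueue experiment would need to confirm.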

Thanks,
Vivek

> 
> > value to another. I'd like to fix this issue eventually but I first need to
> > focus on addressing udmabuf page migration (out of movable zone)
> > and also look into the locking concerns Daniel mentioned about pairing
> > static and dynamic dmabuf exporters and importers.
> >
> > Thanks,
> > Vivek