"Kasireddy, Vivek" <vivek.kasireddy@xxxxxxxxx> writes: > Hi Alistair, > >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the >> issue. >> >> > > > Although, I do not have THP enabled (or built-in), shmem does not >> evict >> >> > > > the pages after hole punch as noted in the comment in >> >> shmem_fallocate(): >> >> > > >> >> > > This is the source of all your problems. >> >> > > >> >> > > Things that are mm-centric are supposed to track the VMAs and >> changes >> >> to >> >> > > the PTEs. If you do something in userspace and it doesn't cause the >> >> > > CPU page tables to change then it certainly shouldn't cause any mmu >> >> > > notifiers or hmm_range_fault changes. >> >> > I am not doing anything out of the blue in the userspace. I think the >> >> behavior >> >> > I am seeing with shmem (where an invalidation event >> >> (MMU_NOTIFY_CLEAR) >> >> > does occur because of a hole punch but the PTEs don't really get >> updated) >> >> > can arguably be considered an optimization. >> >> >> >> Your explanations don't make sense. >> >> >> >> If MMU_NOTIFER_CLEAR was sent but the PTEs were left present then: >> >> >> >> > > There should still be an invalidation notifier at some point when the >> >> > > CPU tables do eventually change, whenever that is. Missing that >> >> > > notification would be a bug. >> >> > I clearly do not see any notification getting triggered (from both >> >> shmem_fault() >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled >> >> > due to writes. Are you saying that there needs to be an invalidation >> event >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point? >> >> >> >> You don't get to get shmem_fault in the first place. >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch) >> is sent, >> > hmm_range_fault() finds that the PTEs associated with the hole are still >> pte_present(). >> > I think it remains this way as long as there are reads on the hole. 
>> > Once there are writes, it triggers shmem_fault() which results in PTEs
>> > getting updated but without any notification.
>>
>> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
>> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
>> missing PTE.
> When running one of the udmabuf subtests (introduced in the third patch of
> this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> As a response, hmm_range_fault() is called from the udmabuf invalidate callback,

Actually I'm surprised that works. If you've set up an interval notifier
and are updating the notifier sequence numbers correctly I would expect
hmm_range_fault() to return -EBUSY until
mmu_notifier_invalidate_range_end() is called.

It might be helpful to post the code you're testing with somewhere but
are you calling mmu_interval_read_begin() to start the critical section
and mmu_interval_set_seq() to update the sequence in another notifier?
I'm not at all convinced calling hmm_range_fault() from a notifier can be
made to work though.

> to walk over the PTEs associated with the hole. When this happens, I noticed that
> the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
> PTEs associated with the hole.
> static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
>                                                  pte_t pte)
> {
>         if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
>                 return 0;
>         return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
> }
>
> As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
> And, the set of PFNs returned by hmm_range_fault() are the same ones
> that existed before the hole was punched.
>
>> Subsequent writes will just upgrade PTE permissions
>> assuming the read didn't map them RW to begin with. If you want to
>> actually see the hole with hmm_range_fault() don't specify
>> HMM_PFN_REQ_FAULT (or _WRITE).
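
To make the flag logic quoted above concrete, here is a small userspace
model of the two checks involved. It is only a sketch: the toy_* names,
the struct and the bit values are made up for illustration and are not
the kernel's definitions, and toy_need_fault() follows my reading of
hmm_pte_need_fault() rather than being a copy of it. It shows why a PTE
that is still present and writable produces no fault request even when
both REQ flags are set.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; these are NOT the kernel's encodings. */
enum {
	PFN_VALID        = 1 << 0,	/* stands in for HMM_PFN_VALID */
	PFN_WRITE        = 1 << 1,	/* stands in for HMM_PFN_WRITE */
	REQ_FAULT        = 1 << 0,	/* stands in for HMM_PFN_REQ_FAULT */
	REQ_WRITE        = 1 << 1,	/* stands in for HMM_PFN_REQ_WRITE */
	NEED_FAULT       = 1 << 0,
	NEED_WRITE_FAULT = 1 << 1,
};

/* Hypothetical toy PTE: only the properties the quoted helper inspects. */
struct toy_pte {
	bool present;
	bool protnone;
	bool write;
};

/* Mirrors the shape of the quoted pte_to_hmm_pfn_flags(). */
unsigned int toy_pte_to_flags(struct toy_pte pte)
{
	if (!pte.present || pte.protnone)
		return 0;
	return pte.write ? (PFN_VALID | PFN_WRITE) : PFN_VALID;
}

/* Decide whether the walker would have to fault, given request and CPU flags. */
unsigned int toy_need_fault(unsigned int req, unsigned int cpu)
{
	if (!(req & REQ_FAULT))
		return 0;			/* caller only wants a snapshot */
	if ((req & REQ_WRITE) && !(cpu & PFN_WRITE))
		return NEED_FAULT | NEED_WRITE_FAULT;
	if (!(cpu & PFN_VALID))
		return NEED_FAULT;
	return 0;				/* already valid and writable */
}
```

With a present, writable toy PTE the result is 0 for REQ_FAULT | REQ_WRITE,
which matches what Vivek reports. With a cleared PTE the same request
asks for a write fault, and with no REQ flags at all nothing is faulted
and the hole would simply show up as a zero flags entry.
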
>> >>
>> >> If they were marked non-present during the CLEAR then the shadow side
>> >> remains non-present until it gets its own fault.
>> >>
>> >> If they were made non-present without an invalidation then that is a
>> >> bug.
>> >>
>> >> > > hmm_range_fault() is the correct API to use if you are working with
>> >> > > notifiers. Do not hack something together using pin_user_pages.
>> >>
>> >> > I noticed that hmm_range_fault() does not seem to be working as expected
>> >> > given that it gets stuck (hangs) while walking hugetlb pages.
>> >>
>> >> You are the first to report that, it sounds like a serious bug. Please
>> >> try to fix it.
>> >>
>> >> > Regardless, as I mentioned above, the lack of notification when PTEs
>> >> > do get updated due to writes is the crux of the issue
>> >> > here. Therefore, AFAIU, triggering an invalidation event or some
>> >> > other kind of notification would help in fixing this issue.
>> >>
>> >> You seem to be facing some kind of bug in the mm, it sounds pretty
>> >> serious, and it almost certainly is a missing invalidation.
>> >>
>> >> Basically, anything that changes a PTE must eventually trigger an
>> >> invalidation. It is illegal to change a PTE from one present value to
>> >> another present value without invalidation notification.
>> >>
>> >> It is not surprising something would be missed here.
>> > As you suggest, it looks like the root-cause of this issue is the missing
>> > invalidation notification when the PTEs are changed from one present
>>
>> I don't think there's a missing invalidation here. You say you're seeing
>> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
>> cleared. When else do you expect a notification?
> Oh, given that we are finding PTEs that are still pte_present() even after
> MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
> needs to be sent after the PTEs are updated when new pages are faulted-in.
>
> However, it just occurred to me that maybe the behavior I am seeing is not
> unexpected as it might be a timing issue that has to do with when the PTEs
> are walked. Let me explain. Here is what shmem does when a hole is punched:
>         if ((u64)unmap_end > (u64)unmap_start)
>                 unmap_mapping_range(mapping, unmap_start,
>                                     1 + unmap_end - unmap_start, 0);
>         shmem_truncate_range(inode, offset, offset + len - 1);
>
> IIUC, the invalidate callback is called from unmap_mapping_range() but
> the page removal does not happen until shmem_truncate_range() gets
> called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
> I might see different results as the PTEs would probably no longer be present.
> In order to test this theory, I would have to schedule a wq thread func from the
> invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
> I get a chance after addressing some of the locking concerns associated with
> pairing static/dynamic dmabuf exporters and importers.

That sounds plausible. The PTE will actually be cleared in
unmap_mapping_range() after the mmu notifier is called. I'm curious how
hmm_range_fault() passes though.

> Thanks,
> Vivek
>
>>
>> > value to another. I'd like to fix this issue eventually but I first need to
>> > focus on addressing udmabuf page migration (out of movable zone)
>> > and also look into the locking concerns Daniel mentioned about pairing
>> > static and dynamic dmabuf exporters and importers.
>> >
>> > Thanks,
>> > Vivek
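
For anyone who wants to look at the CPU-side behaviour of the hole punch
discussed above without the udmabuf test, a minimal userspace sketch
follows. It assumes a memfd stands in for the shmem file, and
punch_and_read() and its return codes are hypothetical names made up for
this example. It fills a shared mapping, punches out one page with
fallocate(), and reads that page back through the same mapping.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/falloc.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the punched page reads back as zero and its neighbours
 * keep their data, 1 if the contents are anything else, and -1 if the
 * setup or the punch itself failed (e.g. unsupported in this environment).
 */
int punch_and_read(void)
{
	const long page = sysconf(_SC_PAGESIZE);
	const size_t len = 4 * (size_t)page;
	unsigned char *map;
	int ret = -1;
	int fd;

	fd = memfd_create("hole-demo", 0);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0)
		goto out_close;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto out_close;

	/* Touch every page so the CPU page tables are populated. */
	memset(map, 0xab, len);

	/* Punch out the second page; KEEP_SIZE leaves the file length alone. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      (off_t)page, (off_t)page) < 0)
		goto out_unmap;

	/* Read through the mapping again and compare with what we wrote. */
	if (map[page] == 0 && map[0] == 0xab && map[2 * page] == 0xab)
		ret = 0;
	else
		ret = 1;

out_unmap:
	munmap(map, len);
out_close:
	close(fd);
	return ret;
}
```

If the PTE really is cleared in unmap_mapping_range() as described above,
the read of the punched page should fault and come back as zeroes rather
than the old 0xab pattern. This only covers what the CPU sees after
fallocate() has returned; it says nothing about what hmm_range_fault()
observes when it is called from inside the invalidate callback.
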