Hi Alistair,

> > > >
> > >> >> > > > No, adding HMM_PFN_REQ_WRITE still doesn't help in fixing the issue.
> > >> >> > > > Although, I do not have THP enabled (or built-in), shmem does not evict
> > >> >> > > > the pages after hole punch as noted in the comment in shmem_fallocate():
> > >> >> > >
> > >> >> > > This is the source of all your problems.
> > >> >> > >
> > >> >> > > Things that are mm-centric are supposed to track the VMAs and changes to
> > >> >> > > the PTEs. If you do something in userspace and it doesn't cause the
> > >> >> > > CPU page tables to change then it certainly shouldn't cause any mmu
> > >> >> > > notifiers or hmm_range_fault changes.
> > >> >> > I am not doing anything out of the blue in the userspace. I think the behavior
> > >> >> > I am seeing with shmem (where an invalidation event (MMU_NOTIFY_CLEAR)
> > >> >> > does occur because of a hole punch but the PTEs don't really get updated)
> > >> >> > can arguably be considered an optimization.
> > >> >>
> > >> >> Your explanations don't make sense.
> > >> >>
> > >> >> If MMU_NOTIFY_CLEAR was sent but the PTEs were left present then:
> > >> >>
> > >> >> > > There should still be an invalidation notifier at some point when the
> > >> >> > > CPU tables do eventually change, whenever that is. Missing that
> > >> >> > > notification would be a bug.
> > >> >> > I clearly do not see any notification getting triggered (from both shmem_fault()
> > >> >> > and hugetlb_fault()) when the PTEs do get updated as the hole is refilled
> > >> >> > due to writes. Are you saying that there needs to be an invalidation event
> > >> >> > (MMU_NOTIFY_CLEAR?) dispatched at this point?
> > >> >>
> > >> >> You don't get to get shmem_fault in the first place.
> > >> > What I am observing is that even after MMU_NOTIFY_CLEAR (hole punch) is sent,
> > >> > hmm_range_fault() finds that the PTEs associated with the hole are still pte_present().
> > >> > I think it remains this way as long as there are reads on the hole. Once there are
> > >> > writes, it triggers shmem_fault() which results in PTEs getting updated but without
> > >> > any notification.
> > >>
> > >> Oh wait, this is shmem. The read from hmm_range_fault() (assuming you
> > >> specified HMM_PFN_REQ_FAULT) will trigger shmem_fault() due to the
> > >> missing PTE.
> > > When running one of the udmabuf subtests (introduced in the third patch of
> > > this series), I see that MMU_NOTIFY_CLEAR is sent when a hole is punched.
> > > As a response, hmm_range_fault() is called from the udmabuf invalidate callback,
> >
> > Actually I'm surprised that works. If you've set up an interval notifier
> > and are updating the notifier sequence numbers correctly I would expect
> > hmm_range_fault() to return -EBUSY until
> > mmu_notifier_invalidate_range_end() is called.
> >
> > It might be helpful to post the code you're testing with somewhere but
> > are you calling mmu_interval_read_begin() to start the critical section
> > and mmu_interval_set_seq() to update the sequence in another notifier?
> > I'm not at all convinced calling hmm_range_fault() from a notifier can
> > be made to work though.

It turns out that calling hmm_range_fault() from the invalidate callback
was indeed the problem, and the reason new pages were not faulted in. In
other words, the invalidate callback is not the right place to invoke
hmm_range_fault(), since the PTEs may not have been cleared yet at that
point.

> That could be part of the problem. I mean the way hmm_range_fault()
> is invoked from the invalidate callback is probably incorrect as you are
> suggesting.
> Anyway, here is the code I am testing with:
> static bool invalidate_udmabuf(struct mmu_interval_notifier *mn,
>                                const struct mmu_notifier_range *range_mn,
>                                unsigned long cur_seq)
> {
>         struct udmabuf_vma_range *range =
>                 container_of(mn, struct udmabuf_vma_range, range_mn);
>         struct udmabuf *ubuf = range->ubuf;
>         struct hmm_range hrange = {0};
>         unsigned long *pfns, num_pages, timeout;
>         int i, ret;
>
>         printk("invalidate; start = %lu, end = %lu\n",
>                range->start, range->end);
>
>         hrange.notifier = mn;
>         hrange.default_flags = HMM_PFN_REQ_FAULT;
>         hrange.start = max(range_mn->start, range->start);
>         hrange.end = min(range_mn->end, range->end);
>         num_pages = (hrange.end - hrange.start) >> PAGE_SHIFT;
>
>         pfns = kmalloc_array(num_pages, sizeof(*pfns), GFP_KERNEL);
>         if (!pfns)
>                 return true;
>
>         printk("invalidate; num pages = %lu\n", num_pages);
>
>         hrange.hmm_pfns = pfns;
>         timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>         do {
>                 hrange.notifier_seq = mmu_interval_read_begin(mn);
>
>                 mmap_read_lock(ubuf->vmm_mm);
>                 ret = hmm_range_fault(&hrange);
>                 mmap_read_unlock(ubuf->vmm_mm);
>                 if (ret) {
>                         if (ret == -EBUSY && !time_after(jiffies, timeout))
>                                 continue;
>                         break;
>                 }
>
>                 if (mmu_interval_read_retry(mn, hrange.notifier_seq))
>                         continue;
>         } while (ret);
>
>         if (!ret) {
>                 for (i = 0; i < num_pages; i++) {
>                         printk("hmm returned page = %p; pfn = %lu\n",
>                                hmm_pfn_to_page(pfns[i]),
>                                pfns[i] & ~HMM_PFN_FLAGS);
>                 }
>         }
>         return true;
> }
>

Doing the above from a workqueue worker function (scheduled after the
invalidate event) instead of from the invalidate callback lets
hmm_range_fault() fault in new pages. This means that, at least in my
use case, receiving MMU_NOTIFY_CLEAR indicates that the invalidation is
still in progress, not that it has completed. Sorry for the confusion.
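
To illustrate the ordering, here is a small userspace toy model. It is
only a sketch, not the udmabuf code: every toy_* name is made up for this
example, and toy_range_fault() merely stands in for hmm_range_fault()
with HMM_PFN_REQ_FAULT. It assumes, as Alistair described, that the
notifier fires before the PTE is actually cleared.

```c
#include <stdbool.h>

/* Toy state: one "PTE", the page id it maps, and a deferred-work flag. */
struct toy_mm {
	bool pte_present;      /* is a page currently mapped? */
	int  mapped_page;      /* page id the PTE points at */
	int  next_page;        /* id handed out by the next fault */
	bool work_pending;     /* set by the callback, consumed by the worker */
	int  seen_in_callback; /* page id observed from inside the callback */
};

/* Stand-in for a faulting walk: only allocates when the PTE is absent. */
static int toy_range_fault(struct toy_mm *mm)
{
	if (!mm->pte_present) {
		mm->mapped_page = mm->next_page++;
		mm->pte_present = true;
	}
	return mm->mapped_page;
}

/*
 * Stand-in for the invalidate callback. The PTE is still present here,
 * so a walk at this point returns the old page. The walk is kept only to
 * show the stale result; the useful part is deferring the real work.
 */
static void toy_invalidate(struct toy_mm *mm)
{
	mm->seen_in_callback = toy_range_fault(mm);
	mm->work_pending = true;
}

/* Stand-in for the hole punch: notify first, clear the PTE afterwards. */
static void toy_punch_hole(struct toy_mm *mm)
{
	toy_invalidate(mm);
	mm->pte_present = false;
}

/* Stand-in for the worker: runs after the punch, so it sees the hole. */
static int toy_worker(struct toy_mm *mm)
{
	if (!mm->work_pending)
		return -1; /* nothing was scheduled */
	mm->work_pending = false;
	return toy_range_fault(mm);
}
```

In this model, the walk inside toy_invalidate() reports the page that
was mapped before the punch, while toy_worker(), called once
toy_punch_hole() has returned, faults in a fresh one. That matches what
I see with the real callback versus the worker.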

Thanks,
Vivek

> static const struct mmu_interval_notifier_ops udmabuf_invalidate_ops = {
>         .invalidate = invalidate_udmabuf,
> };
>
> >
> > > to walk over the PTEs associated with the hole. When this happens, I noticed that
> > > the below function returns HMM_PFN_VALID | HMM_PFN_WRITE for all the
> > > PTEs associated with the hole.
> > > static inline unsigned long pte_to_hmm_pfn_flags(struct hmm_range *range,
> > >                                                  pte_t pte)
> > > {
> > >         if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
> > >                 return 0;
> > >         return pte_write(pte) ? (HMM_PFN_VALID | HMM_PFN_WRITE) : HMM_PFN_VALID;
> > > }
> > >
> > > As a result, hmm_pte_need_fault() always returns 0 and shmem_fault()
> > > never gets triggered despite specifying HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE.
> > > And, the set of PFNs returned by hmm_range_fault() are the same ones
> > > that existed before the hole was punched.
> > >
> > >> Subsequent writes will just upgrade PTE permissions
> > >> assuming the read didn't map them RW to begin with. If you want to
> > >> actually see the hole with hmm_range_fault() don't specify
> > >> HMM_PFN_REQ_FAULT (or _WRITE).
> > >>
> > >> >>
> > >> >> If they were marked non-present during the CLEAR then the shadow side
> > >> >> remains non-present until it gets its own fault.
> > >> >>
> > >> >> If they were made non-present without an invalidation then that is a
> > >> >> bug.
> > >> >>
> > >> >> > > hmm_range_fault() is the correct API to use if you are working with
> > >> >> > > notifiers. Do not hack something together using pin_user_pages.
> > >> >>
> > >> >> > I noticed that hmm_range_fault() does not seem to be working as expected
> > >> >> > given that it gets stuck (hangs) while walking hugetlb pages.
> > >> >>
> > >> >> You are the first to report that, it sounds like a serious bug. Please
> > >> >> try to fix it.
> > >> >>
> > >> >> > Regardless, as I mentioned above, the lack of notification when PTEs
> > >> >> > do get updated due to writes is the crux of the issue
> > >> >> > here. Therefore, AFAIU, triggering an invalidation event or some
> > >> >> > other kind of notification would help in fixing this issue.
> > >> >>
> > >> >> You seem to be facing some kind of bug in the mm, it sounds pretty
> > >> >> serious, and it almost certainly is a missing invalidation.
> > >> >>
> > >> >> Basically, anything that changes a PTE must eventually trigger an
> > >> >> invalidation. It is illegal to change a PTE from one present value to
> > >> >> another present value without invalidation notification.
> > >> >>
> > >> >> It is not surprising something would be missed here.
> > >> > As you suggest, it looks like the root-cause of this issue is the missing
> > >> > invalidation notification when the PTEs are changed from one present
> > >>
> > >> I don't think there's a missing invalidation here. You say you're seeing
> > >> the MMU_NOTIFY_CLEAR when hole punching which is when the PTE is
> > >> cleared. When else do you expect a notification?
> > > Oh, given that we are finding PTEs that are still pte_present() even after
> > > MMU_NOTIFY_CLEAR is sent, the theory is that another MMU_NOTIFY_CLEAR
> > > needs to be sent after the PTEs are updated when new pages are faulted-in.
> > >
> > > However, it just occurred to me that maybe the behavior I am seeing is not
> > > unexpected as it might be a timing issue that has to do with when the PTEs
> > > are walked. Let me explain.
> > > Here is what shmem does when a hole is punched:
> > >         if ((u64)unmap_end > (u64)unmap_start)
> > >                 unmap_mapping_range(mapping, unmap_start,
> > >                                     1 + unmap_end - unmap_start, 0);
> > >         shmem_truncate_range(inode, offset, offset + len - 1);
> > >
> > > IIUC, the invalidate callback is called from unmap_mapping_range() but
> > > the page removal does not happen until shmem_truncate_range() gets
> > > called. So, if I were to call hmm_range_fault() after shmem_truncate_range(),
> > > I might see different results as the PTEs would probably no longer be present.
> > > In order to test this theory, I would have to schedule a wq thread func from the
> > > invalidate callback (to walk the PTEs after a slight delay). I'll try this out when
> > > I get a chance after addressing some of the locking concerns associated with
> > > pairing static/dynamic dmabuf exporters and importers.
> >
> > That sounds plausible. The PTE will actually be cleared in
> > unmap_mapping_range() after the mmu notifier is called. I'm curious how
> > hmm_range_fault() passes though.
> >
> > > Thanks,
> > > Vivek
> > >
> > >>
> > >> > value to another. I'd like to fix this issue eventually but I first need to
> > >> > focus on addressing udmabuf page migration (out of movable zone)
> > >> > and also look into the locking concerns Daniel mentioned about pairing
> > >> > static and dynamic dmabuf exporters and importers.
> > >> >
> > >> > Thanks,
> > >> > Vivek