On Wednesday, 16 June 2021 2:25:09 AM AEST Peter Xu wrote: > On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote: > > On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote: > > > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote: [...] > > > Do you think we can restore pte right before wr-protect or zap? Then all > > > things serializes with page lock (btw: it's already an insane userspace to > > > either unmap a page or wr-protect a page if it knows the device is using it!). > > > If these are the only two cases, it still sounds a cleaner approach to me than > > > the current approach. > > > > Perhaps we could but it would make {zap|change}_pte_range() much more complex as > > we can't sleep taking the page lock whilst holding the ptl, so we'd have to > > implement a retry scheme similar to copy_pte_range() in both those functions as > > well. > > Yes, but shouldn't be hard to do so, imho. E.g., see when __tlb_remove_page() > returns true in zap_pte_range(), so we already did something like that. IMHO > it's not uncommon to have such facilities as we do have requirements to sleep > during a spinlock critical section for a lot of places in mm, so we release > them when needed and retake. Agreed, it's not hard to do and it's a common enough pattern. However we decided that for such a specific application this (trying to take the lock or drop locks and retry) was too complex for copy_pte_range() so it seems like the same should apply here. Admittedly copy_pte_range() already had several other retry paths so perhaps it was adding yet another that made it relatively more complex. Overall I have been trying to minimise the impact on core mm code for this feature, and adding this pattern to zap_pte_range(), etc. would make it more complex for any future addition that requires locks to be dropped and retried so I guess in that sense it is no different. > > Given mmu_interval_read_begin/retry was IMHO added to solve this type of > > problem (freezing pte's to safely program device pte's) it seems like the > > better option rather than adding more complex code to generic mm paths. > > > > It's also worth noting i915 seems to use mmu_interval_read_begin/retry() with > > gup to sync mappings so this isn't an entirely new concept. I'm not an expert > > in that driver but I imagine changing gup to generate unconditional mmu notifier > > invalidates would also cause issues there. So I think overall this is the > > cleanest solution as it reduces the amount of code (particularly in generic mm > > paths). > > I could be wrong somewhere, but to me depending on mmu notifiers being > "accurate" in general is fragile.. > > Take an example of change_pte_range(), which will generate PROTECTION_VMA > notifies. Let's imaging an userspace calls mprotect() e.g. twice or even more > times with the same PROT_* and upon the same region, we know very possibly the > 2nd,3rd,... calls will generate those notifies with totally no change to the > pgtable at all as they're all done on the 1st shot. However we'll generate mmu > notifies anyways for the 2nd,3rd,... calls. It means mmu notifiers should > really be tolerant of false positives as it does happen, and such thing can be > triggered even from userspace system calls very easily like this. That's why I > think any kernel facility that depends on mmu notifiers being accurate is > probably not the right approach.. Argh, thanks. I was focused on the specifics of this series but I think I understand your point better now - that as a more general principle we can't assume notifiers are accurate. > But yeah as you said I think it's working as is with the series (I think the > follow_pmd_mask() checking pmd_trans_huge before calling split_huge_pmd is a > double safety-net for it, so even if the GUP split_huge_pmd got replaced with > __split_huge_pmd it should still work with the one-retry logic), not sure > whether it matters a lot, as it's not common mm path; I think I'll step back so > Andrew could still pick it up as wish, I'm just still not fully convinced it's > the best solution to have for a long term to depend on that.. Ok, thanks. I guess you have somewhat convinced me - depending on it for the long term might be a bit fragile. However as you say the current implementation does work and I am starting to look at support for PMD and file backed pages which require changes here anyway. So I am hoping Andrew can still take this (once rebased) as it would be easier for me to do those changes if the basic support and clean ups were already in place. > > > This also reminded me that right now the cpu pgtable recovery is lazy - it > > > happens either from fork() or a cpu page fault. Even after device finished > > > using it, swap ptes keep there. > > > > > > What if the device tries to do atomic op on the same page twice? I am not sure > > > whether it means we may also want to teach both GUP (majorly follow_page_pte() > > > for now before pmd support) and process of page_make_device_exclusive() with > > > understanding the device exclusive entries too? Another option seems to be > > > restoring pte after device finish using it, as long as the device knows when. > > > > I don't think we need to complicate follow_page_pte() with knowledge of > > exclusive entries. GUP will just restore the original pte via the normal > > fault path - follow_page_pte() will return NULL for an exclusive entry, > > resulting in handle_mm_path() getting called via faultin_page(). Therefore > > a driver calling make_device_exclusive() twice on the same page won't cause an > > issue. Also the device shouldn't fault on subsequent accesses if the exclusive > > entry is still in place anyway. > > Right, looks good then. > > > > > We can't restore the pte when the device is finished with it because there is > > no way of knowing when a device is done using an exclusive entry - device > > pte's work much the same as cpu pte's in that regard. > > I see, I feel like I understand how it works slightly better now, thanks. Feel free to ask if there are any more details you want, but there's nothing too magical going on here. > One last pure question: I see nouveau_atomic_range_fault() will call the other > nvif_object_ioctl() which seems to do the device pgtable mapping, am I right? Correct - that installs the page table mapping on the GPU. > Then I see the notifier is quickly removed before nouveau_atomic_range_fault() > returns. What happens if CPU access happens after mmu notifier removed? Or is > it not possible to happen? So there are two notifiers registered - this one and a process wide notifier (see nouveau_mn_ops). In this case the process wide notifier will get called to invalidate the access when the CPU fault removes the device exclusive entries. - Alistair > -- > Peter Xu >