On Tue, Aug 30, 2022 at 08:53:06PM +0200, David Hildenbrand wrote: > On 30.08.22 20:45, Jason Gunthorpe wrote: > > On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote: > >> ... and looking into the details of TLB flush and GUP-fast interaction > >> nowadays, that case is no longer relevant. A TLB flush is no longer > >> sufficient to stop concurrent GUP-fast ever since we introduced generic > >> RCU GUP-fast. > > > > Yes, we've had RCU GUP fast for a while, and it is more widely used > > now, IIRC. > > > > It has been a bit, but if I remember, GUP fast in RCU mode worked on a > > few principles: > > > > - The PTE page must not be freed without RCU > > - The PTE page content must be convertable to a struct page using the > > usual rules (eg PTE Special) > > - That struct page refcount may go from 0->1 inside the RCU > > - In the case the refcount goes from 0->1 there must be sufficient > > barriers such that GUP fast observing the refcount of 1 will also > > observe the PTE entry has changed. ie before the refcount is > > dropped in the zap it has to clear the PTE entry, the refcount > > decr has to be a 'release' and the refcount incr in gup fast has be > > to be an 'acquire'. > > - The rest of the system must tolerate speculative refcount > > increments from GUP on any random page > > > The basic idea being that if GUP fast obtains a valid reference on a > > page *and* the PTE entry has not changed then everything is fine. > > > > The tricks with TLB invalidation are just a "poor mans" RCU, and > > arguably these days aren't really needed since I think we could make > > everything use real RCU always without penalty if we really wanted. > > > > Today we can create a unique 'struct pagetable_page' as Matthew has > > been doing in other places that guarentees a rcu_head is always > > available for every page used in a page table. Using that we could > > drop the code in the TLB flusher that allocates memory for the > > rcu_head and hopes for the best. (Or even is the common struct page > > rcu_head already guarenteed to exist for pagetable pages now a days?) > > > > IMHO that is the main reason we still have the non-RCU mode at all.. > > > Good, I managed to attract the attention of someone who understands that machinery :) > > While validating whether GUP-fast and PageAnonExclusive code work correctly, > I started looking at the whole RCU GUP-fast machinery. I do have a patch to > improve PageAnonExclusive clearing (I think we're missing memory barriers to > make it work as expected in any possible case), but I also stumbled eventually > over a more generic issue that might need memory barriers. > > Any thoughts whether I am missing something or this is actually missing > memory barriers? I don't like the use of smb_mb very much, I deliberately choose the more modern language of release/acquire because it makes it a lot clearer what barriers are doing.. So, if we dig into it, using what I said above, the atomic refcount is: gup_pte_range() try_grab_folio() try_get_folio() folio_ref_try_add_rcu() folio_ref_add_unless() page_ref_add_unless() atomic_add_unless() So that wants to be an acquire The pairing release is in the page table code that does the put_page, it wants to be an atomic_dec_return() as a release. Now, we go and look at Documentation/atomic_t.txt to try to understand what are the ordering semantics of the atomics we are using and become dazed-confused like me: ORDERING (go read memory-barriers.txt first) -------- - RMW operations that have a return value are fully ordered; - RMW operations that are conditional are unordered on FAILURE, otherwise the above rules apply. Fully ordered primitives are ordered against everything prior and everything subsequent. Therefore a fully ordered primitive is like having an smp_mb() before and an smp_mb() after the primitive. So, I take that to mean that both atomic_add_unless() and atomic_dec_return() are "fully ordered" and "fully ordered" is a super set of acquire/release. Thus, we already have the necessary barriers integrated into the atomic being used. The smb_mb_after_atomic stuff is to be used with atomics that don't return values, there are some examples in the doc Jason