On 30.08.22 21:57, Jason Gunthorpe wrote:
> On Tue, Aug 30, 2022 at 08:53:06PM +0200, David Hildenbrand wrote:
>> On 30.08.22 20:45, Jason Gunthorpe wrote:
>>> On Tue, Aug 30, 2022 at 08:23:52PM +0200, David Hildenbrand wrote:
>>>> ... and looking into the details of TLB flush and GUP-fast
>>>> interaction nowadays, that case is no longer relevant. A TLB flush
>>>> is no longer sufficient to stop concurrent GUP-fast ever since we
>>>> introduced generic RCU GUP-fast.
>>>
>>> Yes, we've had RCU GUP fast for a while, and it is more widely used
>>> now, IIRC.
>>>
>>> It has been a bit, but if I remember, GUP fast in RCU mode worked on
>>> a few principles:
>>>
>>>  - The PTE page must not be freed without RCU
>>>  - The PTE page content must be convertible to a struct page using
>>>    the usual rules (eg PTE Special)
>>>  - That struct page refcount may go from 0->1 inside the RCU
>>>  - In the case the refcount goes from 0->1 there must be sufficient
>>>    barriers such that GUP fast observing the refcount of 1 will also
>>>    observe the PTE entry has changed. ie before the refcount is
>>>    dropped in the zap it has to clear the PTE entry, the refcount
>>>    decr has to be a 'release' and the refcount incr in gup fast has
>>>    to be an 'acquire'.
>>>  - The rest of the system must tolerate speculative refcount
>>>    increments from GUP on any random page
>>>
>>> The basic idea being that if GUP fast obtains a valid reference on a
>>> page *and* the PTE entry has not changed then everything is fine.
>>>
>>> The tricks with TLB invalidation are just a "poor man's" RCU, and
>>> arguably these days aren't really needed since I think we could make
>>> everything use real RCU always without penalty if we really wanted.
>>>
>>> Today we can create a unique 'struct pagetable_page' as Matthew has
>>> been doing in other places that guarantees a rcu_head is always
>>> available for every page used in a page table. Using that we could
>>> drop the code in the TLB flusher that allocates memory for the
>>> rcu_head and hopes for the best. (Or is the common struct page
>>> rcu_head already guaranteed to exist for pagetable pages nowadays?)
>>>
>>> IMHO that is the main reason we still have the non-RCU mode at all..
>>
>> Good, I managed to attract the attention of someone who understands
>> that machinery :)
>>
>> While validating whether GUP-fast and PageAnonExclusive code work
>> correctly, I started looking at the whole RCU GUP-fast machinery. I
>> do have a patch to improve PageAnonExclusive clearing (I think we're
>> missing memory barriers to make it work as expected in any possible
>> case), but I also stumbled eventually over a more generic issue that
>> might need memory barriers.
>>
>> Any thoughts on whether I am missing something or this is actually
>> missing memory barriers?
>
> I don't like the use of smp_mb very much, I deliberately chose the
> more modern language of release/acquire because it makes it a lot
> clearer what the barriers are doing..
>
> So, if we dig into it, using what I said above, the atomic refcount
> is:
>
>   gup_pte_range()
>    try_grab_folio()
>     try_get_folio()
>      folio_ref_try_add_rcu()
>       folio_ref_add_unless()
>        page_ref_add_unless()
>         atomic_add_unless()

Right, that seems to work as expected for checking the refcount after
clearing the PTE.

Unfortunately, it's not sufficient to identify whether a page may be
pinned, because the flow continues as

	folio = try_get_folio(page, refs)
	...
	if (folio_test_large(folio))
		atomic_add(refs, folio_pincount_ptr(folio));
	else
		folio_ref_add(folio, ...)

So I guess we'd need an smp_mb() before re-checking the PTE for that
case.
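Something like the following, showing only the large folio path (a
rough sketch; gup_fast_pin_sketch() is a made-up helper, not the real
try_grab_folio()/gup_pte_range(), just to spell out the ordering I have
in mind):

	/* Sketch of the ordering problem, not actual mm/gup.c code. */
	static bool gup_fast_pin_sketch(struct page *page, pte_t *ptep,
					pte_t orig_pte, int refs)
	{
		struct folio *folio = page_folio(page);

		/* Fully ordered RMW: implies smp_mb() before and after. */
		if (!folio_ref_add_unless(folio, refs, 0))
			return false;

		/* Pin accounting: plain atomic_add(), no ordering implied. */
		if (folio_test_large(folio))
			atomic_add(refs, folio_pincount_ptr(folio));

		/*
		 * Without a barrier here, the pincount update above is not
		 * guaranteed to be ordered before the PTE re-check below, so
		 * a concurrent unmap that clears the PTE afterwards might
		 * still not observe the folio as pinned.
		 */
		smp_mb();

		if (pte_val(orig_pte) != pte_val(*ptep)) {
			/* PTE changed under us, back off. */
			folio_put_refs(folio, refs);
			return false;
		}
		return true;
	}

IOW, the fully ordered folio_ref_add_unless() only covers the refcount
update itself; the pincount update is a plain RMW that doesn't return a
value, so it needs its own ordering against the PTE re-check.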
> So that wants to be an acquire
>
> The pairing release is in the page table code that does the put_page,
> it wants to be an atomic_dec_return() as a release.
>
> Now, we go and look at Documentation/atomic_t.txt to try to understand
> what the ordering semantics of the atomics we are using are, and
> become dazed and confused like me:

I read that 3 times and got dizzy. Thanks for double-checking, very
much appreciated!

>  ORDERING  (go read memory-barriers.txt first)
>  --------
>
>  - RMW operations that have a return value are fully ordered;
>
>  - RMW operations that are conditional are unordered on FAILURE,
>    otherwise the above rules apply.
>
>  Fully ordered primitives are ordered against everything prior and
>  everything subsequent. Therefore a fully ordered primitive is like
>  having an smp_mb() before and an smp_mb() after the primitive.
>
> So, I take that to mean that both atomic_add_unless() and
> atomic_dec_return() are "fully ordered", and "fully ordered" is a
> superset of acquire/release.
>
> Thus, we already have the necessary barriers integrated into the
> atomic being used.
>
> The smp_mb__after_atomic() stuff is to be used with atomics that
> don't return values; there are some examples in the doc.

Yes, I missed the point that RMW operations that return a value are
fully ordered and imply smp_mb() before / after.

--
Thanks,

David / dhildenb