On Thu, Aug 03, 2023 at 09:50:35PM +0800, Alan Huang wrote:
> 
> > On Aug 1, 2023, at 04:06, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > 
> > On Sat, Jul 29, 2023 at 11:17:31PM +0800, Alan Huang wrote:
> >> 
> >>> On Jul 25, 2023, at 00:09, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>> 
> >>> On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote:
> >>>> 
> >>>>> On Jul 22, 2023, at 06:49, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>> 
> >>>>> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote:
> >>>>>> 
> >>>>>>> On Jul 18, 2023, at 03:06, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>> 
> >>>>>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote:
> >>>>>>>>> On Jul 18, 2023, at 00:02, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote:
> >>>>>>>>>>> On Jul 16, 2023, at 01:19, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote:
> >>>>>>>>>>>>> On Jul 15, 2023, at 07:23, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote:
> >>>>>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is
> >>>>>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(),
> >>>>>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding a
> >>>>>>>>>>>>>> reference to obj in lockless_lookup, and when the updater modifies ->next,
> >>>>>>>>>>>>>> readers can see the change immediately because of SLAB_TYPESAFE_BY_RCU.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> There are two memory orderings required in the insertion algorithm:
> >>>>>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next
> >>>>>>>>>>>>>> and obj->refcnt; atomic_set_release is not enough to provide the
> >>>>>>>>>>>>>> required memory barrier.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@xxxxxxxxx>
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> This is an interesting one!!!
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> Changelog:
> >>>>>>>>>>>>>>   v1 -> v2: Use _ONCE to protect obj->key.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>  Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++--------
> >>>>>>>>>>>>>>  1 file changed, 13 insertions(+), 8 deletions(-)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
> >>>>>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644
> >>>>>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst
> >>>>>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst
> >>>>>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type.
> >>>>>>>>>>>>>>  * reuse these object before the RCU grace period, we
> >>>>>>>>>>>>>>  * must check key after getting the reference on object
> >>>>>>>>>>>>>>  */
> >>>>>>>>>>>>>> - if (obj->key != key) { // not the object we expected
> >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected
> >>>>>>>>>>>>>>      put_ref(obj);
> >>>>>>>>>>>>>>      rcu_read_unlock();
> >>>>>>>>>>>>>>      goto begin;
> >>>>>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb())
> >>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>    struct hlist_node *node, *next;
> >>>>>>>>>>>>>>    for (pos = rcu_dereference((head)->first);
> >>>>>>>>>>>>>> -      pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
> >>>>>>>>>>>>>> +      pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next,
> >>>>>>>>>>>>> and that there were 17 more nodes to search in the list.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Then consider the following sequence of events:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	The updater deletes this same node and kmem_cache_free()s it.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	Another updater kmem_cache_alloc()s that same memory and
> >>>>>>>>>>>>> 	inserts it into an empty hash chain with a different key.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	Then lockless_lookup() fetches pos->next and sees a NULL pointer,
> >>>>>>>>>>>>> 	thus failing to search the remaining 17 nodes in the list,
> >>>>>>>>>>>>> 	one of which had the desired key value.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	The lookup algorithm resumes and sees the NULL return from
> >>>>>>>>>>>>> 	lockless_lookup(), and ends up with a NULL obj.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> And this happens even with the strongest possible ordering
> >>>>>>>>>>>>> everywhere.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> OK, yes, it is late on Friday. So what am I missing here?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> You missed nothing!
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The lockless_lookup should not be a function, but a macro like hlist_for_each_entry_rcu.
> >>>>>>>>>>> 
> >>>>>>>>>>> How would you fix this using a macro?
> >>>>>>>>>> 
> >>>>>>>>>> With additional detection code. A moved object (in another chain) will have a different slot.
> >>>>>>>>>> (I have sent patch v3.)
> >>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its
> >>>>>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls?
> >>>>>>>>>>> 
> >>>>>>>>>>> I believe that you are correct. Would you like to propose a patch, or
> >>>>>>>>>>> would you rather I put something together? My current thought is to
> >>>>>>>>>> 
> >>>>>>>>>> Feel free to add.
> >>>>>>>>>> 
> >>>>>>>>>> One thing I think would be useful is to tell readers where the 'next' is.
> >>>>>>>>>> The document mentions 'next' many times, but it's hard for me, as a reader, to realize that
> >>>>>>>>>> the 'next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.)
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken.
> >>>>>>>>>> 
> >>>>>>>>>> I think the example needs to be fixed. :)
> >>>>>>>>> 
> >>>>>>>>> Even better! I will take a look, but in the meantime, would you be
> >>>>>>>>> interested in updating the wording to explain how the back-pointer works?
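
[Archive editor's note: to make the failure scenario above concrete, here is a
lightly adapted sketch of the smp_rmb()-based lockless_lookup() under
discussion. The loop comes from the rculist_nulls.rst example (with the patch
at the top of this thread applied); the function wrapper and the comments are
added here for illustration.

	static struct obj *lockless_lookup(struct hlist_head *head, u32 key)
	{
		struct hlist_node *pos, *next;
		struct obj *obj;

		for (pos = rcu_dereference(head->first);
		     /*
		      * Window: if the object containing pos is
		      * kmem_cache_free()d here and immediately reused
		      * (SLAB_TYPESAFE_BY_RCU) as the first element of a
		      * different, possibly empty, chain, then
		      * READ_ONCE(pos->next) may return NULL or a pointer
		      * into the wrong chain. The loop then ends early, the
		      * rest of the original chain is never searched, and the
		      * caller gets a spurious NULL. No memory barrier can
		      * close this window, hence the nulls-based scheme.
		      */
		     pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
		     ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
		     pos = rcu_dereference(next))
			if (READ_ONCE(obj->key) == key)
				return obj;
		return NULL;
	}
]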
> >>>>>>>> 
> >>>>>>>> Which document needs to be updated?
> >>>>>>>> And is there anything that I can refer to? It's the first time I have ever heard about it.
> >>>>>>> 
> >>>>>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating.
> >>>>>>> 
> >>>>>>> There admittedly isn't a whole lot of commentary.
> >>>>>>> 
> >>>>>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of
> >>>>>>>>> just at the end. One advantage is the ability to detect a move mid-list,
> >>>>>>>>> though that is not a big deal in well-tuned hash tables, which tend to
> >>>>>>>>> have short hash chains. The need to move elements to the front of the
> >>>>>>>>> destination list remains, though in both cases only if it has been less
> >>>>>>>>> than a grace period since the last move.)
> >>>>>>>> 
> >>>>>>>> Looks like I need to learn it first. :)
> >>>>>>> 
> >>>>>>> Well, you wrote the code, so... ;-)
> >>>>>> 
> >>>>>> If I understand correctly, it works only for 64-bit machines?
> >>>>>> 
> >>>>>> And the number of slots of the hash table will be limited?
> >>>>> 
> >>>>> You are asking about the is_a_nulls() value? If so, it works on both
> >>>>> 32-bit and 64-bit machines. They each have enough bits for the nulls
> >>>>> value to cover all possible two-byte objects in the full address space.
> >>>>> 
> >>>>> If that wasn't what you were asking, please help me with your question.
> >>>> 
> >>>> I'm asking about the 'back-pointer', which
> >>>> 
> >>>> "Looks similar to the is_a_nulls() pointer, but in each element instead of
> >>>> just at the end"
> >>>> 
> >>>> If I understand correctly, we need to store the nulls value in the upper unused bits.
> >>>> 
> >>>> And we only have several bits unused. One advantage I can think of is that it
> >>>> will improve the performance once we have something like the Intel Upper Address Ignore (UAI),
> >>>> which also works only for 64-bit machines.
> >>> 
> >>> Ah, I was using the word "similar" very loosely. Not "implemented in
> >>> a manner similar to is_a_nulls()", but rather "serves roughly the same
> >>> function as is_a_nulls()".
> >> 
> >> I got it.
> >> 
> >>> 
> >>> You should be able to just have the pointer to the bucket in each
> >>> element, and just compare for each element. No need for extra bits,
> >> 
> >> It might have the same functionality if we put the nulls value in each element.
> > 
> > There are some strange things that are easier to do with a nulls value
> > than with an explicit pointer, but yes.
> > 
> > (For example, if for some odd reason you wanted to group the hash buckets
> > into classes within which moving a reader to another bucket was OK.
> > Why would you want to do this? I have no idea!)
> 
> Interesting example.
> 
> I'll try to update the wording to explain how the back-pointer works.
> 
> By the way, would you like a separate patch or a patch V4 containing the content of
> V3 of "docs/RCU: Bring back smp_wmb()" and the back-pointer?

A V4 containing the V3, please.

							Thanx, Paul

> >>> though UAI, like its pre-existing ARM counterpart, might also help catch
> >>> use-after-free bugs when resizing the hash table.
> >>> 
> >>> 						Thanx, Paul
> >>> 
> >>>>>>>>>>>> I didn't find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list
> >>>>>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915; the driver also doesn't use _ONCE
> >>>>>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU.
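
[Archive editor's note: a minimal sketch of the bucket back-pointer idea
discussed above ("just have the pointer to the bucket in each element, and
just compare for each element"). The 'bucket' field and this reader-side
fragment are hypothetical illustrations, not code from rculist_nulls.rst.

	struct obj {
		struct hlist_node obj_node;
		struct hlist_head *bucket; /* back-pointer: set under lock_chain() at insert */
		atomic_t refcnt;
		u32 key;
	};

	/* Reader side, while walking the chain rooted at head = &table[slot]: */
	if (READ_ONCE(obj->bucket) != head) {
		/*
		 * The object was freed and reinserted into a different chain
		 * while we were traversing. Unlike the nulls value, which is
		 * only checked at the end of the chain, this detects the move
		 * mid-list; either way, the lookup must restart.
		 */
		rcu_read_unlock();
		goto begin;
	}
]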
> >>>>>>>>>>> 
> >>>>>>>>>>> Feel free to send them a patch, though I cannot speak for their
> >>>>>>>>>>> reception of it.
> >>>>>>>>>>> 
> >>>>>>>>>>> 						Thanx, Paul
> >>>>>>>>>>> 
> >>>>>>>>>>>>> 						Thanx, Paul
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>>      ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
> >>>>>>>>>>>>>>      pos = rcu_dereference(next))
> >>>>>>>>>>>>>> -    if (obj->key == key)
> >>>>>>>>>>>>>> +    if (READ_ONCE(obj->key) == key)
> >>>>>>>>>>>>>>        return obj;
> >>>>>>>>>>>>>>    return NULL;
> >>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain.
> >>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>  obj = kmem_cache_alloc(...);
> >>>>>>>>>>>>>>  lock_chain(); // typically a spin_lock()
> >>>>>>>>>>>>>> -obj->key = key;
> >>>>>>>>>>>>>> -atomic_set_release(&obj->refcnt, 1); // key before refcnt
> >>>>>>>>>>>>>> +WRITE_ONCE(obj->key, key);
> >>>>>>>>>>>>>> +/*
> >>>>>>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next
> >>>>>>>>>>>>>> + * and obj->refcnt.
> >>>>>>>>>>>>>> + */
> >>>>>>>>>>>>>> +smp_wmb();
> >>>>>>>>>>>>>> +atomic_set(&obj->refcnt, 1);
> >>>>>>>>>>>>>>  hlist_add_head_rcu(&obj->obj_node, list);
> >>>>>>>>>>>>>>  unlock_chain(); // typically a spin_unlock()
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of
> >>>>>>>>>>>>>>  begin:
> >>>>>>>>>>>>>>  rcu_read_lock();
> >>>>>>>>>>>>>>  hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
> >>>>>>>>>>>>>> -  if (obj->key == key) {
> >>>>>>>>>>>>>> +  if (READ_ONCE(obj->key) == key) {
> >>>>>>>>>>>>>>      if (!try_get_ref(obj)) { // might fail for free objects
> >>>>>>>>>>>>>>        rcu_read_unlock();
> >>>>>>>>>>>>>>        goto begin;
> >>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>> -    if (obj->key != key) { // not the object we expected
> >>>>>>>>>>>>>> +    if (READ_ONCE(obj->key) != key) { // not the object we expected
> >>>>>>>>>>>>>>        put_ref(obj);
> >>>>>>>>>>>>>>        rcu_read_unlock();
> >>>>>>>>>>>>>>        goto begin;
> >>>>>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu().
> >>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>  obj = kmem_cache_alloc(cachep);
> >>>>>>>>>>>>>>  lock_chain(); // typically a spin_lock()
> >>>>>>>>>>>>>> -obj->key = key;
> >>>>>>>>>>>>>> +WRITE_ONCE(obj->key, key);
> >>>>>>>>>>>>>>  atomic_set_release(&obj->refcnt, 1); // key before refcnt
> >>>>>>>>>>>>>>  /*
> >>>>>>>>>>>>>>   * insert obj in RCU way (readers might be traversing chain)
> >>>>>>>>>>>>>> -- 
> >>>>>>>>>>>>>> 2.34.1
> 
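
[Archive editor's note: for context, the complete nulls-based lookup that the
hunks above modify reads roughly as follows, reconstructed from
Documentation/RCU/rculist_nulls.rst with this patch applied. The final
get_nulls_value() check is what detects that the object being traversed moved
to another chain, which the plain-hlist smp_rmb() variant cannot do.

	head = &table[slot];
	begin:
	rcu_read_lock();
	hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
		if (READ_ONCE(obj->key) == key) {
			if (!try_get_ref(obj)) { // might fail for free objects
				rcu_read_unlock();
				goto begin;
			}
			if (READ_ONCE(obj->key) != key) { // not the object we expected
				put_ref(obj);
				rcu_read_unlock();
				goto begin;
			}
			goto out;
		}
	}
	// If the nulls value at the end of the chain does not match this
	// slot's number, the object we were traversing was moved to another
	// chain mid-walk, so the lookup must restart.
	if (get_nulls_value(node) != slot)
		goto begin;
	obj = NULL;
	out:
	rcu_read_unlock();
]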