> On Jul 22, 2023, at 6:49 AM, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> 
> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote:
>> 
>>> On Jul 18, 2023, at 03:06, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>> 
>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote:
>>>>> On Jul 18, 2023, at 00:02, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote:
>>>>>>> On Jul 16, 2023, at 01:19, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote:
>>>>>>>>> On Jul 15, 2023, at 07:23, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote:
>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is
>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(),
>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding a
>>>>>>>>>> reference to obj in lockless_lookup, and when the updater modifies ->next,
>>>>>>>>>> readers can see the change immediately because of SLAB_TYPESAFE_BY_RCU.
>>>>>>>>>> 
>>>>>>>>>> There are two memory orderings required in the insertion algorithm:
>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next
>>>>>>>>>> and obj->refcnt, and atomic_set_release is not enough to provide the
>>>>>>>>>> required memory barrier.
>>>>>>>>>> 
>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@xxxxxxxxx>
>>>>>>>>> 
>>>>>>>>> This is an interesting one!!!
>>>>>>>>> 
>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices.
>>>>>>>>> 
>>>>>>>>>> ---
>>>>>>>>>> Changelog:
>>>>>>>>>>   v1 -> v2: Use _ONCE to protect obj->key.
>>>>>>>>>> 
>>>>>>>>>>  Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++--------
>>>>>>>>>>  1 file changed, 13 insertions(+), 8 deletions(-)
>>>>>>>>>> 
>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644
>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst
>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst
>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type.
>>>>>>>>>>      * reuse these object before the RCU grace period, we
>>>>>>>>>>      * must check key after getting the reference on object
>>>>>>>>>>      */
>>>>>>>>>> -    if (obj->key != key) { // not the object we expected
>>>>>>>>>> +    if (READ_ONCE(obj->key) != key) { // not the object we expected
>>>>>>>>>>        put_ref(obj);
>>>>>>>>>>        rcu_read_unlock();
>>>>>>>>>>        goto begin;
>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb())
>>>>>>>>>>    {
>>>>>>>>>>    struct hlist_node *node, *next;
>>>>>>>>>>    for (pos = rcu_dereference((head)->first);
>>>>>>>>>> -       pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
>>>>>>>>>> +       pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
>>>>>>>>> 
>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next,
>>>>>>>>> and that there were 17 more nodes to search in the list.
>>>>>>>>> 
>>>>>>>>> Then consider the following sequence of events:
>>>>>>>>> 
>>>>>>>>> o   The updater deletes this same node and kmem_cache_free()s it.
>>>>>>>>> 
>>>>>>>>> o   Another updater kmem_cache_alloc()s that same memory and
>>>>>>>>>     inserts it into an empty hash chain with a different key.
>>>>>>>>> 
>>>>>>>>> o   Then lockless_lookup() fetches pos->next and sees a NULL pointer,
>>>>>>>>>     thus failing to search the remaining 17 nodes in the list,
>>>>>>>>>     one of which had the desired key value.
>>>>>>>>> 
>>>>>>>>> o   The lookup algorithm resumes and sees the NULL return from
>>>>>>>>>     lockless_lookup(), and ends up with a NULL obj.
>>>>>>>>> 
>>>>>>>>> And this happens even with the strongest possible ordering
>>>>>>>>> everywhere.
>>>>>>>>> 
>>>>>>>>> OK, yes, it is late on Friday.  So what am I missing here?
>>>>>>>> 
>>>>>>>> You missed nothing!
>>>>>>>> 
>>>>>>>> The lockless_lookup should not be a function, but a macro like hlist_for_each_entry_rcu.
>>>>>>> 
>>>>>>> How would you fix this using a macro?
>>>>>> 
>>>>>> With additional detection code. A moved object (in another chain) will have a different slot.
>>>>>> (I have sent patch v3.)
>>>>>> 
>>>>>>> 
>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its
>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"?
>>>>>>>> 
>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls?
>>>>>>> 
>>>>>>> I believe that you are correct.  Would you like to propose a patch, or
>>>>>>> would you rather I put something together?  My current thought is to
>>>>>> 
>>>>>> Feel free to add.
>>>>>> 
>>>>>> One thing I think would be useful is to tell readers where the 'next' is.
>>>>>> The document mentions 'next' many times, but it's hard for me, as a reader, to realize that
>>>>>> the 'next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.)
>>>>>> 
>>>>>> 
>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken.
>>>>>> 
>>>>>> I think the example needs to be fixed. :)
>>>>> 
>>>>> Even better!  I will take a look, but in the meantime, would you be
>>>>> interested in updating the wording to explain how the back-pointer works?
>>>> 
>>>> Which document needs to be updated?
>>>> And is there anything that I can refer to? It's the first time I have ever heard about it.
>>> 
>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating.
>>> 
>>> There admittedly isn't a whole lot of commentary.
>>> 
>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of
>>>>> just at the end.  One advantage is the ability to detect a move mid-list,
>>>>> though that is not a big deal in well-tuned hash tables, which tend to
>>>>> have short hash chains.  The need to move elements to the front of the
>>>>> destination list remains, though in both cases only if it has been less
>>>>> than a grace period since the last move.)
>>>> 
>>>> Looks like I need to learn it first. :)
>>> 
>>> Well, you wrote the code, so... ;-)
>> 
>> If I understand correctly, it works only for 64-bit machines?
>> 
>> And the number of slots of the hash table will be limited?
> 
> You are asking about the is_a_nulls() value?  If so, it works on both
> 32-bit and 64-bit machines.  They each have enough bits for the nulls
> value to cover all possible two-byte objects in the full address space.
> 
> If that wasn't what you were asking, please help me with your question.

I'm asking about the 'back-pointer', which "Looks similar to the is_a_nulls() pointer, but in each element instead of just at the end".

If I understand correctly, we need to store the nulls value in the upper unused bits of each element, and we only have a few bits unused there.

One advantage I can think of is that it would improve performance once we have something like Intel Upper Address Ignore (UAI), which also works only on 64-bit machines.
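
To make that concrete, here is a purely illustrative user-space sketch of what I mean by storing a per-element slot value in the upper unused bits of its ->next pointer, so a reader can detect mid-list that an element was moved to another chain. None of this is existing kernel API; SLOT_BITS and the helper names are made up, and it assumes a 64-bit address space whose upper bits are otherwise unused:

  /*
   * Illustration only, not kernel code: pack the hash-table slot of the
   * chain an element lives on into otherwise-unused upper pointer bits.
   * SLOT_BITS and all helper names are invented for this sketch.
   */
  #include <stdint.h>
  #include <stdio.h>

  #define SLOT_BITS  7                      /* bits borrowed from the pointer */
  #define SLOT_SHIFT (64 - SLOT_BITS)
  #define PTR_MASK   ((UINT64_C(1) << SLOT_SHIFT) - 1)

  static inline uint64_t encode_next(const void *next, unsigned int slot)
  {
          return ((uint64_t)slot << SLOT_SHIFT) | ((uintptr_t)next & PTR_MASK);
  }

  static inline void *decode_next(uint64_t enc)
  {
          return (void *)(uintptr_t)(enc & PTR_MASK);
  }

  static inline unsigned int decode_slot(uint64_t enc)
  {
          return (unsigned int)(enc >> SLOT_SHIFT);
  }

  int main(void)
  {
          int dummy;
          uint64_t enc = encode_next(&dummy, 42);

          /*
           * A reader that hashed its key to slot 13 but finds slot 42 encoded
           * in the element it is traversing knows the element has been moved
           * to another chain and can restart its lookup mid-list.
           */
          printf("ptr=%p slot=%u\n", decode_next(enc), decode_slot(enc));
          return 0;
  }

The small SLOT_BITS budget is exactly why I said the number of hash-table slots would be limited.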
> 
>                                                         Thanx, Paul
> 
>>>                                                      Thanx, Paul
>>> 
>>>>>                                                    Thanx, Paul
>>>>> 
>>>>>>>> I didn’t find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list
>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915, the driver also doesn’t use _ONCE
>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU.
>>>>>>> 
>>>>>>> Feel free to send them a patch, though I cannot speak for their
>>>>>>> reception of it.
>>>>>>> 
>>>>>>>                                                  Thanx, Paul
>>>>>>> 
>>>>>>>>>                                                Thanx, Paul
>>>>>>>>> 
>>>>>>>>>>          ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
>>>>>>>>>>          pos = rcu_dereference(next))
>>>>>>>>>> -      if (obj->key == key)
>>>>>>>>>> +      if (READ_ONCE(obj->key) == key)
>>>>>>>>>>          return obj;
>>>>>>>>>>      return NULL;
>>>>>>>>>>    }
>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain.
>>>>>>>>>>     */
>>>>>>>>>>    obj = kmem_cache_alloc(...);
>>>>>>>>>>    lock_chain(); // typically a spin_lock()
>>>>>>>>>> -  obj->key = key;
>>>>>>>>>> -  atomic_set_release(&obj->refcnt, 1); // key before refcnt
>>>>>>>>>> +  WRITE_ONCE(obj->key, key);
>>>>>>>>>> +  /*
>>>>>>>>>> +   * We need to make sure obj->key is updated before obj->obj_node.next
>>>>>>>>>> +   * and obj->refcnt.
>>>>>>>>>> +   */
>>>>>>>>>> +  smp_wmb();
>>>>>>>>>> +  atomic_set(&obj->refcnt, 1);
>>>>>>>>>>    hlist_add_head_rcu(&obj->obj_node, list);
>>>>>>>>>>    unlock_chain(); // typically a spin_unlock()
>>>>>>>>>> 
>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of
>>>>>>>>>>    begin:
>>>>>>>>>>    rcu_read_lock();
>>>>>>>>>>    hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
>>>>>>>>>> -    if (obj->key == key) {
>>>>>>>>>> +    if (READ_ONCE(obj->key) == key) {
>>>>>>>>>>        if (!try_get_ref(obj)) { // might fail for free objects
>>>>>>>>>>          rcu_read_unlock();
>>>>>>>>>>          goto begin;
>>>>>>>>>>        }
>>>>>>>>>> -      if (obj->key != key) { // not the object we expected
>>>>>>>>>> +      if (READ_ONCE(obj->key) != key) { // not the object we expected
>>>>>>>>>>          put_ref(obj);
>>>>>>>>>>          rcu_read_unlock();
>>>>>>>>>>          goto begin;
>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu().
>>>>>>>>>>     */
>>>>>>>>>>    obj = kmem_cache_alloc(cachep);
>>>>>>>>>>    lock_chain(); // typically a spin_lock()
>>>>>>>>>> -  obj->key = key;
>>>>>>>>>> +  WRITE_ONCE(obj->key, key);
>>>>>>>>>>    atomic_set_release(&obj->refcnt, 1); // key before refcnt
>>>>>>>>>>    /*
>>>>>>>>>>     * insert obj in RCU way (readers might be traversing chain)
>>>>>>>>>> -- 
>>>>>>>>>> 2.34.1
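
P.S. For anyone reading along without the file open: the detection I mentioned above ("a moved object (in another chain) will have a different slot") is the nulls re-check at the end of the hlist_nulls lookup, which the quoted hunks do not show. Roughly, paraphrasing the existing example in rculist_nulls.rst (not standalone code; 'slot' is the bucket index the lookup computed from the key):

  /* Fragment of the documented hlist_nulls lookup, paraphrased. */
  head = &table[slot];
  begin:
  rcu_read_lock();
  hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
    if (READ_ONCE(obj->key) == key) {
      if (!try_get_ref(obj)) { // might fail for free objects
        rcu_read_unlock();
        goto begin;
      }
      if (READ_ONCE(obj->key) != key) { // not the object we expected
        put_ref(obj);
        rcu_read_unlock();
        goto begin;
      }
      goto out;
    }
  }
  /*
   * If the nulls value found at the end of this lookup is not the
   * expected one, an item we walked was moved to another chain while
   * we were traversing it, so the whole lookup must be restarted.
   */
  if (get_nulls_value(node) != slot)
    goto begin;
  obj = NULL;
  out:
  rcu_read_unlock();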