On Thu, Aug 03, 2023 at 09:50:35PM +0800, Alan Huang wrote:
> 
> > On Aug 1, 2023, at 04:06, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > 
> > On Sat, Jul 29, 2023 at 11:17:31PM +0800, Alan Huang wrote:
> >> 
> >>> On Jul 25, 2023, at 00:09, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>> 
> >>> On Sat, Jul 22, 2023 at 09:02:46AM +0800, Alan Huang wrote:
> >>>> 
> >>>>> On Jul 22, 2023, at 06:49, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>> 
> >>>>> On Wed, Jul 19, 2023 at 12:08:33AM +0800, Alan Huang wrote:
> >>>>>> 
> >>>>>>> On Jul 18, 2023, at 03:06, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>> 
> >>>>>>> On Tue, Jul 18, 2023 at 01:53:10AM +0800, Alan Huang wrote:
> >>>>>>>>> On Jul 18, 2023, at 00:02, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>>>> On Sun, Jul 16, 2023 at 07:21:28PM +0800, Alan Huang wrote:
> >>>>>>>>>>> On Jul 16, 2023, at 01:19, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>>>>>> On Sat, Jul 15, 2023 at 08:50:23AM +0800, Alan Huang wrote:
> >>>>>>>>>>>>> On Jul 15, 2023, at 07:23, Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >>>>>>>>>>>>> On Tue, Jul 11, 2023 at 03:09:06PM +0000, Alan Huang wrote:
> >>>>>>>>>>>>>> The objects are allocated with SLAB_TYPESAFE_BY_RCU, and there is
> >>>>>>>>>>>>>> n->next = first within hlist_add_head_rcu() before rcu_assign_pointer(),
> >>>>>>>>>>>>>> which modifies obj->obj_node.next. There may be readers holding a
> >>>>>>>>>>>>>> reference to obj in lockless_lookup, and when the updater modifies ->next,
> >>>>>>>>>>>>>> readers can see the change immediately because of SLAB_TYPESAFE_BY_RCU.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> There are two memory orderings required in the insertion algorithm:
> >>>>>>>>>>>>>> we need to make sure obj->key is updated before obj->obj_node.next
> >>>>>>>>>>>>>> and obj->refcnt; atomic_set_release is not enough to provide the
> >>>>>>>>>>>>>> required memory barrier.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> Signed-off-by: Alan Huang <mmpgouride@xxxxxxxxx>
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> This is an interesting one!!!
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Now I am having a hard time believing that the smp_rmb() suffices.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>> Changelog:
> >>>>>>>>>>>>>>   v1 -> v2: Use _ONCE to protect obj->key.
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>>  Documentation/RCU/rculist_nulls.rst | 21 +++++++++++++--------
> >>>>>>>>>>>>>>  1 file changed, 13 insertions(+), 8 deletions(-)
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> diff --git a/Documentation/RCU/rculist_nulls.rst b/Documentation/RCU/rculist_nulls.rst
> >>>>>>>>>>>>>> index 21e40fcc08de..2a9f5a63d334 100644
> >>>>>>>>>>>>>> --- a/Documentation/RCU/rculist_nulls.rst
> >>>>>>>>>>>>>> +++ b/Documentation/RCU/rculist_nulls.rst
> >>>>>>>>>>>>>> @@ -47,7 +47,7 @@ objects, which is having below type.
> >>>>>>>>>>>>>>  * reuse these object before the RCU grace period, we
> >>>>>>>>>>>>>>  * must check key after getting the reference on object
> >>>>>>>>>>>>>>  */
> >>>>>>>>>>>>>> - if (obj->key != key) { // not the object we expected
> >>>>>>>>>>>>>> + if (READ_ONCE(obj->key) != key) { // not the object we expected
> >>>>>>>>>>>>>>      put_ref(obj);
> >>>>>>>>>>>>>>      rcu_read_unlock();
> >>>>>>>>>>>>>>      goto begin;
> >>>>>>>>>>>>>> @@ -64,10 +64,10 @@ but a version with an additional memory barrier (smp_rmb())
> >>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>    struct hlist_node *node, *next;
> >>>>>>>>>>>>>>    for (pos = rcu_dereference((head)->first);
> >>>>>>>>>>>>>> -      pos && ({ next = pos->next; smp_rmb(); prefetch(next); 1; }) &&
> >>>>>>>>>>>>>> +      pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Suppose that lockless_lookup() is delayed just before fetching pos->next,
> >>>>>>>>>>>>> and that there were 17 more nodes to search in the list.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> Then consider the following sequence of events:
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	The updater deletes this same node and kmem_cache_free()s it.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	Another updater kmem_cache_alloc()s that same memory and
> >>>>>>>>>>>>> 	inserts it into an empty hash chain with a different key.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	Then lockless_lookup() fetches pos->next and sees a NULL pointer,
> >>>>>>>>>>>>> 	thus failing to search the remaining 17 nodes in the list,
> >>>>>>>>>>>>> 	one of which had the desired key value.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> o	The lookup algorithm resumes and sees the NULL return from
> >>>>>>>>>>>>> 	lockless_lookup(), and ends up with a NULL obj.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> And this happens even with the strongest possible ordering
> >>>>>>>>>>>>> everywhere.
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>> OK, yes, it is late on Friday. So what am I missing here?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> You missed nothing!
> >>>>>>>>>>>> 
> >>>>>>>>>>>> The lockless_lookup should not be a function, but a macro like hlist_for_each_entry_rcu.
> >>>>>>>>>>> 
> >>>>>>>>>>> How would you fix this using a macro?
> >>>>>>>>>> 
> >>>>>>>>>> With additional detection code. A moved object (in another chain) will have a different slot.
> >>>>>>>>>> (I have sent patch v3.)
> >>>>>>>>>> 
> >>>>>>>>>>> 
> >>>>>>>>>>>>> Independent of that, does hlist_add_head_rcu() need to replace its
> >>>>>>>>>>>>> "n->next = first" with "WRITE_ONCE(n->next, first)"?
> >>>>>>>>>>>> 
> >>>>>>>>>>>> I think users who want to use hlist with SLAB_TYPESAFE_BY_RCU should use rculist_nulls?
> >>>>>>>>>>> 
> >>>>>>>>>>> I believe that you are correct. Would you like to propose a patch, or
> >>>>>>>>>>> would you rather I put something together? My current thought is to
> >>>>>>>>>> 
> >>>>>>>>>> Feel free to add.
> >>>>>>>>>> 
> >>>>>>>>>> One thing I think would be useful is to tell readers where the 'next' is.
> >>>>>>>>>> The document mentions 'next' many times, but it's hard for me, as a reader, to realize that
> >>>>>>>>>> the 'next' is within hlist_add_head_rcu(). (I have no idea where to put the hint.)
> >>>>>>>>>> 
> >>>>>>>>>> 
> >>>>>>>>>>> keep the examples, but to show why the one with smp_rmb() is broken.
> >>>>>>>>>> 
> >>>>>>>>>> I think the example needs to be fixed. :)
> >>>>>>>>> 
> >>>>>>>>> Even better! I will take a look, but in the meantime, would you be
> >>>>>>>>> interested in updating the wording to explain how the back-pointer works?
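
[Archive editor's note: to make the failure scenario above concrete, here is a
lightly adapted sketch of the smp_rmb()-based lockless_lookup() under
discussion. The loop comes from the rculist_nulls.rst example (with the patch
at the top of this thread applied); the function wrapper and the comments are
added here for illustration.

	static struct obj *lockless_lookup(struct hlist_head *head, u32 key)
	{
		struct hlist_node *pos, *next;
		struct obj *obj;

		for (pos = rcu_dereference(head->first);
		     /*
		      * Window: if the object containing pos is
		      * kmem_cache_free()d here and immediately reused
		      * (SLAB_TYPESAFE_BY_RCU) as the first element of a
		      * different, possibly empty, chain, then
		      * READ_ONCE(pos->next) may return NULL or a pointer
		      * into the wrong chain. The loop then ends early, the
		      * rest of the original chain is never searched, and the
		      * caller gets a spurious NULL. No memory barrier can
		      * close this window, hence the nulls-based scheme.
		      */
		     pos && ({ next = READ_ONCE(pos->next); smp_rmb(); prefetch(next); 1; }) &&
		     ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
		     pos = rcu_dereference(next))
			if (READ_ONCE(obj->key) == key)
				return obj;
		return NULL;
	}
]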
> >>>>>>>> 
> >>>>>>>> Which document needs to be updated?
> >>>>>>>> And is there anything that I can refer to? It's the first time I have ever heard about it.
> >>>>>>> 
> >>>>>>> Documentation/RCU/rculist_nulls.rst, the one that you are updating.
> >>>>>>> 
> >>>>>>> There admittedly isn't a whole lot of commentary.
> >>>>>>> 
> >>>>>>>>> (Looks similar to the is_a_nulls() pointer, but in each element instead of
> >>>>>>>>> just at the end. One advantage is the ability to detect a move mid-list,
> >>>>>>>>> though that is not a big deal in well-tuned hash tables, which tend to
> >>>>>>>>> have short hash chains. The need to move elements to the front of the
> >>>>>>>>> destination list remains, though in both cases only if it has been less
> >>>>>>>>> than a grace period since the last move.)
> >>>>>>>> 
> >>>>>>>> Looks like I need to learn it first. :)
> >>>>>>> 
> >>>>>>> Well, you wrote the code, so... ;-)
> >>>>>> 
> >>>>>> If I understand correctly, it works only for 64-bit machines?
> >>>>>> 
> >>>>>> And the number of slots of the hash table will be limited?
> >>>>> 
> >>>>> You are asking about the is_a_nulls() value? If so, it works on both
> >>>>> 32-bit and 64-bit machines. They each have enough bits for the nulls
> >>>>> value to cover all possible two-byte objects in the full address space.
> >>>>> 
> >>>>> If that wasn't what you were asking, please help me with your question.
> >>>> 
> >>>> I'm asking about the 'back-pointer', which
> >>>> 
> >>>> "Looks similar to the is_a_nulls() pointer, but in each element instead of
> >>>> just at the end"
> >>>> 
> >>>> If I understand correctly, we need to store the nulls value in the upper unused bits.
> >>>> 
> >>>> And we only have several bits unused. One advantage I can think of is that it
> >>>> will improve the performance once we have something like the Intel Upper Address Ignore (UAI),
> >>>> which also works only for 64-bit machines.
> >>> 
> >>> Ah, I was using the word "similar" very loosely. Not "implemented in
> >>> a manner similar to is_a_nulls()", but rather "serves roughly the same
> >>> function as is_a_nulls()".
> >> 
> >> I got it.
> >> 
> >>> 
> >>> You should be able to just have the pointer to the bucket in each
> >>> element, and just compare for each element. No need for extra bits,
> >> 
> >> It might have the same functionality if we put the nulls value in each element.
> > 
> > There are some strange things that are easier to do with a nulls value
> > than with an explicit pointer, but yes.
> > 
> > (For example, if for some odd reason you wanted to group the hash buckets
> > into classes within which moving a reader to another bucket was OK.
> > Why would you want to do this? I have no idea!)
> 
> Interesting example.
> 
> I'll try to update the wording to explain how the back-pointer works.
> 
> By the way, would you like a separate patch or a patch V4 containing the content of
> V3 of "docs/RCU: Bring back smp_wmb()" and the back-pointer?

A V4 containing the V3, please.

							Thanx, Paul

> >>> though UAI, like its pre-existing ARM counterpart, might also help catch
> >>> use-after-free bugs when resizing the hash table.
> >>> 
> >>> 						Thanx, Paul
> >>> 
> >>>>>>>>>>>> I didn't find a case using hlist with SLAB_TYPESAFE_BY_RCU, but I did find a case using list
> >>>>>>>>>>>> with SLAB_TYPESAFE_BY_RCU in drivers/gpu/drm/i915; the driver also doesn't use _ONCE
> >>>>>>>>>>>> on the fields of the objects allocated with SLAB_TYPESAFE_BY_RCU.
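
[Archive editor's note: a minimal sketch of the bucket back-pointer idea
discussed above ("just have the pointer to the bucket in each element, and
just compare for each element"). The 'bucket' field and this reader-side
fragment are hypothetical illustrations, not code from rculist_nulls.rst.

	struct obj {
		struct hlist_node obj_node;
		struct hlist_head *bucket; /* back-pointer: set under lock_chain() at insert */
		atomic_t refcnt;
		u32 key;
	};

	/* Reader side, while walking the chain rooted at head = &table[slot]: */
	if (READ_ONCE(obj->bucket) != head) {
		/*
		 * The object was freed and reinserted into a different chain
		 * while we were traversing. Unlike the nulls value, which is
		 * only checked at the end of the chain, this detects the move
		 * mid-list; either way, the lookup must restart.
		 */
		rcu_read_unlock();
		goto begin;
	}
]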
> >>>>>>>>>>> 
> >>>>>>>>>>> Feel free to send them a patch, though I cannot speak for their
> >>>>>>>>>>> reception of it.
> >>>>>>>>>>> 
> >>>>>>>>>>> 						Thanx, Paul
> >>>>>>>>>>> 
> >>>>>>>>>>>>> 						Thanx, Paul
> >>>>>>>>>>>>> 
> >>>>>>>>>>>>>>      ({ obj = hlist_entry(pos, typeof(*obj), obj_node); 1; });
> >>>>>>>>>>>>>>      pos = rcu_dereference(next))
> >>>>>>>>>>>>>> -    if (obj->key == key)
> >>>>>>>>>>>>>> +    if (READ_ONCE(obj->key) == key)
> >>>>>>>>>>>>>>        return obj;
> >>>>>>>>>>>>>>    return NULL;
> >>>>>>>>>>>>>>  }
> >>>>>>>>>>>>>> @@ -111,8 +111,13 @@ detect the fact that it missed following items in original chain.
> >>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>  obj = kmem_cache_alloc(...);
> >>>>>>>>>>>>>>  lock_chain(); // typically a spin_lock()
> >>>>>>>>>>>>>> -obj->key = key;
> >>>>>>>>>>>>>> -atomic_set_release(&obj->refcnt, 1); // key before refcnt
> >>>>>>>>>>>>>> +WRITE_ONCE(obj->key, key);
> >>>>>>>>>>>>>> +/*
> >>>>>>>>>>>>>> + * We need to make sure obj->key is updated before obj->obj_node.next
> >>>>>>>>>>>>>> + * and obj->refcnt.
> >>>>>>>>>>>>>> + */
> >>>>>>>>>>>>>> +smp_wmb();
> >>>>>>>>>>>>>> +atomic_set(&obj->refcnt, 1);
> >>>>>>>>>>>>>>  hlist_add_head_rcu(&obj->obj_node, list);
> >>>>>>>>>>>>>>  unlock_chain(); // typically a spin_unlock()
> >>>>>>>>>>>>>> 
> >>>>>>>>>>>>>> @@ -165,12 +170,12 @@ Note that using hlist_nulls means the type of 'obj_node' field of
> >>>>>>>>>>>>>>  begin:
> >>>>>>>>>>>>>>  rcu_read_lock();
> >>>>>>>>>>>>>>  hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
> >>>>>>>>>>>>>> -  if (obj->key == key) {
> >>>>>>>>>>>>>> +  if (READ_ONCE(obj->key) == key) {
> >>>>>>>>>>>>>>      if (!try_get_ref(obj)) { // might fail for free objects
> >>>>>>>>>>>>>>        rcu_read_unlock();
> >>>>>>>>>>>>>>        goto begin;
> >>>>>>>>>>>>>>      }
> >>>>>>>>>>>>>> -    if (obj->key != key) { // not the object we expected
> >>>>>>>>>>>>>> +    if (READ_ONCE(obj->key) != key) { // not the object we expected
> >>>>>>>>>>>>>>        put_ref(obj);
> >>>>>>>>>>>>>>        rcu_read_unlock();
> >>>>>>>>>>>>>>        goto begin;
> >>>>>>>>>>>>>> @@ -206,7 +211,7 @@ hlist_add_head_rcu().
> >>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>  obj = kmem_cache_alloc(cachep);
> >>>>>>>>>>>>>>  lock_chain(); // typically a spin_lock()
> >>>>>>>>>>>>>> -obj->key = key;
> >>>>>>>>>>>>>> +WRITE_ONCE(obj->key, key);
> >>>>>>>>>>>>>>  atomic_set_release(&obj->refcnt, 1); // key before refcnt
> >>>>>>>>>>>>>>  /*
> >>>>>>>>>>>>>>   * insert obj in RCU way (readers might be traversing chain)
> >>>>>>>>>>>>>> -- 
> >>>>>>>>>>>>>> 2.34.1
> 
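
[Archive editor's note: for context, the complete nulls-based lookup that the
hunks above modify reads roughly as follows, reconstructed from
Documentation/RCU/rculist_nulls.rst with this patch applied. The final
get_nulls_value() check is what detects that the object being traversed moved
to another chain, which the plain-hlist smp_rmb() variant cannot do.

	head = &table[slot];
	begin:
	rcu_read_lock();
	hlist_nulls_for_each_entry_rcu(obj, node, head, obj_node) {
		if (READ_ONCE(obj->key) == key) {
			if (!try_get_ref(obj)) { // might fail for free objects
				rcu_read_unlock();
				goto begin;
			}
			if (READ_ONCE(obj->key) != key) { // not the object we expected
				put_ref(obj);
				rcu_read_unlock();
				goto begin;
			}
			goto out;
		}
	}
	// If the nulls value at the end of the chain does not match this
	// slot's number, the object we were traversing was moved to another
	// chain mid-walk, so the lookup must restart.
	if (get_nulls_value(node) != slot)
		goto begin;
	obj = NULL;
	out:
	rcu_read_unlock();
]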