On 8/12/24 2:15 PM, Olivier Langlois wrote:
> On Mon, 2024-08-12 at 12:11 -0600, Jens Axboe wrote:
>> On 8/12/24 12:10 PM, Jens Axboe wrote:
>>> On 8/11/24 7:00 PM, Olivier Langlois wrote:
>>>> On Sun, 2024-08-11 at 20:34 -0400, Olivier Langlois wrote:
>>>>> io_napi_entry() has 2 calling sites. One of them is unlikely to
>>>>> find an entry, and if it does, the timeout arguably should not
>>>>> be updated.
>>>>>
>>>>> The other io_napi_entry() calling site overwrites the update
>>>>> made by io_napi_entry(), so the io_napi_entry() timeout update
>>>>> has little or no value and is therefore removed.
>>>>>
>>>>> Signed-off-by: Olivier Langlois <olivier@xxxxxxxxxxxxxx>
>>>>> ---
>>>>>  io_uring/napi.c | 1 -
>>>>>  1 file changed, 1 deletion(-)
>>>>>
>>>>> diff --git a/io_uring/napi.c b/io_uring/napi.c
>>>>> index 73c4159e8405..1de1d4d62925 100644
>>>>> --- a/io_uring/napi.c
>>>>> +++ b/io_uring/napi.c
>>>>> @@ -26,7 +26,6 @@ static struct io_napi_entry *io_napi_hash_find(struct hlist_head *hash_list,
>>>>>  	hlist_for_each_entry_rcu(e, hash_list, node) {
>>>>>  		if (e->napi_id != napi_id)
>>>>>  			continue;
>>>>> -		e->timeout = jiffies + NAPI_TIMEOUT;
>>>>>  		return e;
>>>>>  	}
>>>>>
>>>> I am commenting on my own patch because I found something curious
>>>> that I was not sure about when I was reviewing the code.
>>>>
>>>> Should the remaining e->timeout assignment be wrapped with a
>>>> WRITE_ONCE() macro to ensure an atomic store?
>>>
>>> I think that makes sense to do, as the lookup can be within RCU,
>>> and hence we have nothing serializing it. Not for torn writes, but
>>> to ensure that the memory sanitizer doesn't complain. I can just
>>> make this change while applying, or send a v2.
>>
>> As a separate patch I mean, not a v2. That part can wait until 6.12.
>>
> ok. np. I'll look into it soon.
>
> In the meantime, I have detected a few suspicious things in the napi
> code.
>
> I am reporting them here to have a few extra eyeballs looking at
> them, to check whether everything is fine or not.
>
> 1. in __io_napi_remove_stale(),
>
> is it ok to use hash_for_each() instead of hash_for_each_safe()?
>
> it might be ok because it is a hash_del_rcu() and not a simple
> hash_del(), but I have little experience with possible RCU shortcuts,
> so I am unsure on this one...

It should use hash_for_each_rcu(), and I think for good measure, we
should just keep it inside the RCU region. Then we know for a fact
that the deletion doesn't run.

> 2. in io_napi_free(),
>
> list_del(&e->list); is not called. Can the only reason be that
> io_napi_free() is called as part of the ring destruction, so it is
> an optimization to not clear the list since it is not expected to be
> reused?
>
> would calling INIT_LIST_HEAD() before exiting, as an extra
> precaution to make the function future proof in case it is reused in
> another context than the ring destruction, be a good idea?

I think that's just an oversight, and it doesn't matter since it's
all going away anyway. But it would be prudent to delete it
regardless!

> 3. I am surprised to notice that in __io_napi_do_busy_loop(),
> list_for_each_entry_rcu() is called to traverse the list, but the
> regular methods list_del() and list_add_tail() are called to update
> the list instead of their RCU variants.

They should all just use the RCU variants. Here's a mashup of the
changes. Would be great if you can test - I'll do some too, but it's
always good to have more than one person testing as it tends to hit
more cases.
diff --git a/io_uring/napi.c b/io_uring/napi.c
index d0cf694d0172..6251111a7e1f 100644
--- a/io_uring/napi.c
+++ b/io_uring/napi.c
@@ -81,7 +81,7 @@ void __io_napi_add(struct io_ring_ctx *ctx, struct socket *sock)
 	}
 
 	hlist_add_tail_rcu(&e->node, hash_list);
-	list_add_tail(&e->list, &ctx->napi_list);
+	list_add_tail_rcu(&e->list, &ctx->napi_list);
 	spin_unlock(&ctx->napi_lock);
 }
 
@@ -91,9 +91,9 @@ static void __io_napi_remove_stale(struct io_ring_ctx *ctx)
 	unsigned int i;
 
 	spin_lock(&ctx->napi_lock);
-	hash_for_each(ctx->napi_ht, i, e, node) {
+	hash_for_each_rcu(ctx->napi_ht, i, e, node) {
 		if (time_after(jiffies, e->timeout)) {
-			list_del(&e->list);
+			list_del_rcu(&e->list);
 			hash_del_rcu(&e->node);
 			kfree_rcu(e, rcu);
 		}
@@ -174,9 +174,8 @@ static void io_napi_blocking_busy_loop(struct io_ring_ctx *ctx,
 	do {
 		is_stale = __io_napi_do_busy_loop(ctx, loop_end_arg);
 	} while (!io_napi_busy_loop_should_end(iowq, start_time) && !loop_end_arg);
-	rcu_read_unlock();
-
 	io_napi_remove_stale(ctx, is_stale);
+	rcu_read_unlock();
 }
 
 /*
@@ -209,6 +208,7 @@ void io_napi_free(struct io_ring_ctx *ctx)
 	spin_lock(&ctx->napi_lock);
 	hash_for_each(ctx->napi_ht, i, e, node) {
 		hash_del_rcu(&e->node);
+		list_del_rcu(&e->list);
 		kfree_rcu(e, rcu);
 	}
 	spin_unlock(&ctx->napi_lock);
@@ -309,9 +309,8 @@ int io_napi_sqpoll_busy_poll(struct io_ring_ctx *ctx)
 
 	rcu_read_lock();
 	is_stale = __io_napi_do_busy_loop(ctx, NULL);
-	rcu_read_unlock();
-
 	io_napi_remove_stale(ctx, is_stale);
+	rcu_read_unlock();
 
 	return 1;
 }

-- 
Jens Axboe