On 11/21/24 7:29 AM, Pavel Begunkov wrote:
> On 11/21/24 00:52, David Wei wrote:
>> On 2024-11-20 15:56, Pavel Begunkov wrote:
>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>> One thing that is not so nice is that now we have this handling and
>>> checks in the hot path, and __io_run_local_work_loop() most likely
>>> gets uninlined.
>>>
>>> I wonder, can we just requeue it via task_work again? We could even
>>> add a variant that efficiently adds a whole list instead of a single
>>> entry, i.e. local_task_work_add(head, tail, ...);
>>
>> That was an early idea, but it means re-reversing the list and then
>> atomically adding each node back to work_llist, concurrently with e.g.
>> io_req_local_work_add().
>>
>> Using a separate retry_llist means we don't need to concurrently add
>> to either retry_llist or work_llist.
>>
>>> I'm also curious what's the use case you've got that is hitting
>>> the problem?
>>
>> There is a Memcache-like workload that does load shedding based on the
>> time spent doing work. With epoll, the work of reading sockets and
>> processing a request is done by the user, who can decide after some
>> amount of time to drop the remaining work if it is taking too long.
>> With io_uring, the work of reading sockets is done eagerly inside of
>> task work. If there is a burst of work, then so much time is spent in
>> task work reading from sockets that, by the time control returns to
>> the user, the timeout has already elapsed.
>
> Interesting. It also sounds like instead of an arbitrary 20 we
> might want the user to feed it to us. Might be easier to do it
> with the bpf toy rather than carve out another argument.

David and I did discuss that, and I was not in favor of having an extra
argument. We really just need some kind of limit to prevent it from
over-running. Arguably that should always be min_events, which we
already have, but that kind of runs afoul of applications just doing
io_uring_wait_cqe() and hence asking for 1. That's why the hand-wavy
number exists, which is really no different from other hand-wavy
numbers we use to limit the running of "something" - e.g. other kinds
of retries.

Adding another argument for this just doubles the wait logic complexity
again in terms of the API. If it's needed down the line for whatever
reason, then yeah, we can certainly do it, probably via the wait
regions. But adding it to the generic wait path would be a mistake,
imho.

I also strongly suspect this is the last we'll ever hear of this, and
for that reason alone I don't think it's worth any kind of extra
arguments or added complexity.

-- 
Jens Axboe
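
For reference, a minimal sketch of the two wait patterns mentioned
above, assuming liburing; the helper functions are illustrative only
and not taken from the thread or the patch under discussion:

    #include <liburing.h>

    /* Common case: wait for a single completion. If the local task
     * work cap were tied to min_events, this pattern would pin the
     * cap at 1. */
    static int wait_one(struct io_uring *ring)
    {
            struct io_uring_cqe *cqe;
            int ret = io_uring_wait_cqe(ring, &cqe);

            if (ret < 0)
                    return ret;
            /* process the completion... */
            io_uring_cqe_seen(ring, cqe);
            return 0;
    }

    /* Batched case: an application waiting for several completions
     * already expresses a larger budget via wait_nr. */
    static int wait_batch(struct io_uring *ring, unsigned int nr)
    {
            return io_uring_submit_and_wait(ring, nr);
    }

The first pattern is why min_events alone can't serve as the task work
run limit: most applications wait for one CQE at a time, so the
effective budget would always be 1.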