On 11/21/24 7:58 AM, Pavel Begunkov wrote:
> On 11/21/24 14:34, Jens Axboe wrote:
>> On 11/21/24 7:29 AM, Pavel Begunkov wrote:
>>> On 11/21/24 00:52, David Wei wrote:
>>>> On 2024-11-20 15:56, Pavel Begunkov wrote:
>>>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>>> There is a Memcache-like workload that has load shedding based on the
>>>> time spent doing work. With epoll, the work of reading sockets and
>>>> processing a request is done by user, which can decide after some amount
>>>> of time to drop the remaining work if it takes too long. With io_uring,
>>>> the work of reading sockets is done eagerly inside of task work. If
>>>> there is a burst of work, then so much time is spent in task work
>>>> reading from sockets that, by the time control returns to user, the
>>>> timeout has already elapsed.
>>>
>>> Interesting, it also sounds like instead of an arbitrary 20 we
>>> might want the user to feed it to us. Might be easier to do it
>>> with the bpf toy not to carve another argument.
>>
>> David and I did discuss that, and I was not in favor of having an extra
>> argument. We really just need some kind of limit to prevent it
>> over-running. Arguably that should always be min_events, which we
>> already have, but that kind of runs afoul of applications just doing
>> io_uring_wait_cqe() and hence asking for 1. That's why the hand-wavy
>> number exists, which is really no different than other hand-wavy numbers
>> we have to limit running of "something", e.g. other kinds of retries.
>>
>> Adding another argument to this just again doubles wait logic complexity
>> in terms of the API. If it's needed down the line for whatever reason,
>> then yeah we can certainly do it, probably via the wait regions. But
>> adding it to the generic wait path would be a mistake imho.
>
> Right, I don't like the idea of a wait argument either, messy and
> too advanced of a tuning for it. BPF would be fine as it could be
> made as a hint and easily removed if needed. It could also be a per
> ring REGISTER based hint, a bit worse but with the same deprecation
> argument. Obviously would need a good user / use case first.

Sure, I agree on that. If you already have BPF integration for other
things, then yeah you could add tighter control for it. Even with that,
I doubt it'd be useful or meaningful really. The current case seems like
a worst-case kind of thing; recv task_work is arguably one of the more
expensive things that can be done out of task_work.

We can revisit down the line if it ever becomes needed.

-- 
Jens Axboe
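The effect of bounding task_work runs during a wait, as discussed in this thread, can be illustrated with a toy model. This is a hedged sketch, not io_uring's implementation: `struct ring_model`, `run_task_work_capped()`, `simulate_wait_loop()` and `MAX_TW_PER_PASS` are hypothetical names, each pending item is assumed to cost one time unit, and the user is assumed to check its time budget only when control returns from the wait. min_events and the real wait path are not modeled.

```c
#include <assert.h>

/* Hypothetical cap standing in for the "arbitrary 20" from the thread. */
#define MAX_TW_PER_PASS 20

/* Toy model of a ring: it only counts pending task_work items. */
struct ring_model {
    int pending_tw; /* e.g. socket reads queued as task_work */
};

/* Run at most `cap` pending items, then give control back to the caller. */
static int run_task_work_capped(struct ring_model *r, int cap)
{
    int ran = 0;

    assert(cap > 0);
    while (r->pending_tw > 0 && ran < cap) {
        r->pending_tw--; /* one item, costing one time unit in this model */
        ran++;
    }
    return ran;
}

/*
 * Toy user loop: after each return from the wait, compare elapsed time
 * against the budget. Once the budget is spent, shed (drop) whatever
 * work is still pending. Returns total elapsed time units and stores
 * the number of shed items in *shed.
 */
static int simulate_wait_loop(int burst, int cap, int budget, int *shed)
{
    struct ring_model r = { .pending_tw = burst };
    int elapsed = 0;

    *shed = 0;
    while (r.pending_tw > 0) {
        elapsed += run_task_work_capped(&r, cap);
        if (elapsed >= budget) {
            *shed = r.pending_tw; /* load shedding: drop the rest */
            r.pending_tw = 0;
        }
    }
    return elapsed;
}
```

In this model, a burst of 100 items with a budget of 50 units and no effective cap (cap = 100) returns control only after 100 units, well past the deadline, with nothing left to shed. With a cap of 20, control comes back every 20 units, the deadline is noticed at 60 units, and the remaining 40 items are shed, so the overrun is bounded by the cap rather than by the burst size.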