On 2024-11-21 07:07, Pavel Begunkov wrote:
> On 11/21/24 14:31, Jens Axboe wrote:
>> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>>> On 11/21/24 01:12, Jens Axboe wrote:
>>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>>> I think that can only work if we change work_llist to be a regular list
>>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>>
>>> Dylan once measured the overhead of locks vs atomics in this
>>> path for some artificial case, we can pull the numbers up.
>>
>> I did it more recently if you'll remember, actually posted a patch I
>> think a few months ago changing it to that. But even that approach adds
>
> Right, and it'd be a separate topic from this set.
>
>> extra overhead, if you want to add it to the same list as now you need
>
> Extra overhead in the retry path, which is not the hot path,
> and how cold it really is is uncertain.
>
>> to re-grab (and re-disable interrupts) the lock to add it back. My gut
>> says that would be _worse_ than the current approach. And if you keep a
>> separate list instead, well then you're back to identical overhead in
>> terms of now needing to check both when needing to know if anything is
>> pending, and checking both when running it.
>>
>>>> reordered, and then you're spending extra cycles on potentially
>>>> reordering all the entries again.
>>>
>>> That sucks, I agree, but then it's the same question of how often
>>> it happens.
>>
>> At least for now, there's a real issue reported and we should fix it. I
>> think the current patches are fine in that regard. That doesn't mean we
>> can't potentially make it better, we should certainly investigate that.
>> But I don't see the current patches as being suboptimal really, they are
>> definitely good enough as-is for solving the issue.
>
> That's fair enough, but I still would love to know how frequent
> it is. There is no purpose in optimising it as a hot/slow path if
> it triggers every fifth run or such. David, how easy is it to
> get some stats? We can hack up some bpftrace script.
>

Here is a sample distribution of how many task work items are done per
__io_run_local_work() invocation:

@work_done:
[1]              15385954 |@                                                   |
[2, 4)           33424809 |@@@@                                                |
[4, 8)          196055270 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[8, 16)         419060191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16, 32)         48395043 |@@@@@@                                              |
[32, 64)          1573469 |                                                    |
[64, 128)           98151 |                                                    |
[128, 256)          14288 |                                                    |
[256, 512)           2035 |                                                    |
[512, 1K)             268 |                                                    |
[1K, 2K)               13 |                                                    |

This workload had wait_nr set to 20 and the timeout set to 500 µs.
Empirically, I know that any invocation that runs more than 50 task work
items will violate the latency limit for this workload, and in those
cases all of the requests must be dropped. So even if excessive task work
happens only a small percentage of the time, its impact is far larger
than that percentage suggests.
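
For reference, a minimal bpftrace sketch of the kind of script that could
produce the histogram above (assumptions: __io_run_local_work() is not
inlined and so is visible to kprobes via kallsyms, and its return value is
the number of task work items the call processed):

// Log2 histogram of task work items run per __io_run_local_work() call.
// retval is the probed function's return value; hist() buckets it and
// bpftrace prints the @work_done map when the script exits (ctrl-C).
kretprobe:__io_run_local_work
{
	@work_done = hist(retval);
}

The same thing fits as a one-liner while the workload is running:

bpftrace -e 'kretprobe:__io_run_local_work { @work_done = hist(retval); }'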