On 11/21/24 17:53, David Wei wrote:
> On 2024-11-21 07:07, Pavel Begunkov wrote:
>> On 11/21/24 14:31, Jens Axboe wrote:
>>> On 11/21/24 7:25 AM, Pavel Begunkov wrote:
>>>> On 11/21/24 01:12, Jens Axboe wrote:
>>>>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>>>>> On 11/20/24 22:14, David Wei wrote:
>>>> ...
>>>>> I think that can only work if we change work_llist to be a regular list
>>>>> with regular locking. Otherwise it's a bit of a mess with the list being
>>>>>
>>>>>> Dylan once measured the overhead of locks vs atomics in this
>>>>>> path for some artificial case, we can pull the numbers up.
>>>>>
>>>>> I did it more recently if you'll remember, actually posted a patch I
>>>>> think a few months ago changing it to that. But even that approach adds
>>>>
>>>> Right, and it'd be a separate topic from this set.
>>>>
>>>>> extra overhead, if you want to add it to the same list as now you need
>>>>
>>>> Extra overhead to the retry path, which is not the hot path,
>>>> and the coldness of it is uncertain.
>>>>
>>>>> to re-grab (and re-disable interrupts) the lock to add it back. My gut
>>>>> says that would be _worse_ than the current approach. And if you keep a
>>>>> separate list instead, well then you're back to identical overhead in
>>>>> terms of now needing to check both when needing to know if anything is
>>>>> pending, and checking both when running it.
>>>>>
>>>>> reordered, and then you're spending extra cycles on potentially
>>>>> reordering all the entries again.
>>>>
>>>> That sucks, I agree, but then it's the same question of how often
>>>> it happens.
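
To make the tradeoff concrete: deferred task work sits on a lock-free llist,
where producers push entries with an atomic op and the consumer detaches the
whole batch at once, in reverse order. Below is a minimal userspace sketch of
that scheme using C11 atomics; it is illustrative only, not the actual
io_uring code, and the names are made up.

    #include <stdatomic.h>
    #include <stddef.h>

    struct work {
            struct work *next;
            void (*fn)(struct work *);
    };

    /* Producer side: lock-free LIFO push, one CAS per entry, in the
     * spirit of the kernel's llist_add(). No lock, no irq-off section. */
    static void work_push(_Atomic(struct work *) *head, struct work *w)
    {
            w->next = atomic_load_explicit(head, memory_order_relaxed);
            while (!atomic_compare_exchange_weak_explicit(head, &w->next, w,
                            memory_order_release, memory_order_relaxed))
                    ;
    }

    /* Consumer side: detach the whole list with a single xchg, then
     * reverse it so entries run in the order they were queued. */
    static struct work *work_grab_fifo(_Atomic(struct work *) *head)
    {
            struct work *node = atomic_exchange_explicit(head, NULL,
                                                         memory_order_acquire);
            struct work *fifo = NULL;

            while (node) {
                    struct work *next = node->next;
                    node->next = fifo;
                    fifo = node;
                    node = next;
            }
            return fifo;
    }

If the consumer stops early (say, after a budget of entries), the leftovers
have to be re-pushed one at a time and get reversed yet again on the next
grab; that is the reordering cost discussed above, while avoiding it with a
plain locked list re-introduces lock/irq-off overhead on the hot queueing path.
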
>>>
>>> At least for now, there's a real issue reported and we should fix it. I
>>> think the current patches are fine in that regard. That doesn't mean we
>>> can't potentially make it better, we should certainly investigate that.
>>> But I don't see the current patches as being suboptimal really, they are
>>> definitely good enough as-is for solving the issue.
>>
>> That's fair enough, but I still would love to know how frequent
>> it is. There is no point in optimising it as a hot/slow path split if
>> it triggers every fifth run or so. David, how easy is it to
>> get some stats? We can hack up some bpftrace script.
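
For reference, a histogram like the one below can be produced by a bpftrace
one-liner along these lines. This is a guess at what such a script would look
like, assuming __io_run_local_work() has not been inlined and relying on its
return value being the number of task work items processed:

    bpftrace -e 'kretprobe:__io_run_local_work { @work_done = hist(retval); }'
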
>
> Here is a sample distribution of how many task work items are run per
> __io_run_local_work() call:
>
> @work_done:
> [1]          15385954 |@                                                   |
> [2, 4)       33424809 |@@@@                                                |
> [4, 8)      196055270 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
> [8, 16)     419060191 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
> [16, 32)     48395043 |@@@@@@                                              |
> [32, 64)      1573469 |                                                    |
> [64, 128)       98151 |                                                    |
> [128, 256)      14288 |                                                    |
> [256, 512)       2035 |                                                    |
> [512, 1K)         268 |                                                    |
> [1K, 2K)           13 |                                                    |

Nice

> This workload had wait_nr set to 20 and the timeout set to 500 µs.
> Empirically, I know that any run doing > 50 task work items will violate
> the latency limit for this workload. In these cases, all the requests
> must be dropped. So even if excessive task work happens in only a small
> % of runs, the impact is far larger than that % suggests.
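
For context, a wait loop with those parameters might look like the liburing
sketch below; the wait_nr of 20 and the 500 µs timeout are the values David
describes, while the function and everything around it are illustrative:

    #include <liburing.h>

    /* Wait for up to 20 completions or 500us, whichever comes first;
     * deferred task work is run as part of this wait. */
    static int wait_batch(struct io_uring *ring)
    {
            struct io_uring_cqe *cqe;
            struct __kernel_timespec ts = { .tv_sec = 0, .tv_nsec = 500 * 1000 };

            return io_uring_submit_and_wait_timeout(ring, &cqe, 20, &ts, NULL);
    }
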
So you've got a long tail, which spikes your nines, that makes sense.
On the other hand it's perhaps 5-10% of the total, though that's hard to
judge as the [16,32) bucket is split by the constant 20. My guess would
be that a small optimisation for the normal case, even if it adds a bit
more to the requeue path, may well be worth it, but that depends on how
sharp the skew within the bucket is.
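
Working that estimate through the histogram above: the buckets sum to roughly
714M runs in total, of which

    [16, 32)             ~48.4M   (~6.8% of runs)
    [32, 64) and above    ~1.69M  (~0.24% of runs)

so the fraction of runs exceeding wait_nr=20 lies somewhere between 0.24% and
7%, depending on how the [16,32) bucket splits around 20.
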
--
Pavel Begunkov