On 10/30/24 2:25 PM, Jens Axboe wrote:
> On 10/30/24 11:20 AM, Jann Horn wrote:
>> On Wed, Oct 30, 2024 at 5:58 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> This avoids array_index_nospec() for repeated lookups on the same node,
>>> which can be quite common (and costly). If a cached node is removed from
>>
>> You're saying array_index_nospec() can be quite costly - which
>> architecture is this on? Is this the cost of the compare+subtract+and
>> making the critical path longer?
>
> Tested this on arm64, in a vm to be specific. Let me try and generate
> some numbers/profiles on x86-64 as well. It's noticeable there as well,
> though not quite as bad as the below example. For arm64, with the patch,
> we get roughly 8.7% of the time spent getting a resource - without it's
> 66% of the time. This is just doing a microbenchmark, but it clearly
> shows that anything following the barrier on arm64 is very costly:
>
>   0.98 │       ldr   x21, [x0, #96]
>        │     ↓ tbnz  w2, #1, b8
>   1.04 │       ldr   w1, [x21, #144]
>        │       cmp   w1, w19
>        │     ↓ b.ls  a0
>        │ 30:   mov   w1, w1
>        │       sxtw  x0, w19
>        │       cmp   x0, x1
>        │       ngc   x0, xzr
>        │       csdb
>        │       ldr   x1, [x21, #160]
>        │       and   w19, w19, w0
>  93.98 │       ldr   x19, [x1, w19, sxtw #3]
>
> and accounts for most of that 66% of the total cost of the micro bench,
> even though it's doing a ton more stuff than simple getting this node
> via a lookup.

Ran some x86-64 testing, and there's no such effect on x86-64. So mostly
useful on archs with more expensive array_index_nospec(). There's
obviously a cost associated with it, but it's more of an even trade off
in terms of having the extra branch vs the nospec indexing. Which means
at that point you may as well not add the extra cache, as this
particular case always hits it, and hence it's a best case kind of test.

-- 
Jens Axboe
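
For reference, the trade-off discussed above (one extra compare-and-branch on the cached fast path vs. the masked index on every lookup) can be sketched in plain userspace C. This is only an illustration and not the patch itself: struct table, struct node, index_mask(), lookup() and remove_node() are hypothetical names, the mask is a portable approximation of the compare+subtract+and sequence rather than the kernel's array_index_nospec(), there is no speculation barrier such as csdb, and dropping the cache on removal is an assumption about what any such cache would need.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's data structures. */
struct node {
	int value;
};

struct table {
	struct node **nodes;
	size_t nr;
	/* one-entry cache of the most recent lookup */
	size_t cached_index;
	struct node *cached_node;
};

/*
 * Branch-free mask: all ones if index < size, zero otherwise.
 * Assumes size fits in the lower half of the size_t range.
 * Unlike the kernel helper, nothing here stops the compiler
 * from rewriting this, and there is no speculation barrier.
 */
static size_t index_mask(size_t index, size_t size)
{
	size_t top = ~(index | (size - 1 - index));

	return (size_t)0 - (top >> (sizeof(size_t) * CHAR_BIT - 1));
}

static struct node *lookup(struct table *t, size_t index)
{
	/* Fast path: same index as last time, skip the masking. */
	if (t->cached_node && t->cached_index == index)
		return t->cached_node;

	if (index >= t->nr)
		return NULL;

	/* Slow path: clamp the index before it is used for the load. */
	index &= index_mask(index, t->nr);
	assert(index < t->nr);

	t->cached_index = index;
	t->cached_node = t->nodes[index];
	return t->cached_node;
}

/* Assumption: removing a node has to invalidate a matching cache entry. */
static void remove_node(struct table *t, size_t index)
{
	if (index >= t->nr)
		return;
	if (t->cached_node == t->nodes[index])
		t->cached_node = NULL;
	t->nodes[index] = NULL;
}
```

A benchmark that looks up the same index in a loop only ever exercises the fast path in lookup(), which is why it amounts to a best case for the cache.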