On Tue, May 10, 2022 at 02:44:50PM +0100, John Garry wrote:
> On 10/05/2022 13:50, Jens Axboe wrote:
> > > fio config:
> > > bs=4096, iodepth=128, numjobs=10, cpus_allowed_policy=split, rw=read,
> > > ioscheduler=none
> > >
> > > Before:
> > > 7130K
> > >
> > > After:
> > > 7630K
> > >
> > > So a +7% IOPS gain.
>
> Thanks for having a look.
>
> > What does the comparison run on a non-NUMA non-shared queue look like?
> > Because I bet it'd be slower.
>
> I could test more to get a solid result for that.
>
> > To be honest, I don't like this approach at all. It makes the normal
> > case quite a bit slower by having an extra layer of indirection for the
> > word, that's quite a bit of extra cost.
>
> Yes, there is the extra load. I would hope that there would be a low cost,
> but I agree that we still want to avoid it. So prob no point in testing this
> more.
>
> > It doesn't seem like a good
> > approach for the issue, as it pessimizes the normal fast case.
> >
> > Spreading the memory out does probably make sense, but we need to retain
> > the fast normal case. Making sbitmap support both, selected at init
> > time, would be far more likely to be acceptable imho.
>
> I wanted to keep the code changes minimal for an initial RFC to test the
> water.
>
> My original approach did not introduce the extra load for the normal path and
> had some init-time selection for a normal word map vs a NUMA word map, but the
> code grew and became somewhat unmanageable. I'll revisit it to see how to
> improve that.

I understand this approach just splits the shared sbitmap into per-NUMA-node
parts, but what if all IOs come from CPUs on the same NUMA node? Doesn't that
cause tag starvation and waste?

Thanks,
Ming
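
[Editor's note: to make the "init-time selection" idea above concrete, here is a
minimal user-space sketch. It is not the kernel's sbitmap API; all names
(nsb_*, NSB_FLAT, NSB_PER_NODE) are hypothetical and only illustrate keeping
the normal fast path free of the extra pointer dereference while still allowing
a per-node layout when requested at init time.]

#include <stdlib.h>

enum nsb_mode {
	NSB_FLAT,	/* single contiguous word array: one indirection */
	NSB_PER_NODE,	/* one word array per NUMA node: extra indirection */
};

struct nsb {
	enum nsb_mode mode;
	unsigned int nr_words;
	unsigned int nr_nodes;
	union {
		unsigned long *words;		/* NSB_FLAT */
		unsigned long **node_words;	/* NSB_PER_NODE */
	};
};

/* Choose the layout once at init time; callers never pay for both. */
static int nsb_init(struct nsb *sb, unsigned int nr_words,
		    unsigned int nr_nodes, enum nsb_mode mode)
{
	sb->mode = mode;
	sb->nr_words = nr_words;
	sb->nr_nodes = nr_nodes;

	if (mode == NSB_FLAT) {
		sb->words = calloc(nr_words, sizeof(*sb->words));
		return sb->words ? 0 : -1;
	}

	/*
	 * Per-node layout: in the kernel these allocations would use a
	 * node-local allocator; plain calloc() stands in here.
	 */
	sb->node_words = calloc(nr_nodes, sizeof(*sb->node_words));
	if (!sb->node_words)
		return -1;
	for (unsigned int n = 0; n < nr_nodes; n++) {
		sb->node_words[n] = calloc(nr_words / nr_nodes,
					   sizeof(**sb->node_words));
		if (!sb->node_words[n])
			return -1;	/* leaks on error; fine for a sketch */
	}
	return 0;
}

/* The flat layout resolves a word with a single dereference (fast path);
 * the per-node layout pays one extra load, which is the cost debated above. */
static unsigned long *nsb_word(struct nsb *sb, unsigned int node,
			       unsigned int index)
{
	if (sb->mode == NSB_FLAT)
		return &sb->words[index];
	return &sb->node_words[node][index];
}

[The per-node split in this sketch also makes Ming's question visible: words
owned by an idle node are not reachable from the other nodes' arrays unless a
fallback path is added, which is where tag starvation could arise.]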