On 06/30/2017 09:08 AM, Jens Axboe wrote:
>>>> Compared with the totally percpu approach, this way might help 1:M or
>>>> N:M mapping, but won't help 1:1 map(NVMe), when hctx is mapped to
>>>> each CPU(especially there are huge hw queues on a big system), :-(
>>>
>>> Not disagreeing with that, without having some mechanism to only
>>> loop queues that have pending requests. That would be similar to the
>>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>>> doing, I like your pnode approach better. However, I'm still not fully
>>> convinced that one per node is enough to get the scalability we need.
>>>
>>> Would be great if Brian could re-test with your updated patch, so we
>>> know how it works for him at least.
>>
>> I'll try running with both approaches today and see how they compare.
>
> Focus on Ming's, a variant of that is the most likely path forward,
> imho. It'd be great to do a quick run on mine as well, just to establish
> how it compares to mainline, though.

On my initial runs, the one from you, Jens, appears to perform a bit better,
although both are a huge improvement from what I was seeing before.

I ran 4k random reads using fio to null_blk in two configurations on my
20 core system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran
both 80 threads to a single null_blk as well as 80 threads to 80 null_blk
devices, so one thread per null_blk. This is what I saw on this machine:

Using the per-node atomic change from Ming Lei
1 null_blk, 80 threads
iops=9376.5K

80 null_blk, 1 thread
iops=9523.5K

Using the alternate patch from Jens using the tags
1 null_blk, 80 threads
iops=9725.8K

80 null_blk, 1 thread
iops=9569.4K

It's interesting that with this change the single device, 80 threads scenario
actually got better than the 80 null_blk scenario. I'll try on a larger
machine as well. I've got a 32 core machine I can try this on too.
Next week I can work with our performance team on running this on a system
with a bunch of nvme devices so we can then test the disk partition case as
well and see if there is any noticeable overhead.

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center