On Sat, Jul 1, 2017 at 10:18 AM, Brian King <brking@xxxxxxxxxxxxxxxxxx> wrote:
> On 06/30/2017 06:26 PM, Jens Axboe wrote:
>> On 06/30/2017 05:23 PM, Ming Lei wrote:
>>> Hi Brian,
>>>
>>> On Sat, Jul 1, 2017 at 2:33 AM, Brian King <brking@xxxxxxxxxxxxxxxxxx> wrote:
>>>> On 06/30/2017 09:08 AM, Jens Axboe wrote:
>>>>>>>> Compared with the totally per-CPU approach, this way might help 1:M or
>>>>>>>> N:M mappings, but it won't help the 1:1 mapping (NVMe), where an hctx is
>>>>>>>> mapped to each CPU (especially when there are huge numbers of hw queues
>>>>>>>> on a big system), :-(
>>>>>>>
>>>>>>> Not disagreeing with that, without having some mechanism to only
>>>>>>> loop over queues that have pending requests. That would be similar to the
>>>>>>> ctx_map for sw to hw queues. But I don't think that would be worthwhile
>>>>>>> doing; I like your pnode approach better. However, I'm still not fully
>>>>>>> convinced that one per node is enough to get the scalability we need.
>>>>>>>
>>>>>>> It would be great if Brian could re-test with your updated patch, so we
>>>>>>> know how it works for him at least.
>>>>>>
>>>>>> I'll try running with both approaches today and see how they compare.
>>>>>
>>>>> Focus on Ming's; a variant of that is the most likely path forward,
>>>>> imho. It'd be great to do a quick run on mine as well, just to establish
>>>>> how it compares to mainline, though.
>>>>
>>>> On my initial runs, the one from you, Jens, appears to perform a bit better,
>>>> although both are a huge improvement over what I was seeing before.
>>>>
>>>> I ran 4k random reads using fio to null_blk in two configurations on my 20 core
>>>> system with 4 NUMA nodes and 4-way SMT, so 80 logical CPUs. I ran both 80 threads
>>>> to a single null_blk as well as 80 threads to 80 null_blk devices, so one thread
>>>
>>> Could you share what the '80 null_blk devices' means? Did you create
>>> 80 null_blk devices? Or did you create one null_blk and set its hw queue
>>> count to 80 via the 'submit_queues' module parameter?
>>
>> That's a valid question; I was going to ask that too. But I assumed that Brian
>> used submit_queues to set as many queues as he has logical CPUs in the system.
>>>
>>> I guess we should focus on the multi-queue case, since that is how NVMe is
>>> normally used.
>>>
>>>> per null_blk. This is what I saw on this machine:
>>>>
>>>> Using the per-node atomic change from Ming Lei
>>>> 1 null_blk, 80 threads
>>>> iops=9376.5K
>>>>
>>>> 80 null_blk, 1 thread
>>>> iops=9523.5K
>>>>
>>>>
>>>> Using the alternate patch from Jens using the tags
>>>> 1 null_blk, 80 threads
>>>> iops=9725.8K
>>>>
>>>> 80 null_blk, 1 thread
>>>> iops=9569.4K
>>>
>>> If '1 thread' means a single fio job, the number looks far too high; that would
>>> mean one random IO completes in about 0.1us (100ns) on a single CPU, and I'm
>>> not sure that is possible, :-)
>>
>> It means either 1 null_blk device with 80 threads running IO to it, or 80 null_blk
>> devices, each with a thread running IO to it. See above, he details that it's
>> 80 threads on 80 devices for that case.
>
> Right. So the two modes I'm running in are:
>
> 1. 80 null_blk devices, each with one submit_queue, with one fio job per null_blk
>    device, so 80 threads total, on 80 logical CPUs.
> 2. 1 null_blk device, with 80 submit_queues, 80 fio jobs, 80 logical CPUs.
>
> In theory, the two should result in similar numbers.
>
> Here are the commands and fio configurations:
>
> Scenario #1
> modprobe null_blk submit_queues=80 nr_devices=1 irqmode=0
>
> FIO config:
> [global]
> buffered=0
> invalidate=1
> bs=4k
> iodepth=64
> numjobs=80
> group_reporting=1
> rw=randrw
> rwmixread=100
> rwmixwrite=0
> ioengine=libaio
> runtime=60
> time_based
>
> [job1]
> filename=/dev/nullb0
>
>
> Scenario #2
> modprobe null_blk submit_queues=1 nr_devices=80 irqmode=0
>
> FIO config:
> [global]
> buffered=0
> invalidate=1
> bs=4k
> iodepth=64
> numjobs=1
> group_reporting=1
> rw=randrw
> rwmixread=100
> rwmixwrite=0
> ioengine=libaio
> runtime=60
> time_based
>
> [job1]
> filename=/dev/nullb0
> [job2]
> filename=/dev/nullb1
> [job3]
> filename=/dev/nullb2
> [job4]
> filename=/dev/nullb3
> [job5]
> filename=/dev/nullb4
> [job6]
> filename=/dev/nullb5
> [job7]
> filename=/dev/nullb6
> [job8]
> filename=/dev/nullb7
> [job9]
> filename=/dev/nullb8
> [job10]
> filename=/dev/nullb9
> [job11]
> filename=/dev/nullb10
> [job12]
> filename=/dev/nullb11
> [job13]
> filename=/dev/nullb12
> [job14]
> filename=/dev/nullb13
> [job15]
> filename=/dev/nullb14
> [job16]
> filename=/dev/nullb15
> [job17]
> filename=/dev/nullb16
> [job18]
> filename=/dev/nullb17
> [job19]
> filename=/dev/nullb18
> [job20]
> filename=/dev/nullb19
> [job21]
> filename=/dev/nullb20
> [job22]
> filename=/dev/nullb21
> [job23]
> filename=/dev/nullb22
> [job24]
> filename=/dev/nullb23
> [job25]
> filename=/dev/nullb24
> [job26]
> filename=/dev/nullb25
> [job27]
> filename=/dev/nullb26
> [job28]
> filename=/dev/nullb27
> [job29]
> filename=/dev/nullb28
> [job30]
> filename=/dev/nullb29
> [job31]
> filename=/dev/nullb30
> [job32]
> filename=/dev/nullb31
> [job33]
> filename=/dev/nullb32
> [job34]
> filename=/dev/nullb33
> [job35]
> filename=/dev/nullb34
> [job36]
> filename=/dev/nullb35
> [job37]
> filename=/dev/nullb36
> [job38]
> filename=/dev/nullb37
> [job39]
> filename=/dev/nullb38
> [job40]
> filename=/dev/nullb39
> [job41]
> filename=/dev/nullb40
> [job42]
> filename=/dev/nullb41
> [job43]
> filename=/dev/nullb42
> [job44]
> filename=/dev/nullb43
> [job45]
> filename=/dev/nullb44
> [job46]
> filename=/dev/nullb45
> [job47]
> filename=/dev/nullb46
> [job48]
> filename=/dev/nullb47
> [job49]
> filename=/dev/nullb48
> [job50]
> filename=/dev/nullb49
> [job51]
> filename=/dev/nullb50
> [job52]
> filename=/dev/nullb51
> [job53]
> filename=/dev/nullb52
> [job54]
> filename=/dev/nullb53
> [job55]
> filename=/dev/nullb54
> [job56]
> filename=/dev/nullb55
> [job57]
> filename=/dev/nullb56
> [job58]
> filename=/dev/nullb57
> [job59]
> filename=/dev/nullb58
> [job60]
> filename=/dev/nullb59
> [job61]
> filename=/dev/nullb60
> [job62]
> filename=/dev/nullb61
> [job63]
> filename=/dev/nullb62
> [job64]
> filename=/dev/nullb63
> [job65]
> filename=/dev/nullb64
> [job66]
> filename=/dev/nullb65
> [job67]
> filename=/dev/nullb66
> [job68]
> filename=/dev/nullb67
> [job69]
> filename=/dev/nullb68
> [job70]
> filename=/dev/nullb69
> [job71]
> filename=/dev/nullb70
> [job72]
> filename=/dev/nullb71
> [job73]
> filename=/dev/nullb72
> [job74]
> filename=/dev/nullb73
> [job75]
> filename=/dev/nullb74
> [job76]
> filename=/dev/nullb75
> [job77]
> filename=/dev/nullb76
> [job78]
> filename=/dev/nullb77
> [job79]
> filename=/dev/nullb78
> [job80]
> filename=/dev/nullb79

IMO it is more reasonable to use a single null_blk with 80 queues (by setting
submit_queues to 80) than 80 separate null_blks, so I suggest switching to the
80-queue configuration in your future tests.
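For reference, below is a minimal userspace sketch of the "per node atomic"
counting idea discussed above: submit and completion only touch a counter local
to their NUMA node, and only readers pay the cost of summing across nodes. This
is not the actual kernel patch; the function names (inflight_inc, inflight_dec,
inflight_read), the fixed NR_NODES value of 4 (chosen to match the 4-node
system in the test above), the 64-byte cache-line padding, and the use of C11
atomics in userspace are all illustrative assumptions.

#include <stdatomic.h>
#include <stdio.h>

#define NR_NODES 4  /* assumption: 4 NUMA nodes, as in the test system above */

/* pad each counter to its own (assumed 64-byte) cache line to avoid
 * false sharing between nodes */
struct node_counter {
        atomic_long inflight;
        char pad[64 - sizeof(atomic_long)];
};

static struct node_counter counters[NR_NODES];

/* submit path: bump only the counter of the submitting CPU's node */
static void inflight_inc(int node)
{
        atomic_fetch_add_explicit(&counters[node].inflight, 1,
                                  memory_order_relaxed);
}

/* completion path: drop the counter of the same node */
static void inflight_dec(int node)
{
        atomic_fetch_sub_explicit(&counters[node].inflight, 1,
                                  memory_order_relaxed);
}

/* read path (e.g. stats sampling): sum across all nodes */
static long inflight_read(void)
{
        long sum = 0;

        for (int i = 0; i < NR_NODES; i++)
                sum += atomic_load_explicit(&counters[i].inflight,
                                            memory_order_relaxed);
        return sum;
}

int main(void)
{
        inflight_inc(0);
        inflight_inc(3);
        inflight_dec(0);
        printf("in flight: %ld\n", inflight_read()); /* prints 1 */
        return 0;
}

The trade-off this sketch illustrates is the one being debated in the thread:
keeping the hot submit/complete path free of cross-node cache-line bouncing,
at the cost of a slightly more expensive read when the counter is sampled.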
thanks,
Ming Lei

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel