> -----Original Message-----
> From: Ming Lei [mailto:ming.lei@xxxxxxxxxx]
> Sent: Tuesday, February 13, 2018 6:11 AM
> To: Kashyap Desai
> Cc: Hannes Reinecke; Jens Axboe; linux-block@xxxxxxxxxxxxxxx; Christoph
> Hellwig; Mike Snitzer; linux-scsi@xxxxxxxxxxxxxxx; Arun Easi; Omar
> Sandoval; Martin K. Petersen; James Bottomley; Christoph Hellwig; Don
> Brace; Peter Rivera; Paolo Bonzini; Laurence Oberman
> Subject: Re: [PATCH 0/5] blk-mq/scsi-mq: support global tags & introduce
> force_blk_mq
>
> Hi Kashyap,
>
> On Tue, Feb 13, 2018 at 12:05:14AM +0530, Kashyap Desai wrote:
> > On Sunday, February 11, 2018 at 11:01 AM, Ming Lei wrote:
> > > On Sat, Feb 10, 2018 at 09:00:57AM +0800, Ming Lei wrote:
> > > > Hi Kashyap,
> > > >
> > > > On Fri, Feb 09, 2018 at 02:12:16PM +0530, Kashyap Desai wrote:
> > > > > On Friday, February 9, 2018 at 11:01 AM, Ming Lei wrote:
> > > > > > On Fri, Feb 09, 2018 at 10:28:23AM +0530, Kashyap Desai wrote:
> > > > > > > On Thursday, February 8, 2018 at 10:23 PM, Ming Lei wrote:
> > > > > > > > On Thu, Feb 08, 2018 at 08:00:29AM +0100, Hannes Reinecke wrote:
> > > > > > > > > On 02/07/2018 03:14 PM, Kashyap Desai wrote:
> > > > > > > > > > On Wednesday, February 7, 2018 at 5:53 PM, Ming Lei wrote:
> > > > > > > > > >> On Wed, Feb 07, 2018 at 07:50:21AM +0100, Hannes Reinecke wrote:
> > > > > > > > > >>> Hi all,
> > > > > > > > > >>>
> > > > > > > > > >>> [ .. ]
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Could you share us your patch for enabling global_tags/MQ
> > > > > > > > > >>>>> on megaraid_sas so that I can reproduce your test?
> > > > > > > > > >>>>>
> > > > > > > > > >>>>>> See below perf top data. "bt_iter" is consuming 4 times
> > > > > > > > > >>>>>> more CPU.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> Could you share us what the IOPS/CPU utilization effect
> > > > > > > > > >>>>> is after applying the patch V2? And your test script?
> > > > > > > > > >>>> Regarding CPU utilization, I need to test one more time.
> > > > > > > > > >>>> Currently the system is in use.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I ran the below fio test on 24 expander-attached SSDs:
> > > > > > > > > >>>>
> > > > > > > > > >>>> numactl -N 1 fio jbod.fio --rw=randread --iodepth=64
> > > > > > > > > >>>> --bs=4k --ioengine=libaio
> > > > > > > > > >>>>
> > > > > > > > > >>>> Performance dropped from 1.6M IOPS to 770K IOPS.
> > > > > > > > > >>>>
> > > > > > > > > >>> This is basically what we've seen with earlier iterations.
> > > > > > > > > >>
> > > > > > > > > >> Hi Hannes,
> > > > > > > > > >>
> > > > > > > > > >> As I mentioned in another mail[1], Kashyap's patch has a big
> > > > > > > > > >> issue, which causes only reply queue 0 to be used.
> > > > > > > > > >>
> > > > > > > > > >> [1] https://marc.info/?l=linux-scsi&m=151793204014631&w=2
> > > > > > > > > >>
> > > > > > > > > >> So could you guys run your performance test again after
> > > > > > > > > >> fixing the patch?
> > > > > > > > > >
> > > > > > > > > > Ming -
> > > > > > > > > >
> > > > > > > > > > I tried after the change you requested. The performance drop
> > > > > > > > > > is still unresolved: from 1.6M IOPS to 770K IOPS.
> > > > > > > > > >
> > > > > > > > > > See below data. All 24 reply queues are in use correctly.
> > > > > > > > > >
> > > > > > > > > > IRQs / 1 second(s)
> > > > > > > > > > IRQ#   TOTAL  NODE0  NODE1  NAME
> > > > > > > > > >  360   16422      0  16422  IR-PCI-MSI 70254653-edge megasas
> > > > > > > > > >  364   15980      0  15980  IR-PCI-MSI 70254657-edge megasas
> > > > > > > > > >  362   15979      0  15979  IR-PCI-MSI 70254655-edge megasas
> > > > > > > > > >  345   15696      0  15696  IR-PCI-MSI 70254638-edge megasas
> > > > > > > > > >  341   15659      0  15659  IR-PCI-MSI 70254634-edge megasas
> > > > > > > > > >  369   15656      0  15656  IR-PCI-MSI 70254662-edge megasas
> > > > > > > > > >  359   15650      0  15650  IR-PCI-MSI 70254652-edge megasas
> > > > > > > > > >  358   15596      0  15596  IR-PCI-MSI 70254651-edge megasas
> > > > > > > > > >  350   15574      0  15574  IR-PCI-MSI 70254643-edge megasas
> > > > > > > > > >  342   15532      0  15532  IR-PCI-MSI 70254635-edge megasas
> > > > > > > > > >  344   15527      0  15527  IR-PCI-MSI 70254637-edge megasas
> > > > > > > > > >  346   15485      0  15485  IR-PCI-MSI 70254639-edge megasas
> > > > > > > > > >  361   15482      0  15482  IR-PCI-MSI 70254654-edge megasas
> > > > > > > > > >  348   15467      0  15467  IR-PCI-MSI 70254641-edge megasas
> > > > > > > > > >  368   15463      0  15463  IR-PCI-MSI 70254661-edge megasas
> > > > > > > > > >  354   15420      0  15420  IR-PCI-MSI 70254647-edge megasas
> > > > > > > > > >  351   15378      0  15378  IR-PCI-MSI 70254644-edge megasas
> > > > > > > > > >  352   15377      0  15377  IR-PCI-MSI 70254645-edge megasas
> > > > > > > > > >  356   15348      0  15348  IR-PCI-MSI 70254649-edge megasas
> > > > > > > > > >  337   15344      0  15344  IR-PCI-MSI 70254630-edge megasas
> > > > > > > > > >  343   15320      0  15320  IR-PCI-MSI 70254636-edge megasas
> > > > > > > > > >  355   15266      0  15266  IR-PCI-MSI 70254648-edge megasas
> > > > > > > > > >  335   15247      0  15247  IR-PCI-MSI 70254628-edge megasas
> > > > > > > > > >  363   15233      0  15233  IR-PCI-MSI 70254656-edge megasas
> > > > > > > > > >
> > > > > > > > > > Average: CPU  %usr %nice  %sys %iowait %steal %irq %soft %guest %gnice %idle
> > > > > > > > > > Average:  18  3.80  0.00 14.78   10.08   0.00 0.00  4.01   0.00   0.00 67.33
> > > > > > > > > > Average:  19  3.26  0.00 15.35   10.62   0.00 0.00  4.03   0.00   0.00 66.74
> > > > > > > > > > Average:  20  3.42  0.00 14.57   10.67   0.00 0.00  3.84   0.00   0.00 67.50
> > > > > > > > > > Average:  21  3.19  0.00 15.60   10.75   0.00 0.00  4.16   0.00   0.00 66.30
> > > > > > > > > > Average:  22  3.58  0.00 15.15   10.66   0.00 0.00  3.51   0.00   0.00 67.11
> > > > > > > > > > Average:  23  3.34  0.00 15.36   10.63   0.00 0.00  4.17   0.00   0.00 66.50
> > > > > > > > > > Average:  24  3.50  0.00 14.58   10.93   0.00 0.00  3.85   0.00   0.00 67.13
> > > > > > > > > > Average:  25  3.20  0.00 14.68   10.86   0.00 0.00  4.31   0.00   0.00 66.95
> > > > > > > > > > Average:  26  3.27  0.00 14.80   10.70   0.00 0.00  3.68   0.00   0.00 67.55
> > > > > > > > > > Average:  27  3.58  0.00 15.36   10.80   0.00 0.00  3.79   0.00   0.00 66.48
> > > > > > > > > > Average:  28  3.46  0.00 15.17   10.46   0.00 0.00  3.32   0.00   0.00 67.59
> > > > > > > > > > Average:  29  3.34  0.00 14.42   10.72   0.00 0.00  3.34   0.00   0.00 68.18
> > > > > > > > > > Average:  30  3.34  0.00 15.08   10.70   0.00 0.00  3.89   0.00   0.00 66.99
> > > > > > > > > > Average:  31  3.26  0.00 15.33   10.47   0.00 0.00  3.33   0.00   0.00 67.61
> > > > > > > > > > Average:  32  3.21  0.00 14.80   10.61   0.00 0.00  3.70   0.00   0.00 67.67
> > > > > > > > > > Average:  33  3.40  0.00 13.88   10.55   0.00 0.00  4.02   0.00   0.00 68.15
> > > > > > > > > > Average:  34  3.74  0.00 17.41   10.61   0.00 0.00  4.51   0.00   0.00 63.73
> > > > > > > > > > Average:  35  3.35  0.00 14.37   10.74   0.00 0.00  3.84   0.00   0.00 67.71
> > > > > > > > > > Average:  36  0.54  0.00  1.77    0.00   0.00 0.00  0.00   0.00   0.00 97.69
> > > > > > > > > > ..
> > > > > > > > > > Average:  54  3.60  0.00 15.17   10.39   0.00 0.00  4.22   0.00   0.00 66.62
> > > > > > > > > > Average:  55  3.33  0.00 14.85   10.55   0.00 0.00  3.96   0.00   0.00 67.31
> > > > > > > > > > Average:  56  3.40  0.00 15.19   10.54   0.00 0.00  3.74   0.00   0.00 67.13
> > > > > > > > > > Average:  57  3.41  0.00 13.98   10.78   0.00 0.00  4.10   0.00   0.00 67.73
> > > > > > > > > > Average:  58  3.32  0.00 15.16   10.52   0.00 0.00  4.01   0.00   0.00 66.99
> > > > > > > > > > Average:  59  3.17  0.00 15.80   10.35   0.00 0.00  3.86   0.00   0.00 66.80
> > > > > > > > > > Average:  60  3.00  0.00 14.63   10.59   0.00 0.00  3.97   0.00   0.00 67.80
> > > > > > > > > > Average:  61  3.34  0.00 14.70   10.66   0.00 0.00  4.32   0.00   0.00 66.97
> > > > > > > > > > Average:  62  3.34  0.00 15.29   10.56   0.00 0.00  3.89   0.00   0.00 66.92
> > > > > > > > > > Average:  63  3.29  0.00 14.51   10.72   0.00 0.00  3.85   0.00   0.00 67.62
> > > > > > > > > > Average:  64  3.48  0.00 15.31   10.65   0.00 0.00  3.97   0.00   0.00 66.60
> > > > > > > > > > Average:  65  3.34  0.00 14.36   10.80   0.00 0.00  4.11   0.00   0.00 67.39
> > > > > > > > > > Average:  66  3.13  0.00 14.94   10.70   0.00 0.00  4.10   0.00   0.00 67.13
> > > > > > > > > > Average:  67  3.06  0.00 15.56   10.69   0.00 0.00  3.82   0.00   0.00 66.88
> > > > > > > > > > Average:  68  3.33  0.00 14.98   10.61   0.00 0.00  3.81   0.00   0.00 67.27
> > > > > > > > > > Average:  69  3.20  0.00 15.43   10.70   0.00 0.00  3.82   0.00   0.00 66.85
> > > > > > > > > > Average:  70  3.34  0.00 17.14   10.59   0.00 0.00  3.00   0.00   0.00 65.92
> > > > > > > > > > Average:  71  3.41  0.00 14.94   10.56   0.00 0.00  3.41   0.00   0.00 67.69
> > > > > > > > > >
> > > > > > > > > > Perf top -
> > > > > > > > > >
> > > > > > > > > > 64.33%  [kernel]  [k] bt_iter
> > > > > > > > > >  4.86%  [kernel]  [k] blk_mq_queue_tag_busy_iter
> > > > > > > > > >  4.23%  [kernel]  [k] _find_next_bit
> > > > > > > > > >  2.40%  [kernel]  [k] native_queued_spin_lock_slowpath
> > > > > > > > > >  1.09%  [kernel]  [k] sbitmap_any_bit_set
> > > > > > > > > >  0.71%  [kernel]  [k] sbitmap_queue_clear
> > > > > > > > > >  0.63%  [kernel]  [k] find_next_bit
> > > > > > > > > >  0.54%  [kernel]  [k] _raw_spin_lock_irqsave
> > > > > > > > > >
> > > > > > > > > Ah. So we're spending quite some time in trying to find a free
> > > > > > > > > tag. I guess this is due to every queue starting at the same
> > > > > > > > > position trying to find a free tag, which inevitably leads to
> > > > > > > > > contention.
> > > > > > > >
> > > > > > > > IMO, the above trace means that blk_mq_in_flight() may be the
> > > > > > > > bottleneck, and it looks not related to tag allocation.
> > > > > > > >
> > > > > > > > Kashyap, could you run your performance test again after
> > > > > > > > disabling iostats via the following command on all test devices,
> > > > > > > > and after killing all utilities which may read iostats
> > > > > > > > (/proc/diskstats, ...)?
> > > > > > > >
> > > > > > > >   echo 0 > /sys/block/sdN/queue/iostats
> > > > > > >
> > > > > > > Ming - After changing iostats to 0, I see the performance issue is
> > > > > > > resolved.
> > > > > > >
> > > > > > > Below is the perf top output after iostats = 0:
> > > > > > >
> > > > > > > 23.45%  [kernel]        [k] bt_iter
> > > > > > >  2.27%  [kernel]        [k] blk_mq_queue_tag_busy_iter
> > > > > > >  2.18%  [kernel]        [k] _find_next_bit
> > > > > > >  2.06%  [megaraid_sas]  [k] complete_cmd_fusion
> > > > > > >  1.87%  [kernel]        [k] clflush_cache_range
> > > > > > >  1.70%  [kernel]        [k] dma_pte_clear_level
> > > > > > >  1.56%  [kernel]        [k] __domain_mapping
> > > > > > >  1.55%  [kernel]        [k] sbitmap_queue_clear
> > > > > > >  1.30%  [kernel]        [k] gup_pgd_range
> > > > > >
> > > > > > Hi Kashyap,
> > > > > >
> > > > > > Thanks for your test and update.
> > > > > >
> > > > > > It looks like blk_mq_queue_tag_busy_iter() is still sampled by perf
> > > > > > even though iostats is disabled; I guess there may be utilities
> > > > > > which are reading iostats a bit frequently.
> > > > >
> > > > > I will be doing some more testing and will post you my findings.
> > > >
> > > > I will find some time this weekend to see if I can cook a patch to
> > > > address this issue of io accounting.
> > >
> > > Hi Kashyap,
> > >
> > > Please test the top 5 patches in the following tree to see if
> > > megaraid_sas's performance is OK:
> > >
> > > https://github.com/ming1/linux/commits/v4.15-for-next-global-tags-v2
> > >
> > > This tree is made by adding these 5 patches against patchset V2.
> >
> > Ming -
> >
> > I applied the 5 patches on top of V2 and the behavior is still
> > unchanged. Below is the perf top data (1000K IOPS):
> >
> > 34.58%  [kernel]        [k] bt_iter
> >  2.96%  [kernel]        [k] sbitmap_any_bit_set
> >  2.77%  [kernel]        [k] bt_iter_global_tags
> >  1.75%  [megaraid_sas]  [k] complete_cmd_fusion
> >  1.62%  [kernel]        [k] sbitmap_queue_clear
> >  1.62%  [kernel]        [k] _raw_spin_lock
> >  1.51%  [kernel]        [k] blk_mq_run_hw_queue
> >  1.45%  [kernel]        [k] gup_pgd_range
> >  1.31%  [kernel]        [k] irq_entries_start
> >  1.29%  fio             [.] __fio_gettime
> >  1.13%  [kernel]        [k] _raw_spin_lock_irqsave
> >  0.95%  [kernel]        [k] native_queued_spin_lock_slowpath
> >  0.92%  [kernel]        [k] scsi_queue_rq
> >  0.91%  [kernel]        [k] blk_mq_run_hw_queues
> >  0.85%  [kernel]        [k] blk_mq_get_request
> >  0.81%  [kernel]        [k] switch_mm_irqs_off
> >  0.78%  [megaraid_sas]  [k] megasas_build_io_fusion
> >  0.77%  [kernel]        [k] __schedule
> >  0.73%  [kernel]        [k] update_load_avg
> >  0.69%  [kernel]        [k] fput
> >  0.65%  [kernel]        [k] scsi_dispatch_cmd
> >  0.64%  fio             [.] fio_libaio_event
> >  0.53%  [kernel]        [k] do_io_submit
> >  0.52%  [kernel]        [k] read_tsc
> >  0.51%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >  0.51%  [kernel]        [k] scsi_softirq_done
> >  0.50%  [kernel]        [k] kobject_put
> >  0.50%  [kernel]        [k] cpuidle_enter_state
> >  0.49%  [kernel]        [k] native_write_msr
> >  0.48%  fio             [.] io_completed
> >
> > Below is the perf top data with iostats=0 (1400K IOPS):
> >
> >  4.87%  [kernel]        [k] sbitmap_any_bit_set
> >  2.93%  [kernel]        [k] _raw_spin_lock
> >  2.84%  [megaraid_sas]  [k] complete_cmd_fusion
> >  2.38%  [kernel]        [k] irq_entries_start
> >  2.36%  [kernel]        [k] gup_pgd_range
> >  2.35%  [kernel]        [k] blk_mq_run_hw_queue
> >  2.30%  [kernel]        [k] sbitmap_queue_clear
> >  2.01%  fio             [.] __fio_gettime
> >  1.78%  [kernel]        [k] _raw_spin_lock_irqsave
> >  1.51%  [kernel]        [k] scsi_queue_rq
> >  1.43%  [kernel]        [k] blk_mq_run_hw_queues
> >  1.36%  [kernel]        [k] fput
> >  1.32%  [kernel]        [k] __schedule
> >  1.31%  [kernel]        [k] switch_mm_irqs_off
> >  1.29%  [kernel]        [k] update_load_avg
> >  1.25%  [megaraid_sas]  [k] megasas_build_io_fusion
> >  1.22%  [kernel]        [k] native_queued_spin_lock_slowpath
> >  1.03%  [kernel]        [k] scsi_dispatch_cmd
> >  1.03%  [kernel]        [k] blk_mq_get_request
> >  0.91%  fio             [.] fio_libaio_event
> >  0.89%  [kernel]        [k] scsi_softirq_done
> >  0.87%  [kernel]        [k] kobject_put
> >  0.86%  [kernel]        [k] cpuidle_enter_state
> >  0.84%  fio             [.] io_completed
> >  0.83%  [kernel]        [k] do_io_submit
> >  0.83%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >  0.83%  [kernel]        [k] __switch_to
> >  0.82%  [kernel]        [k] read_tsc
> >  0.80%  [kernel]        [k] native_write_msr
> >  0.76%  [kernel]        [k] aio_comp
> >
> > Perf data without the V2 patches applied (1600K IOPS):
> >
> >  5.97%  [megaraid_sas]  [k] complete_cmd_fusion
> >  5.24%  [kernel]        [k] bt_iter
> >  3.28%  [kernel]        [k] _raw_spin_lock
> >  2.98%  [kernel]        [k] irq_entries_start
> >  2.29%  fio             [.] __fio_gettime
> >  2.04%  [kernel]        [k] scsi_queue_rq
> >  1.92%  [megaraid_sas]  [k] megasas_build_io_fusion
> >  1.61%  [kernel]        [k] switch_mm_irqs_off
> >  1.59%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
> >  1.41%  [kernel]        [k] scsi_dispatch_cmd
> >  1.33%  [kernel]        [k] scsi_softirq_done
> >  1.18%  [kernel]        [k] gup_pgd_range
> >  1.18%  [kernel]        [k] blk_mq_complete_request
> >  1.13%  [kernel]        [k] blk_mq_free_request
> >  1.05%  [kernel]        [k] do_io_submit
> >  1.04%  [kernel]        [k] _find_next_bit
> >  1.02%  [kernel]        [k] blk_mq_get_request
> >  0.95%  [megaraid_sas]  [k] megasas_build_ldio_fusion
> >  0.95%  [kernel]        [k] scsi_dec_host_busy
> >  0.89%  fio             [.] get_io_u
> >  0.88%  [kernel]        [k] entry_SYSCALL_64
> >  0.84%  [megaraid_sas]  [k] megasas_queue_command
> >  0.79%  [kernel]        [k] native_write_msr
> >  0.77%  [kernel]        [k] read_tsc
> >  0.73%  [kernel]        [k] _raw_spin_lock_irqsave
> >  0.73%  fio             [.] fio_libaio_commit
> >  0.72%  [kernel]        [k] kmem_cache_alloc
> >  0.72%  [kernel]        [k] blkdev_direct_IO
> >  0.69%  [megaraid_sas]  [k] MR_GetPhyParams
> >  0.68%  [kernel]        [k] blk_mq_dequeue_from_ctx
>
> The above data is very helpful to understand the issue, great thanks!
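The bt_iter cost above is the IO accounting path: with iostats enabled,
blk-mq recomputes the per-partition in-flight count on the hot path via
blk_mq_in_flight(), which walks every allocated tag through
blk_mq_queue_tag_busy_iter()/bt_iter. A rough sketch of that v4.15-era
path (simplified for illustration, not the exact source):

    #include <linux/blk-mq.h>
    #include <linux/genhd.h>

    struct mq_inflight {
            struct hd_struct *part;
            unsigned int inflight;
    };

    /* Called by blk_mq_queue_tag_busy_iter() for each in-flight request;
     * bt_iter is the internal helper that scans the tag sbitmap to find
     * those requests in the first place. */
    static void mq_count_in_flight(struct blk_mq_hw_ctx *hctx,
                                   struct request *rq, void *priv,
                                   bool reserved)
    {
            struct mq_inflight *mi = priv;

            if (rq->part == mi->part)
                    mi->inflight++;
    }

    static unsigned int mq_in_flight(struct request_queue *q,
                                     struct hd_struct *part)
    {
            struct mq_inflight mi = { .part = part, .inflight = 0 };

            /* O(tag-space) sbitmap walk on every accounting sample. */
            blk_mq_queue_tag_busy_iter(q, mq_count_in_flight, &mi);

            return mi.inflight;
    }

With host-wide (global) tags, each walk covers the whole ~8K-tag space
rather than one queue's smaller share, which is presumably why bt_iter
jumps from ~5% to ~34% in the profiles above.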
>
> With this patchset V2 and the 5 patches, if iostats is set to 0, IOPS is
> 1400K, but 1600K IOPS can be reached without all these patches with
> iostats set to 1.
>
> BTW, could you share us what the machine is? ARM64? I saw that ARM64's
> cache-coherence performance was bad before. In the dual-socket system
> (each socket has 8 x86 CPU cores) I tested, only a ~0.5% IOPS drop can
> be observed after the 5 patches are applied on V2 in the null_blk test,
> as described in the commit log.

I am using Intel Skylake/Lewisburg/Purley.

> It looks like a single sbitmap can't perform well in the MQ case, where
> there are many more concurrent submissions and completions. In the
> single-hw-queue case (current Linus tree), one hctx->run_work only
> allows one __blk_mq_run_hw_queue() to run in 'async' mode, and reply
> queues are used in a round-robin way, which may cause contention on the
> single sbitmap too. IO accounting especially may consume a bit more
> CPU, and I guess that may contribute to the CPU lockup.
>
> Could you run your test without the V2 patches by setting 'iostats' to 0?

Tested without the V2 patch set, iostats=1. IOPS = 1600K:

 5.93%  [megaraid_sas]  [k] complete_cmd_fusion
 5.34%  [kernel]        [k] bt_iter
 3.23%  [kernel]        [k] _raw_spin_lock
 2.92%  [kernel]        [k] irq_entries_start
 2.57%  fio             [.] __fio_gettime
 2.10%  [kernel]        [k] scsi_queue_rq
 1.98%  [megaraid_sas]  [k] megasas_build_io_fusion
 1.93%  [kernel]        [k] switch_mm_irqs_off
 1.79%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
 1.45%  [kernel]        [k] scsi_softirq_done
 1.42%  [kernel]        [k] scsi_dispatch_cmd
 1.23%  [kernel]        [k] blk_mq_complete_request
 1.11%  [megaraid_sas]  [k] megasas_build_ldio_fusion
 1.11%  [kernel]        [k] gup_pgd_range
 1.08%  [kernel]        [k] blk_mq_free_request
 1.03%  [kernel]        [k] do_io_submit
 1.02%  [kernel]        [k] _find_next_bit
 1.00%  [kernel]        [k] scsi_dec_host_busy
 0.94%  [kernel]        [k] blk_mq_get_request
 0.93%  [megaraid_sas]  [k] megasas_queue_command
 0.92%  [kernel]        [k] native_write_msr
 0.85%  fio             [.] get_io_u
 0.83%  [kernel]        [k] entry_SYSCALL_64
 0.83%  [kernel]        [k] _raw_spin_lock_irqsave
 0.82%  [kernel]        [k] read_tsc
 0.81%  [sd_mod]        [k] sd_init_command
 0.67%  [kernel]        [k] kmem_cache_alloc
 0.63%  [kernel]        [k] memset_erms
 0.63%  [kernel]        [k] aio_read_events
 0.62%  [kernel]        [k] blkdev_direct_IO

Tested without the V2 patch set, iostats=0. IOPS = 1600K:

 5.79%  [megaraid_sas]  [k] complete_cmd_fusion
 3.28%  [kernel]        [k] _raw_spin_lock
 3.28%  [kernel]        [k] irq_entries_start
 2.10%  [kernel]        [k] scsi_queue_rq
 1.96%  fio             [.] __fio_gettime
 1.85%  [megaraid_sas]  [k] megasas_build_io_fusion
 1.68%  [megaraid_sas]  [k] megasas_build_and_issue_cmd_fusion
 1.36%  [kernel]        [k] gup_pgd_range
 1.36%  [kernel]        [k] scsi_dispatch_cmd
 1.28%  [kernel]        [k] do_io_submit
 1.25%  [kernel]        [k] switch_mm_irqs_off
 1.20%  [kernel]        [k] blk_mq_free_request
 1.18%  [megaraid_sas]  [k] megasas_build_ldio_fusion
 1.11%  [kernel]        [k] dput
 1.07%  [kernel]        [k] scsi_softirq_done
 1.07%  fio             [.] get_io_u
 1.07%  [kernel]        [k] scsi_dec_host_busy
 1.02%  [kernel]        [k] blk_mq_get_request
 0.96%  [sd_mod]        [k] sd_init_command
 0.92%  [kernel]        [k] entry_SYSCALL_64
 0.89%  [kernel]        [k] blk_mq_make_request
 0.87%  [kernel]        [k] blkdev_direct_IO
 0.84%  [kernel]        [k] blk_mq_complete_request
 0.78%  [kernel]        [k] _raw_spin_lock_irqsave
 0.77%  [kernel]        [k] lookup_ioctx
 0.76%  [megaraid_sas]  [k] MR_GetPhyParams
 0.75%  [kernel]        [k] blk_mq_dequeue_from_ctx
 0.75%  [kernel]        [k] memset_erms
 0.74%  [kernel]        [k] kmem_cache_alloc
 0.72%  [megaraid_sas]  [k] megasas_queue_command

> and could you share us what the .can_queue is in this HBA?

can_queue = 8072. In my test I used --iodepth=128 for 12 SCSI devices
(R0 volumes), so fio will only push 1536 outstanding commands.
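For context on that number: on scsi-mq the host's can_queue directly
sizes the blk-mq tag space, so those 8072 tags bound outstanding commands
host-wide, and fio's 12 x 128 = 1536 in-flight commands never come close
to exhausting them; the drop is contention on shared structures, not tag
starvation. A simplified sketch of the v4.15-era setup path (shape only,
not the exact source):

    #include <linux/blk-mq.h>
    #include <linux/string.h>
    #include <scsi/scsi_host.h>

    /* Sketch of scsi-mq tag-set sizing: one blk-mq tag per possible
     * outstanding command, host-wide, taken straight from can_queue. */
    static int sketch_scsi_mq_setup_tags(struct Scsi_Host *shost)
    {
            struct blk_mq_tag_set *set = &shost->tag_set;

            memset(set, 0, sizeof(*set));
            set->ops = &scsi_mq_ops;  /* scsi-mq's blk_mq_ops (scsi_lib.c) */
            set->nr_hw_queues = shost->nr_hw_queues ? : 1;
            set->queue_depth = shost->can_queue;  /* 8072 on this HBA */
            set->numa_node = NUMA_NO_NODE;
            set->flags = BLK_MQ_F_SHOULD_MERGE;

            return blk_mq_alloc_tag_set(set);
    }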
> > > If possible, please provide us the performance data without these
> > > patches and with these patches, together with perf trace.
> > >
> > > The top 5 patches are for addressing the io accounting issue, which
> > > should be the main reason for your performance drop, and even for
> > > the lockup in megaraid_sas's ISR, IMO.
> >
> > I think the performance drop is a different issue, maybe a side effect
> > of the patch set. Even if we fix this perf issue, the CPU lockup is a
> > completely different issue.
>
> The performance drop is caused by the global sbitmap data structure,
> which is accessed from all CPUs concurrently.
>
> > Regarding the CPU lockup, there was a similar discussion, and folks
> > are finding that irq_poll is a good method to resolve the lockup. Not
> > sure why the NVMe driver did not opt for irq_poll, but there was
> > extensive discussion, and I am also
>
> NVMe's hw queues don't use host-wide tags, so it has no such issue.
>
> > seeing the CPU lockup mainly because multiple completion/reply queues
> > are tied to a single CPU. We have the weight mechanism in irq_poll to
> > quit the ISR, and that is how we can avoid the lockup.
> > http://lists.infradead.org/pipermail/linux-nvme/2017-January/007724.html
>
> This patch can make sure that one request is always completed on the
> submission CPU, but contention on the global sbitmap is too big and
> causes the performance drop.
>
> Now this looks like a really interesting topic for discussion.
>
>
> Thanks,
> Ming
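To make the irq_poll point concrete: the pattern from that linux-nvme
thread looks roughly like the sketch below (a generic example against the
<linux/irq_poll.h> API, not actual megaraid_sas or NVMe code; the myhba_*
names and the budgeted reply-processing helper are illustrative
assumptions). The hard IRQ handler only schedules polling, and the poll
callback completes at most 'budget' commands per invocation, so a CPU
whose reply queues stay busy is released back to softirq/scheduler
context instead of spinning in the ISR until the watchdog fires.

    #include <linux/interrupt.h>
    #include <linux/irq_poll.h>

    #define MYHBA_IRQPOLL_WEIGHT    32  /* max completions per poll run */

    struct myhba_reply_queue {
            struct irq_poll iop;
            /* ... reply ring, consumer index, irq number, etc. ... */
    };

    /* Hypothetical helper: drain up to 'budget' entries from the reply
     * ring, completing the corresponding commands; returns how many. */
    static int myhba_process_replies(struct myhba_reply_queue *rq,
                                     int budget);

    static irqreturn_t myhba_isr(int irq, void *data)
    {
            struct myhba_reply_queue *rq = data;

            /* Mask this reply queue's interrupt, then defer to polling. */
            irq_poll_sched(&rq->iop);
            return IRQ_HANDLED;
    }

    static int myhba_irqpoll(struct irq_poll *iop, int budget)
    {
            struct myhba_reply_queue *rq =
                    container_of(iop, struct myhba_reply_queue, iop);
            int done = myhba_process_replies(rq, budget);

            if (done < budget) {
                    /* Ring drained: stop polling, unmask the interrupt. */
                    irq_poll_complete(iop);
                    /* myhba_unmask_reply_irq(rq); -- hardware-specific */
            }
            return done;  /* done == budget keeps the poll rescheduled */
    }

    /* At init time, once per reply queue: */
    /*   irq_poll_init(&rq->iop, MYHBA_IRQPOLL_WEIGHT, myhba_irqpoll); */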