On Wed, May 02, 2018 at 01:13:34PM +0530, Kashyap Desai wrote:
> Hi Ming,
>
> I was running some performance test on latest 4.17-rc and figure out
> performance drop (approximate 15% drop) due to below patch set.
> https://marc.info/?l=linux-block&m=150802309522847&w=2
>
> I observed drop on latest 4.16.6 stable and 4.17-rc kernel as well. Taking
> bisect approach, figure out that Issue is not observed using last stable
> kernel 4.14.38.
> I pick 4.14.38 stable kernel as base line and applied above patch to
> confirm the behavior.
>
> lscpu output -
>
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                72
> On-line CPU(s) list:   0-71
> Thread(s) per core:    2
> Core(s) per socket:    18
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
> Stepping:              4
> CPU MHz:               1457.182
> CPU max MHz:           2701.0000
> CPU min MHz:           1200.0000
> BogoMIPS:              5400.00
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              1024K
> L3 cache:              25344K
> NUMA node0 CPU(s):     0-17,36-53
> NUMA node1 CPU(s):     18-35,54-71
>
> I am having 16 SSDs - "SDLL1DLR400GCCA1". Created two R0 VD (each VD
> consist of 8 SSDs) using MegaRaid Ventura series adapter.
>
> fio script -
> numactl -N 1 fio 2vd.fio --bs=4k --iodepth=128 -rw=randread --group_report
> --ioscheduler=none --numjobs=4
>
>
>                  | v4.14.38-stable | patched v4.14.38-stable
>                  | mq-none         | mq-none
> ---------------------------------------------------------------------
> randread "iops"  | 1597k           | 1377k
>
>
> Below is perf tool report without patch set. ( Looks like lock contention
> is causing this drop, so provided relevant snippet)
>
> -    3.19%     2.89%  fio  [kernel.vmlinux]  [k] _raw_spin_lock
>    - 2.43% io_submit
>       - 2.30% entry_SYSCALL_64
>          - do_syscall_64
>             - 2.18% do_io_submit
>                - 1.59% blk_finish_plug
>                   - 1.59% blk_flush_plug_list
>                      - 1.59% blk_mq_flush_plug_list
>                         - 1.00% __blk_mq_delay_run_hw_queue
>                            - 0.99% blk_mq_sched_dispatch_requests
>                               - 0.63% blk_mq_dispatch_rq_list
>                                    0.60% scsi_queue_rq
>                         - 0.57% blk_mq_sched_insert_requests
>                            - 0.56% blk_mq_insert_requests
>                                 0.51% _raw_spin_lock
>
> Below is perf tool report after applying patch set.
>
> -    4.10%     3.51%  fio  [kernel.vmlinux]  [k] _raw_spin_lock
>    - 3.09% io_submit
>       - 2.97% entry_SYSCALL_64
>          - do_syscall_64
>             - 2.85% do_io_submit
>                - 2.35% blk_finish_plug
>                   - 2.35% blk_flush_plug_list
>                      - 2.35% blk_mq_flush_plug_list
>                         - 1.83% __blk_mq_delay_run_hw_queue
>                            - 1.83% __blk_mq_run_hw_queue
>                               - 1.83% blk_mq_sched_dispatch_requests
>                                  - 1.82% blk_mq_do_dispatch_ctx
>                                     - 1.14% blk_mq_dequeue_from_ctx
>                                        - 1.11% dispatch_rq_from_ctx
>                                             1.03% _raw_spin_lock
>                           0.50% blk_mq_sched_insert_requests
>
> Let me know if you want more data or is this something a known implication
> of patch-set ?

The percpu lock 'ctx->lock' shouldn't have taken so much CPU in
dispatch_rq_from_ctx, and the reason may be that the single sbitmap is
shared among all CPUs (nodes).

So this issue may be the same as your previous report. I will provide the
per-host tagset patches against v4.17-rc3 for you to test this week.

Could you run your benchmark and test patches against the v4.17-rc kernel
next time?

BTW, could you let us know whether the previous CPU lockup issue is fixed
after commit adbe552349f2 (scsi: megaraid_sas: fix selection of reply
queue)?

Actually, we did discuss this kind of issue a bit at last week's LSF/MM.

Thanks,
Ming
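
For reference, the figures quoted above can be cross-checked with a few
lines of Python. This is only a sketch: the helper name relative_drop is
made up for illustration, and the inputs are copied by hand from the IOPS
table and from the top line of each perf report.

```python
def relative_drop(baseline, patched):
    """Return the fractional drop going from baseline to patched."""
    return (baseline - patched) / baseline


# randread IOPS from the report, in thousands (mq-none on both kernels)
baseline_kiops = 1597   # v4.14.38-stable
patched_kiops = 1377    # v4.14.38-stable with the patch set applied

drop = relative_drop(baseline_kiops, patched_kiops)
print(f"randread IOPS drop: {drop:.1%}")

# Self time of _raw_spin_lock (second percentage column of the perf header)
spin_before = 2.89
spin_after = 3.51
spin_delta = spin_after - spin_before
print(f"_raw_spin_lock self time: +{spin_delta:.2f} percentage points")
```

That works out to roughly a 13.8% IOPS drop, a little under the
approximate 15% mentioned in the report, with _raw_spin_lock self time up
by 0.62 percentage points.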