> -----Original Message-----
> From: Laurence Oberman [mailto:loberman@xxxxxxxxxx]
> Sent: Wednesday, February 28, 2018 9:52 PM
> To: Ming Lei; Kashyap Desai
> Cc: Jens Axboe; linux-block@xxxxxxxxxxxxxxx; Christoph Hellwig; Mike
> Snitzer; linux-scsi@xxxxxxxxxxxxxxx; Hannes Reinecke; Arun Easi;
> Omar Sandoval; Martin K. Petersen; James Bottomley; Christoph Hellwig;
> Don Brace; Peter Rivera
> Subject: Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance
> via .host_tagset
>
> On Wed, 2018-02-28 at 23:21 +0800, Ming Lei wrote:
> > On Wed, Feb 28, 2018 at 08:28:48PM +0530, Kashyap Desai wrote:
> > > Ming -
> > >
> > > Quick testing on my setup: performance slightly degraded (a 4-5%
> > > drop) for the megaraid_sas driver with this patch (from 1610K IOPS
> > > it goes to 1544K).
> > > I confirm that after applying this patch, we have #queue = #numa
> > > node.
> > >
> > > ls -l
> > > /sys/devices/pci0000:80/0000:80:02.0/0000:83:00.0/host10/target10:2:23/10:2:23:0/block/sdy/mq
> > > total 0
> > > drwxr-xr-x. 18 root root 0 Feb 28 09:53 0
> > > drwxr-xr-x. 18 root root 0 Feb 28 09:53 1
> >
> > OK, thanks for your test.
> >
> > As I mentioned to you, this patch should have improved performance on
> > megaraid_sas, but I guess the current slight degradation might be
> > caused by scsi_host_queue_ready() in scsi_queue_rq().
> >
> > With .host_tagset enabled and a per-NUMA-node hw queue, requests can
> > be queued to the LLD more frequently/quickly than with a single
> > queue, so the cost of atomic_inc_return(&host->host_busy) may
> > increase considerably. Think of millions of such operations; a
> > slight IOPS drop is finally observed once the hw queue depth becomes
> > half of .can_queue.
> >
> > >
> > > I would suggest skipping the megaraid_sas driver changes using
> > > shared_tagset unless and until there is an obvious gain. If the
> > > overall shared_tagset interface is committed to the kernel tree,
> > > we will investigate the real benefit of using it (in the
> > > megaraid_sas driver) in the future.
> >
> > I'd suggest not merging it until it is proven that performance can
> > be improved on a real device. Noted.
> >
> > I will try to remove the expensive
> > atomic_inc_return(&host->host_busy) from scsi_queue_rq(), since it
> > isn't needed for SCSI_MQ; once that is done, I will ask you to test
> > again.

Ming - Do you mean that removing the host_busy accounting from
scsi_queue_rq() will still provide a correct host_busy value whenever
IOs reach the LLD? (See the sketch further down in this mail.)

> >
> > Thanks,
> > Ming
>
> I will test this here as well.
> I just put the Megaraid card into my system here.
>
> Kashyap, do you have SSDs on the back end, and are you using JBODs or
> virtual devices? Let me have your config.
> I only have 6G SAS shelves, though.

Laurence - I am using 12 SSD drives in JBOD mode OR single-drive R0
mode. A single SSD is capable of ~138K IOPS (4K random read). With all
12 SSDs, performance scales linearly and goes up to ~1610K IOPS.

I think if you have 6G SAS fully loaded, you may need more drives to
reach 1600K IOPS (a sequential load with nomerges=2 is required on HDDs
to avoid IO merges at the block layer).
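To make the host_busy question above concrete, here is a simplified
sketch of the accounting being discussed. It is modeled on
scsi_host_queue_ready() in drivers/scsi/scsi_lib.c, but heavily trimmed
for illustration (recovery, starved-list, and device-blocked handling
are omitted), and the function name is invented, so treat it as a
sketch rather than the exact mainline code:

#include <scsi/scsi_host.h>

/*
 * Simplified sketch of the host-wide busy accounting discussed above;
 * modeled on scsi_host_queue_ready(), error paths trimmed. Every
 * queued command performs an atomic RMW on one host-wide counter, so
 * all CPUs and all hw queues contend on the same cache line.
 */
static bool host_queue_ready_sketch(struct Scsi_Host *shost)
{
	unsigned int busy;

	/* One atomic_inc_return() on the shared counter per request. */
	busy = atomic_inc_return(&shost->host_busy) - 1;
	if (shost->can_queue > 0 && busy >= shost->can_queue) {
		/* Host is saturated: undo the increment, retry later. */
		atomic_dec(&shost->host_busy);
		return false;
	}

	return true;
}

With several per-NUMA-node hw queues dispatching in parallel, this
shared atomic is exactly the contended point Ming describes above.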
SSD model I am using is HGST "HUSMH8020BSS200".

Here is the lscpu output of my setup -

lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               1726.217
BogoMIPS:              4199.37
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

>
> Regards
> Laurence
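(Addendum on the host_busy question above: purely as a hypothetical
sketch, not a patch from this thread, one way host_busy could still be
reported after dropping the fast-path atomic is to count in-flight
requests on demand by walking the host's shared tagset. The helpers
below are invented for illustration; they assume the host-wide tag_set
this series proposes plus the existing blk_mq_tagset_busy_iter() and
blk_mq_request_started() helpers:)

#include <linux/blk-mq.h>
#include <scsi/scsi_host.h>

/* Per-request callback: count requests that have actually started. */
static void count_in_flight(struct request *rq, void *data, bool reserved)
{
	unsigned int *count = data;

	if (blk_mq_request_started(rq))
		(*count)++;
}

/*
 * Hypothetical replacement for reading host_busy: compute the value
 * on demand instead of maintaining an atomic counter in the hot path.
 */
static unsigned int host_busy_snapshot(struct Scsi_Host *shost)
{
	unsigned int count = 0;

	blk_mq_tagset_busy_iter(&shost->tag_set, count_in_flight, &count);
	return count;
}

The trade-off is that reading the value becomes more expensive while
queueing a request becomes cheaper; the slow path (sysfs, error
handling) pays instead of the fast path.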