On 02/16/2017 10:48 AM, Kashyap Desai wrote: >> -----Original Message----- >> From: Hannes Reinecke [mailto:hare@xxxxxxx] >> Sent: Wednesday, February 15, 2017 3:35 PM >> To: Kashyap Desai; Sreekanth Reddy >> Cc: Christoph Hellwig; Martin K. Petersen; James Bottomley; linux- >> scsi@xxxxxxxxxxxxxxx; Sathya Prakash Veerichetty; PDL-MPT-FUSIONLINUX >> Subject: Re: [PATCH 00/10] mpt3sas: full mq support >> >> On 02/15/2017 10:18 AM, Kashyap Desai wrote: >>>> >>>> >>>> Hannes, >>>> >>>> Result I have posted last time is with merge operation enabled in >>>> block layer. If I disable merge operation then I don't see much >>>> improvement with multiple hw request queues. Here is the result, >>>> >>>> fio results when nr_hw_queues=1, >>>> 4k read when numjobs=24: io=248387MB, bw=1655.1MB/s, iops=423905, >>>> runt=150003msec >>>> >>>> fio results when nr_hw_queues=24, >>>> 4k read when numjobs=24: io=263904MB, bw=1759.4MB/s, iops=450393, >>>> runt=150001msec >>> >>> Hannes - >>> >>> I worked with Sreekanth and also understand pros/cons of Patch #10. >>> " [PATCH 10/10] mpt3sas: scsi-mq interrupt steering" >>> >>> In above patch, can_queue of HBA is divided based on logic CPU, it >>> means we want to mimic as if mpt3sas HBA support multi queue >>> distributing actual resources which is single Submission H/W Queue. >>> This approach badly impact many performance areas. >>> >>> nr_hw_queues = 1 is what I observe as best performance approach since >>> it never throttle IO if sdev->queue_depth is set to HBA queue depth. >>> In case of nr_hw_queues = "CPUs" throttle IO at SCSI level since we >>> never allow more than "updated can_queue" in LLD. >>> >> True. >> And this was actually one of the things I wanted to demonstrate with this >> patchset :-) ATM blk-mq really works best when having a distinct tag space >> per port/device. As soon as the hardware provides a _shared_ tag space you >> end up with tag starvation issues as blk-mq only allows you to do a static >> split of the available tagspace. >> While this patchset demonstrates that the HBA itself _does_ benefit from >> using block-mq (especially on highly parallel loads), it also demonstrates >> that >> _block-mq_ has issues with singlethreaded loads on this HBA (or, rather, >> type of HBA, as I doubt this issue is affecting mpt3sas only). >> >>> Below code bring actual HBA can_queue very low ( Ea on 96 logical core >>> CPU new can_queue goes to 42, if HBA queue depth is 4K). It means we >>> will see lots of IO throttling in scsi mid layer due to >>> shost->can_queue reach the limit very soon if you have <fio> jobs with >> higher QD. >>> >>> if (ioc->shost->nr_hw_queues > 1) { >>> ioc->shost->nr_hw_queues = ioc->msix_vector_count; >>> ioc->shost->can_queue /= ioc->msix_vector_count; >>> } >>> I observe negative performance if I have 8 SSD drives attached to >>> Ventura (latest IT controller). 16 fio jobs at QD=128 gives ~1600K >>> IOPs and the moment I switch to nr_hw_queues = "CPUs", it gave hardly >>> ~850K IOPs. This is mainly because of host_busy stuck at very low ~169 >>> on >> my setup. >>> >> Which actually might be an issue with the way scsi is hooked into blk-mq. >> The SCSI stack is using 'can_queue' as a check for 'host_busy', ie if the >> host is >> capable of accepting more commands. >> As we're limiting can_queue (to get the per-queue command depth >> correctly) we should be using the _overall_ command depth for the >> can_queue value itself to make the host_busy check work correctly. >> >> I've attached a patch for that; can you test if it makes a difference? > Hannes - > Attached patch works fine for me. FYI - We need to set device queue depth > to can_queue as we are currently not doing in mpt3sas driver. > > With attached patch when I tried, I see ~2-3% improvement running multiple > jobs. Single job profile no difference. > > So looks like we are good to reach performance with single nr_hw_queues. > Whee, cool. > We have some patches to be send so want to know how to rebase this patch > series as few patches coming from Broadcom. Can we consider below as plan ? > Sure, can do. > - Patches from 1-7 will be reposted. Also Sreekanth will complete review on > existing patch 1-7. > - We need blk_tag support only for nr_hw_queue = 1. > > With that say, we will have many code changes/function without " > shost_use_blk_mq" check and assume it is single nr_hw_queue supported > <mpt3sas> driver. > > Ea - Below function can be simplify - just refer tag from scmd->request and > don't need check of shost_use_blk_mq + nr_hw_queue etc.. > > u16 > mpt3sas_base_get_smid_scsiio(struct MPT3SAS_ADAPTER *ioc, u8 cb_idx, > struct scsi_cmnd *scmd) > { > struct scsiio_tracker *request; > u16 smid = scmd->request->tag + 1; > ... > return request->smid; > } > > - Later we can explore if nr_hw_queue more than one really add benefit. > From current limited testing, I don't see major performance boost if we have > nr_hw_queue more than one. > Well, the _actual_ code to support mq is rather trivial, and really serves as a good testbed for scsi-mq. I would prefer to leave it in, and disable it via a module parameter. But in either case, I can rebase the patches to leave any notions of 'nr_hw_queues' to patch 8 for implementing full mq support. And we need to discuss how to handle MPI2_FUNCTION_SCSI_IO_REQUEST; the current method doesn't work with blk-mq. I really would like to see that go, especially as sg/bsg supports the same functionality ... Cheers, Hannes -- Dr. Hannes Reinecke Teamlead Storage & Networking hare@xxxxxxx +49 911 74053 688 SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton HRB 21284 (AG Nürnberg)