FYI, we have exactly this issue on a machine here running CentOS 8.3 (kernel 4.18.0-240.1.1), so presumably this happens in RHEL 8 too. The controller is an MSCC / Adaptec 3154-8i16e driving 60 x 12 TB HGST drives configured as five twelve-drive RAID-6 arrays, software-striped using md and formatted with XFS. The test software writes to the array using multiple threads in parallel. The smartpqi driver would report the controller offline within ten minutes or so, with status code 0x6100c.

I changed the driver to set 'nr_hw_queues = 1' and then tested by filling the array with random files (which took a couple of days). That completed fine, so it looks like that one-line change fixes it.

It would, of course, be helpful if this were back-ported.

— Roger

> On 3 Feb 2021, at 15:56, Don.Brace@xxxxxxxxxxxxx wrote:
>
> -----Original Message-----
> From: Martin Wilck [mailto:mwilck@xxxxxxxx]
> Subject: Re: [PATCH] scsi: scsi_host_queue_ready: increase busy count early
>
>> Confirmed my suspicions - it looks like the host is sent more commands
>> than it can handle. We would need many disks to see this issue though,
>> which you have.
>>
>> So for stable kernels, 6eb045e092ef is not in 5.4. Next is 5.10, and
>> I suppose it could be simply fixed by setting .host_tagset in the scsi
>> host template there.
>>
>> Thanks,
>> John
>> --
>> Don: Even though this works for current kernels, what would the chances
>> be of this getting back-ported to 5.9 or even further?
>>
>> Otherwise the original patch smartpqi_fix_host_qdepth_limit would
>> correct this issue for older kernels.
>
> True. However, this is 5.12 material, so we shouldn't be bothered by that
> here. For 5.5 up to 5.9, you need a workaround. But I'm unsure whether
> smartpqi_fix_host_qdepth_limit would be the solution. You could simply
> divide can_queue by nr_hw_queues, as suggested before, or, even simpler,
> set nr_hw_queues = 1.
>
> How much performance would that cost you?
>
> Don: For my HBA disk tests...
>
> Dividing can_queue / nr_hw_queues is about a 40% drop:
> ~380K - 400K IOPS.
> Setting nr_hw_queues = 1 results in a 1.5x gain in performance:
> ~980K IOPS.
> Setting host_tagset = 1:
> ~640K IOPS.
>
> So, it seems that setting nr_hw_queues = 1 results in the best performance.
>
> Is this expected? Would this also be true in the future?
>
> Thanks,
> Don Brace
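For clarity, here is a minimal sketch of the two workarounds being compared above, written as a hypothetical helper rather than actual smartpqi code; the helper name and its call site (somewhere in the probe path, after nr_hw_queues and can_queue are set and before scsi_add_host()) are assumptions on my part.

#include <linux/types.h>
#include <scsi/scsi_host.h>

/*
 * Hypothetical helper, not real smartpqi code: illustrates the two
 * workarounds discussed above for kernels without host_tagset.
 */
static void pqi_apply_queue_workaround(struct Scsi_Host *shost,
                                       bool single_queue)
{
        if (single_queue) {
                /*
                 * The one-line change tested above: expose a single hw
                 * queue, so blk-mq never has more than can_queue
                 * commands outstanding on the host.
                 */
                shost->nr_hw_queues = 1;
        } else if (shost->nr_hw_queues > 1) {
                /*
                 * The alternative suggested above: keep the hw queues
                 * but split the host depth across them, so the
                 * per-queue tag depths sum to the controller's limit.
                 */
                shost->can_queue /= shost->nr_hw_queues;
        }
}

Going by the numbers above, the single-queue variant is also the faster of the two on this hardware.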
>
> Below is my setup.
>
> ---
> [3:0:0:0]    disk    HP    EG0900FBLSK       HPD7  /dev/sdd
> [3:0:1:0]    disk    HP    EG0900FBLSK       HPD7  /dev/sde
> [3:0:2:0]    disk    HP    EG0900FBLSK       HPD7  /dev/sdf
> [3:0:3:0]    disk    HP    EH0300FBQDD       HPD5  /dev/sdg
> [3:0:4:0]    disk    HP    EG0900FDJYR       HPD4  /dev/sdh
> [3:0:5:0]    disk    HP    EG0300FCVBF       HPD9  /dev/sdi
> [3:0:6:0]    disk    HP    EG0900FBLSK       HPD7  /dev/sdj
> [3:0:7:0]    disk    HP    EG0900FBLSK       HPD7  /dev/sdk
> [3:0:8:0]    disk    HP    EG0900FBLSK       HPD7  /dev/sdl
> [3:0:9:0]    disk    HP    MO0200FBRWB       HPD9  /dev/sdm
> [3:0:10:0]   disk    HP    MM0500FBFVQ       HPD8  /dev/sdn
> [3:0:11:0]   disk    ATA   MM0500GBKAK       HPGC  /dev/sdo
> [3:0:12:0]   disk    HP    EG0900FBVFQ       HPDC  /dev/sdp
> [3:0:13:0]   disk    HP    VO006400JWZJT     HP00  /dev/sdq
> [3:0:14:0]   disk    HP    VO015360JWZJN     HP00  /dev/sdr
> [3:0:15:0]   enclosu HP    D3700             5.04  -
> [3:0:16:0]   enclosu HP    D3700             5.04  -
> [3:0:17:0]   enclosu HPE   Smart Adapter     3.00  -
> [3:1:0:0]    disk    HPE   LOGICAL VOLUME    3.00  /dev/sds
> [3:2:0:0]    storage HPE   P408e-p SR Gen10  3.00  -
> -----
> [global]
> ioengine=libaio
> ; rw=randwrite
> ; percentage_random=40
> rw=write
> size=100g
> bs=4k
> direct=1
> ramp_time=15
> ; filename=/mnt/fio_test
> ; cpus_allowed=0-27
> iodepth=4096
>
> [/dev/sdd]
> [/dev/sde]
> [/dev/sdf]
> [/dev/sdg]
> [/dev/sdh]
> [/dev/sdi]
> [/dev/sdj]
> [/dev/sdk]
> [/dev/sdl]
> [/dev/sdm]
> [/dev/sdn]
> [/dev/sdo]
> [/dev/sdp]
> [/dev/sdq]
> [/dev/sdr]
>
> Distribution kernels would be yet another issue; distros can backport
> host_tagset and get rid of the issue.
>
> Regards,
> Martin
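For the 5.10+ route mentioned above (host_tagset, which distros could backport), the driver-side change amounts to marking the host as using one host-wide tag set, so blk-mq shares can_queue tags across all hw queues instead of giving each queue its own. A minimal sketch, assuming it is applied in the host setup path before scsi_add_host(); the helper name is illustrative, not the actual smartpqi patch.

#include <scsi/scsi_host.h>

/*
 * Illustrative only: with host_tagset set (supported since 5.10),
 * blk-mq shares a single can_queue-sized tag space across all hw
 * queues, so the controller can no longer be oversubscribed.
 */
static void pqi_use_host_wide_tags(struct Scsi_Host *shost)
{
        shost->host_tagset = 1;
}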