----- Original Message -----
> From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> To: "Hannes Reinecke" <hare@xxxxxxx>
> Cc: "Linux SCSI Mailinglist" <linux-scsi@xxxxxxxxxxxxxxx>, fcoe-devel@xxxxxxxxxxxxx, "Curtis Taylor (cjt@xxxxxxxxxx)" <cjt@xxxxxxxxxx>, "Bud Brown" <bubrown@xxxxxxxxxx>
> Sent: Saturday, October 8, 2016 3:44:16 PM
> Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
>
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > To: "Hannes Reinecke" <hare@xxxxxxx>
> > Cc: "Linux SCSI Mailinglist" <linux-scsi@xxxxxxxxxxxxxxx>, fcoe-devel@xxxxxxxxxxxxx, "Curtis Taylor (cjt@xxxxxxxxxx)" <cjt@xxxxxxxxxx>, "Bud Brown" <bubrown@xxxxxxxxxx>
> > Sent: Saturday, October 8, 2016 1:53:01 PM
> > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
> >
> > ----- Original Message -----
> > > From: "Hannes Reinecke" <hare@xxxxxxx>
> > > To: "Laurence Oberman" <loberman@xxxxxxxxxx>, "Linux SCSI Mailinglist" <linux-scsi@xxxxxxxxxxxxxxx>, fcoe-devel@xxxxxxxxxxxxx
> > > Cc: "Curtis Taylor (cjt@xxxxxxxxxx)" <cjt@xxxxxxxxxx>, "Bud Brown" <bubrown@xxxxxxxxxx>
> > > Sent: Saturday, October 8, 2016 1:35:19 PM
> > > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
> > >
> > > On 10/08/2016 02:57 PM, Laurence Oberman wrote:
> > > > Hello
> > > >
> > > > This has been a tough problem to chase down, but it was finally reproduced.
> > > > The issue is apparent on RHEL kernels and upstream, so reporting it here is justified.
> > > >
> > > > It's out there, and some may not even be aware it is happening, other than seeing
> > > > very slow performance with ixgbe and software FCoE on large configurations.
> > > >
> > > > The upstream kernel used for reproducing is 4.8.0.
> > > >
> > > > I/O performance was noted to be severely impacted on a large NUMA test system
> > > > (64 CPUs, 4 NUMA nodes) running the software FCoE stack with Intel ixgbe
> > > > interfaces.
> > > > After capturing blktraces we saw that for every I/O there was at least one
> > > > blk_requeue_request, and sometimes hundreds or more.
> > > > This resulted in IOPS rates that were marginal at best, with queuing and high
> > > > wait times.
> > > > After narrowing this down with systemtap and trace-cmd we added further
> > > > debug, and it was apparent this was due to SCSI_MLQUEUE_HOST_BUSY being
> > > > returned.
> > > > So I/O passes, but very slowly, as it is constantly having to be requeued.
> > > >
> > > > The identical configuration in our lab with a single NUMA node and 4 CPUs
> > > > does not see this issue at all.
> > > > The same large system that reproduces this was booted with numa=off and
> > > > still sees the issue.
> > > >
> > > Have you tested with my FCoE fixes?
> > > I've done quite a few fixes for libfc/fcoe, and it would be nice to see
> > > how the patches behave with this setup.
> > >
> > > > The flow is as follows:
> > > >
> > > > From within fc_queuecommand,
> > > > fc_fcp_pkt_send() calls fc_fcp_cmd_send(), which calls
> > > > tt.exch_seq_send(), which calls fc_exch_seq_send().
> > > >
> > > > This fails and returns NULL in fc_exch_alloc(), as the list traversal never
> > > > creates a match.
> > > >
> > > > static struct fc_seq *fc_exch_seq_send(struct fc_lport *lport,
> > > >                                        struct fc_frame *fp,
> > > >                                        void (*resp)(struct fc_seq *,
> > > >                                                     struct fc_frame *fp,
> > > >                                                     void *arg),
> > > >                                        void (*destructor)(struct fc_seq *,
> > > >                                                           void *),
> > > >                                        void *arg, u32 timer_msec)
> > > > {
> > > >         struct fc_exch *ep;
> > > >         struct fc_seq *sp = NULL;
> > > >         struct fc_frame_header *fh;
> > > >         struct fc_fcp_pkt *fsp = NULL;
> > > >         int rc = 1;
> > > >
> > > >         ep = fc_exch_alloc(lport, fp);          ***** Called here and fails
> > > >         if (!ep) {
> > > >                 fc_frame_free(fp);
> > > >                 printk("RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = %p\n", ep);
> > > >                 return NULL;
> > > >         }
> > > >         ..
> > > >         ..
> > > > }
> > > >
> > > > /**
> > > >  * fc_exch_alloc() - Allocate an exchange from an EM on a
> > > >  *                   local port's list of EMs.
> > > >  * @lport: The local port that will own the exchange
> > > >  * @fp: The FC frame that the exchange will be for
> > > >  *
> > > >  * This function walks the list of exchange manager (EM)
> > > >  * anchors to select an EM for a new exchange allocation. The
> > > >  * EM is selected when a NULL match function pointer is encountered
> > > >  * or when a call to a match function returns true.
> > > >  */
> > > > static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
> > > >                                             struct fc_frame *fp)
> > > > {
> > > >         struct fc_exch_mgr_anchor *ema;
> > > >
> > > >         list_for_each_entry(ema, &lport->ema_list, ema_list)
> > > >                 if (!ema->match || ema->match(fp))
> > > >                         return fc_exch_em_alloc(lport, ema->mp);
> > > >         return NULL;                            ***** Never matches, so returns NULL
> > > > }
> > > >
> > > > RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
> > > > RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
> > > > RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
> > > > RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1
> > > >
> > > > RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1
> > > >
> > > > I am trying to get my head around why a large multi-node system sees this
> > > > issue even with NUMA disabled.
> > > > Has anybody seen this, or is anyone aware of it on similar configurations (using
> > > > fc_queuecommand)?
> > > >
> > > > I am continuing to add debug to narrow this down.
> > > >
> > > You might actually be hitting a limitation in the exchange manager code.
> > > The libfc exchange manager tries to be really clever and will assign a
> > > per-cpu exchange manager (probably to increase locality). However, we
> > > only have a limited number of exchanges, so on large systems we might
> > > actually run into an exchange starvation problem, where we have in theory
> > > enough free exchanges, but none for the submitting cpu.
> > >
> > > (Personally, I think the exchange manager code is in urgent need of reworking.
> > > It should be replaced by the sbitmap code from Omar.)
> > >
> > > Do check how many free exchanges are actually present for the stalling
> > > CPU; it might be that you run into a starvation issue.
> > >
> > > Cheers,
> > >
> > > Hannes
> > > --
> > > Dr. Hannes Reinecke                   zSeries & Storage
> > > hare@xxxxxxx                          +49 911 74053 688
> > > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > > GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
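As a concrete form of the per-pool check suggested above, something like the following hypothetical debug-only helper could be used. It is a sketch, not anything in mainline, and it assumes it is dropped into fc_exch.c, where struct fc_exch_pool, struct fc_exch_mgr and fc_exch_ptr_get() are in scope:

static u16 fc_exch_pool_in_use(struct fc_exch_mgr *mp, unsigned int cpu)
{
        struct fc_exch_pool *pool = per_cpu_ptr(mp->pool, cpu);
        u16 index, used = 0;

        /* Walk every XID slot this CPU's pool owns and count the busy ones. */
        spin_lock_bh(&pool->lock);
        for (index = 0; index <= mp->pool_max_index; index++)
                if (fc_exch_ptr_get(pool, index))
                        used++;
        spin_unlock_bh(&pool->lock);

        return used;
}

If I read the err: path of fc_exch_em_alloc() right, it drops pool->lock before returning, so a call such as

        printk(KERN_INFO "fc_exch: cpu %u pool has %u of %u XIDs in use\n",
               cpu, fc_exch_pool_in_use(mp, cpu), mp->pool_max_index + 1);

placed there should show whether the submitting CPU's pool really is exhausted while the other pools still have free exchanges.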
> >
> > Hi Hannes,
> > Thanks for responding.
> >
> > I am adding additional debug as I type this.
> >
> > I am using the latest linux-next; I assume your latest FCoE fixes are not in there yet.
> > What is puzzling here is that an identical kernel with 1 NUMA node and only 8GB of
> > memory does not see this.
> > Surely if I were running out of exchanges that would show up on the smaller
> > system as well.
> > I am able to get to 1500 IOPS with the same I/O exerciser on the smaller
> > system with ZERO blk_requeue_request() calls.
> > Again, same kernel, same ixgbe, same FCoE switch, etc.
> >
> > I traced those calls specifically at first because we saw them in the blktrace.
> >
> > I don't understand the match handling going on in the list traversal here
> > very well; I am still trying to understand the code flow.
> > The bool match() I also cannot figure out from the code.
> > It runs fc_exch_em_alloc() if either the match pointer is NULL or match(fp) returns true.
> > I can't find the actual code for the match; I will have to get a vmcore to find it.
> >
> > static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
> >                                             struct fc_frame *fp)
> > {
> >         struct fc_exch_mgr_anchor *ema;
> >
> >         list_for_each_entry(ema, &lport->ema_list, ema_list)
> >                 if (!ema->match || ema->match(fp))
> >                         return fc_exch_em_alloc(lport, ema->mp);
> >         return NULL;                            ***** Never matches, so returns NULL
> > }
> >
> > Will reply after some finer debug has been added.
> >
> > Again, this is specific to fc_queuecommand and software FCoE; it is not an issue with the
> > queuecommand in the F/C HBA templates, for example lpfc or qla2xxx, and it is also not
> > applicable to full-offload FCoE like the Emulex cards.
> > Thanks
> > Laurence
>
> Hi Hannes
>
> Replying to my own prior message.
> Added the additional debug:
>
> RHDEBUG: in fc_exch_em_alloc: returning NULL in err: path jumped from allocate new exch from pool because index == pool->next_index
> RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
> RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
> RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
> RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1
> RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1
>
> So we are actually failing in fc_exch_em_alloc() with index == pool->next_index,
> not in the list traversal as I originally thought.
>
> This then seems to match what you said: we are running out of exchanges.
>
> During my testing, if I start multiple dd's and ramp them up, for example 100
> parallel dd's with 64 CPUs, I see this.
> On the 4 CPU system it does not happen.
>
> /**
>  * fc_exch_em_alloc() - Allocate an exchange from a specified EM.
>  * @lport: The local port that the exchange is for
>  * @mp: The exchange manager that will allocate the exchange
>  *
>  * Returns pointer to allocated fc_exch with exch lock held.
>  */
> static struct fc_exch *fc_exch_em_alloc(struct fc_lport *lport,
>                                         struct fc_exch_mgr *mp)
> {
>         struct fc_exch *ep;
>         unsigned int cpu;
>         u16 index;
>         struct fc_exch_pool *pool;
>
>         /* allocate memory for exchange */
>         ep = mempool_alloc(mp->ep_pool, GFP_ATOMIC);
>         if (!ep) {
>                 atomic_inc(&mp->stats.no_free_exch);
>                 goto out;
>         }
>         memset(ep, 0, sizeof(*ep));
>
>         cpu = get_cpu();
>         pool = per_cpu_ptr(mp->pool, cpu);
>         spin_lock_bh(&pool->lock);
>         put_cpu();
>
>         /* peek cache of free slot */
>         if (pool->left != FC_XID_UNKNOWN) {
>                 index = pool->left;
>                 pool->left = FC_XID_UNKNOWN;
>                 goto hit;
>         }
>         if (pool->right != FC_XID_UNKNOWN) {
>                 index = pool->right;
>                 pool->right = FC_XID_UNKNOWN;
>                 goto hit;
>         }
>
>         index = pool->next_index;
>         /* allocate new exch from pool */
>         while (fc_exch_ptr_get(pool, index)) {
>                 index = index == mp->pool_max_index ? 0 : index + 1;
>                 if (index == pool->next_index)
>                         goto err;
>
> I will apply your latest FCoE patches; can you provide a link to your tree?
>
> Thanks
> Laurence
>

Replying to my own message.

We fail when index=0 and pool->next_index=0:

RHDEBUG: in fc_exch_em_alloc: index=0 pool->next_index=0
RHDEBUG: in fc_exch_em_alloc: returning NULL in err: path jumped from allocate new exch from pool because index == pool->next_index
RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1

Some additional information.

This does not happen with F/C adapters driving the same load to the same array via
lpfc and qla2xxx. However, we know that those have their own queuecommand in their
scsi_host_template.

I capped the number of CPUs to 4 and, running enough parallel dd's, I can still trigger this.

[root@fcoe-test-rhel6 ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 147445 MB
node 0 free: 143676 MB
node distances:
node   0
  0:  10

It's still a puzzle why the smaller system in my lab never sees this with the identical
configuration.

Running

for i in `seq 1 30`; do dd if=/dev/sdt of=/dev/null bs=512k iflag=direct count=10 & done    *** Does not trigger

Incrementing to 40

for i in `seq 1 40`; do dd if=/dev/sdt of=/dev/null bs=512k iflag=direct count=10 & done    *** Triggers the condition

I suspect we have many configurations out there running ixgbe and the software FCoE
stack that are not even aware they are seeing this issue.
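For what it's worth, here is a back-of-the-envelope look at how the XID space gets split per CPU, as I read fc_setup_exch_mgr() and fc_exch_mgr_alloc() in 4.8, and assuming fcoe's default XID range of 0x0000-0x0FFF; both of those assumptions are worth double-checking against the source. This is a stand-alone userspace toy, not kernel code:

#include <stdio.h>

/* Toy stand-in for the kernel's roundup_pow_of_two(). */
static unsigned int roundup_pow2(unsigned int n)
{
        unsigned int p = 1;

        while (p < n)
                p <<= 1;
        return p;
}

int main(void)
{
        /* Assumed FCOE_MIN_XID/FCOE_MAX_XID defaults from drivers/scsi/fcoe/fcoe.h. */
        const unsigned int min_xid = 0x0000, max_xid = 0x0fff;
        const unsigned int cpus[] = { 4, 64 };
        unsigned int i;

        for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
                /* fc_cpu_mask + 1 is roundup_pow_of_two(nr_cpu_ids). */
                unsigned int pools = roundup_pow2(cpus[i]);
                /* Mirrors pool_exch_range in fc_exch_mgr_alloc(). */
                unsigned int per_pool = (max_xid - min_xid + 1) / pools;

                printf("nr_cpu_ids=%2u: %2u per-cpu pools, %4u XIDs each\n",
                       cpus[i], pools, per_pool);
        }
        return 0;
}

If that reading is right, the 64-way box leaves only around 64 exchanges for whichever CPU fc_queuecommand happens to run on, versus around 1024 on the 4-way box, so 40 parallel direct-I/O dd's piling onto a few CPUs could exhaust a pool long before the total XID space is anywhere near used up. It might also explain why capping the online CPU count does not help, if the split is based on nr_cpu_ids (possible CPUs) rather than on how many are online; I still need to confirm that in the source.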