----- Original Message -----
> From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> To: "Hannes Reinecke" <hare@xxxxxxx>
> Cc: "Linux SCSI Mailinglist" <linux-scsi@xxxxxxxxxxxxxxx>, fcoe-devel@xxxxxxxxxxxxx, "Curtis Taylor (cjt@xxxxxxxxxx)" <cjt@xxxxxxxxxx>, "Bud Brown" <bubrown@xxxxxxxxxx>
> Sent: Saturday, October 8, 2016 3:44:16 PM
> Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
>
> ----- Original Message -----
> > From: "Laurence Oberman" <loberman@xxxxxxxxxx>
> > To: "Hannes Reinecke" <hare@xxxxxxx>
> > Cc: "Linux SCSI Mailinglist" <linux-scsi@xxxxxxxxxxxxxxx>, fcoe-devel@xxxxxxxxxxxxx, "Curtis Taylor (cjt@xxxxxxxxxx)" <cjt@xxxxxxxxxx>, "Bud Brown" <bubrown@xxxxxxxxxx>
> > Sent: Saturday, October 8, 2016 1:53:01 PM
> > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
> >
> > ----- Original Message -----
> > > From: "Hannes Reinecke" <hare@xxxxxxx>
> > > To: "Laurence Oberman" <loberman@xxxxxxxxxx>, "Linux SCSI Mailinglist" <linux-scsi@xxxxxxxxxxxxxxx>, fcoe-devel@xxxxxxxxxxxxx
> > > Cc: "Curtis Taylor (cjt@xxxxxxxxxx)" <cjt@xxxxxxxxxx>, "Bud Brown" <bubrown@xxxxxxxxxx>
> > > Sent: Saturday, October 8, 2016 1:35:19 PM
> > > Subject: Re: [Open-FCoE] Issue with fc_exch_alloc failing initiated by fc_queuecommand on NUMA or large configurations with Intel ixgbe running FCOE
> > >
> > > On 10/08/2016 02:57 PM, Laurence Oberman wrote:
> > > > Hello
> > > >
> > > > This has been a tough problem to chase down, but it was finally reproduced.
> > > > The issue is apparent on RHEL kernels and upstream, so reporting it here is justified.
> > > >
> > > > It's out there, and some may not even be aware it is happening, other than seeing
> > > > very slow performance with ixgbe and software FCoE on large configurations.
> > > >
> > > > The upstream kernel used for reproducing is 4.8.0.
> > > >
> > > > I/O performance was noted to be severely impacted on a large NUMA test system
> > > > (64 CPUs, 4 NUMA nodes) running the software FCoE stack with Intel ixgbe
> > > > interfaces.
> > > > After capturing blktraces we saw that for every I/O there was at least one
> > > > blk_requeue_request, and sometimes hundreds or more.
> > > > This resulted in IOPS rates that were marginal at best, with queuing and high
> > > > wait times.
> > > > After narrowing this down with systemtap and trace-cmd we added further
> > > > debug, and it was apparent this was due to SCSI_MLQUEUE_HOST_BUSY being
> > > > returned.
> > > > So I/O passes, but very slowly, as it is constantly having to be requeued.
> > > >
> > > > The identical configuration in our lab with a single NUMA node and 4 CPUs
> > > > does not see this issue at all.
> > > > The same large system that reproduces this was booted with numa=off and
> > > > still sees the issue.
> > > >
> > > Have you tested with my FCoE fixes?
> > > I've done quite a few fixes for libfc/fcoe, and it would be nice to see
> > > how the patches behave with this setup.
> > >
> > > > The flow is as follows:
> > > >
> > > > From within fc_queuecommand,
> > > > fc_fcp_pkt_send() calls fc_fcp_cmd_send(), which calls
> > > > tt.exch_seq_send(), which calls fc_exch_seq_send().
> > > >
> > > > This fails and returns NULL in fc_exch_alloc(), as the list traversal never
> > > > creates a match.
> > > >
> > > > static struct fc_seq *fc_exch_seq_send(struct fc_lport *lport,
> > > >                                        struct fc_frame *fp,
> > > >                                        void (*resp)(struct fc_seq *,
> > > >                                                     struct fc_frame *fp,
> > > >                                                     void *arg),
> > > >                                        void (*destructor)(struct fc_seq *,
> > > >                                                           void *),
> > > >                                        void *arg, u32 timer_msec)
> > > > {
> > > >         struct fc_exch *ep;
> > > >         struct fc_seq *sp = NULL;
> > > >         struct fc_frame_header *fh;
> > > >         struct fc_fcp_pkt *fsp = NULL;
> > > >         int rc = 1;
> > > >
> > > >         ep = fc_exch_alloc(lport, fp);          ***** Called here and fails
> > > >         if (!ep) {
> > > >                 fc_frame_free(fp);
> > > >                 printk("RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = %p\n", ep);
> > > >                 return NULL;
> > > >         }
> > > >         ..
> > > >         ..
> > > > }
> > > >
> > > > /**
> > > >  * fc_exch_alloc() - Allocate an exchange from an EM on a
> > > >  *                   local port's list of EMs.
> > > >  * @lport: The local port that will own the exchange
> > > >  * @fp: The FC frame that the exchange will be for
> > > >  *
> > > >  * This function walks the list of exchange manager (EM)
> > > >  * anchors to select an EM for a new exchange allocation. The
> > > >  * EM is selected when a NULL match function pointer is encountered
> > > >  * or when a call to a match function returns true.
> > > >  */
> > > > static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
> > > >                                             struct fc_frame *fp)
> > > > {
> > > >         struct fc_exch_mgr_anchor *ema;
> > > >
> > > >         list_for_each_entry(ema, &lport->ema_list, ema_list)
> > > >                 if (!ema->match || ema->match(fp))
> > > >                         return fc_exch_em_alloc(lport, ema->mp);
> > > >         return NULL;                            ***** Never matches, so returns NULL
> > > > }
> > > >
> > > > RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
> > > > RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
> > > > RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
> > > > RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1
> > > >
> > > > RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1
> > > >
> > > > I am trying to get my head around why a large multi-node system sees this
> > > > issue even with NUMA disabled.
> > > > Has anybody seen this, or is anyone aware of it on similar configurations (using
> > > > fc_queuecommand)?
> > > >
> > > > I am continuing to add debug to narrow this down.
> > > >
> > > You might actually be hitting a limitation in the exchange manager code.
> > > The libfc exchange manager tries to be really clever and will assign a
> > > per-cpu exchange manager (probably to increase locality). However, we
> > > only have a limited number of exchanges, so on large systems we might
> > > actually run into an exchange starvation problem, where we have in theory
> > > enough free exchanges, but none for the submitting cpu.
> > >
> > > (Personally, I think the exchange manager code is in urgent need of reworking.
> > > It should be replaced by the sbitmap code from Omar.)
> > >
> > > Do check how many free exchanges are actually present for the stalling
> > > CPU; it might be that you run into a starvation issue.
> > >
> > > Cheers,
> > >
> > > Hannes
> > > --
> > > Dr. Hannes Reinecke                   zSeries & Storage
> > > hare@xxxxxxx                          +49 911 74053 688
> > > SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > > GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)
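As a concrete form of the per-pool check suggested above, something like the following hypothetical debug-only helper could be used. It is a sketch, not anything in mainline, and it assumes it is dropped into fc_exch.c, where struct fc_exch_pool, struct fc_exch_mgr and fc_exch_ptr_get() are in scope:

static u16 fc_exch_pool_in_use(struct fc_exch_mgr *mp, unsigned int cpu)
{
        struct fc_exch_pool *pool = per_cpu_ptr(mp->pool, cpu);
        u16 index, used = 0;

        /* Walk every XID slot this CPU's pool owns and count the busy ones. */
        spin_lock_bh(&pool->lock);
        for (index = 0; index <= mp->pool_max_index; index++)
                if (fc_exch_ptr_get(pool, index))
                        used++;
        spin_unlock_bh(&pool->lock);

        return used;
}

If I read the err: path of fc_exch_em_alloc() right, it drops pool->lock before returning, so a call such as

        printk(KERN_INFO "fc_exch: cpu %u pool has %u of %u XIDs in use\n",
               cpu, fc_exch_pool_in_use(mp, cpu), mp->pool_max_index + 1);

placed there should show whether the submitting CPU's pool really is exhausted while the other pools still have free exchanges.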
> >
> > Hi Hannes,
> > Thanks for responding.
> >
> > I am adding additional debug as I type this.
> >
> > I am using the latest linux-next; I assume your latest FCoE fixes are not in there yet.
> > What is puzzling here is that an identical kernel with 1 NUMA node and only 8GB of
> > memory does not see this.
> > Surely if I were running out of exchanges that would show up on the smaller
> > system as well.
> > I am able to get to 1500 IOPS with the same I/O exerciser on the smaller
> > system with ZERO blk_requeue_request() calls.
> > Again, same kernel, same ixgbe, same FCoE switch, etc.
> >
> > I traced those calls specifically at first because we saw them in the blktrace.
> >
> > I don't understand the match handling going on in the list traversal here
> > very well; I am still trying to understand the code flow.
> > The bool match() I also cannot figure out from the code.
> > It runs fc_exch_em_alloc() if either the match pointer is NULL or match(fp) returns true.
> > I can't find the actual code for the match; I will have to get a vmcore to find it.
> >
> > static inline struct fc_exch *fc_exch_alloc(struct fc_lport *lport,
> >                                             struct fc_frame *fp)
> > {
> >         struct fc_exch_mgr_anchor *ema;
> >
> >         list_for_each_entry(ema, &lport->ema_list, ema_list)
> >                 if (!ema->match || ema->match(fp))
> >                         return fc_exch_em_alloc(lport, ema->mp);
> >         return NULL;                            ***** Never matches, so returns NULL
> > }
> >
> > Will reply after some finer debug has been added.
> >
> > Again, this is specific to fc_queuecommand and software FCoE; it is not an issue with the
> > queuecommand in the F/C HBA templates, for example lpfc or qla2xxx, and it is also not
> > applicable to full-offload FCoE like the Emulex cards.
> > Thanks
> > Laurence
>
> Hi Hannes
>
> Replying to my own prior message.
> Added the additional debug:
>
> RHDEBUG: in fc_exch_em_alloc: returning NULL in err: path jumped from allocate new exch from pool because index == pool->next_index
> RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
> RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
> RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
> RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1
> RHDEBUG: We hit SCSI_MLQUEUE_HOST_BUSY in fc_queuecommand with rval in fc_fcp_pkt_send=-1
>
> So we are actually failing in fc_exch_em_alloc() with index == pool->next_index,
> not in the list traversal as I originally thought.
>
> This then seems to match what you said: we are running out of exchanges.
>
> During my testing, if I start multiple dd's and ramp them up, for example 100
> parallel dd's with 64 CPUs, I see this.
> On the 4 CPU system it does not happen.
>
> /**
>  * fc_exch_em_alloc() - Allocate an exchange from a specified EM.
>  * @lport: The local port that the exchange is for
>  * @mp: The exchange manager that will allocate the exchange
>  *
>  * Returns pointer to allocated fc_exch with exch lock held.
>  */
> static struct fc_exch *fc_exch_em_alloc(struct fc_lport *lport,
>                                         struct fc_exch_mgr *mp)
> {
>         struct fc_exch *ep;
>         unsigned int cpu;
>         u16 index;
>         struct fc_exch_pool *pool;
>
>         /* allocate memory for exchange */
>         ep = mempool_alloc(mp->ep_pool, GFP_ATOMIC);
>         if (!ep) {
>                 atomic_inc(&mp->stats.no_free_exch);
>                 goto out;
>         }
>         memset(ep, 0, sizeof(*ep));
>
>         cpu = get_cpu();
>         pool = per_cpu_ptr(mp->pool, cpu);
>         spin_lock_bh(&pool->lock);
>         put_cpu();
>
>         /* peek cache of free slot */
>         if (pool->left != FC_XID_UNKNOWN) {
>                 index = pool->left;
>                 pool->left = FC_XID_UNKNOWN;
>                 goto hit;
>         }
>         if (pool->right != FC_XID_UNKNOWN) {
>                 index = pool->right;
>                 pool->right = FC_XID_UNKNOWN;
>                 goto hit;
>         }
>
>         index = pool->next_index;
>         /* allocate new exch from pool */
>         while (fc_exch_ptr_get(pool, index)) {
>                 index = index == mp->pool_max_index ? 0 : index + 1;
>                 if (index == pool->next_index)
>                         goto err;
>
> I will apply your latest FCoE patches; can you provide a link to your tree?
>
> Thanks
> Laurence
>

Replying to my own message.

We fail when index=0 and pool->next_index=0:

RHDEBUG: in fc_exch_em_alloc: index=0 pool->next_index=0
RHDEBUG: in fc_exch_em_alloc: returning NULL in err: path jumped from allocate new exch from pool because index == pool->next_index
RHDEBUG: In fc_exch_seq_send returned NULL because !ep with ep = (null)
RHDEBUG: rc -1 with !seq = (null) after calling tt.exch_seq_send within fc_fcp_cmd_send
RHDEBUG: rc non zero in :unlock within fc_fcp_cmd_send = -1
RHDEBUG: In fc_fcp_pkt_send, we returned from rc = lport->tt.fcp_cmd_send with rc = -1

Some additional information.

This does not happen with F/C adapters driving the same load to the same array via
lpfc and qla2xxx. However, we know that those have their own queuecommand in their
scsi_host_template.

I capped the number of CPUs to 4 and, running enough parallel dd's, I can still trigger this.

[root@fcoe-test-rhel6 ~]# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3
node 0 size: 147445 MB
node 0 free: 143676 MB
node distances:
node   0
  0:  10

It's still a puzzle why the smaller system in my lab never sees this with the identical
configuration.

Running

for i in `seq 1 30`; do dd if=/dev/sdt of=/dev/null bs=512k iflag=direct count=10 & done    *** Does not trigger

Incrementing to 40

for i in `seq 1 40`; do dd if=/dev/sdt of=/dev/null bs=512k iflag=direct count=10 & done    *** Triggers the condition

I suspect we have many configurations out there running ixgbe and the software FCoE
stack that are not even aware they are seeing this issue.
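For what it's worth, here is a back-of-the-envelope look at how the XID space gets split per CPU, as I read fc_setup_exch_mgr() and fc_exch_mgr_alloc() in 4.8, and assuming fcoe's default XID range of 0x0000-0x0FFF; both of those assumptions are worth double-checking against the source. This is a stand-alone userspace toy, not kernel code:

#include <stdio.h>

/* Toy stand-in for the kernel's roundup_pow_of_two(). */
static unsigned int roundup_pow2(unsigned int n)
{
        unsigned int p = 1;

        while (p < n)
                p <<= 1;
        return p;
}

int main(void)
{
        /* Assumed FCOE_MIN_XID/FCOE_MAX_XID defaults from drivers/scsi/fcoe/fcoe.h. */
        const unsigned int min_xid = 0x0000, max_xid = 0x0fff;
        const unsigned int cpus[] = { 4, 64 };
        unsigned int i;

        for (i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
                /* fc_cpu_mask + 1 is roundup_pow_of_two(nr_cpu_ids). */
                unsigned int pools = roundup_pow2(cpus[i]);
                /* Mirrors pool_exch_range in fc_exch_mgr_alloc(). */
                unsigned int per_pool = (max_xid - min_xid + 1) / pools;

                printf("nr_cpu_ids=%2u: %2u per-cpu pools, %4u XIDs each\n",
                       cpus[i], pools, per_pool);
        }
        return 0;
}

If that reading is right, the 64-way box leaves only around 64 exchanges for whichever CPU fc_queuecommand happens to run on, versus around 1024 on the 4-way box, so 40 parallel direct-I/O dd's piling onto a few CPUs could exhaust a pool long before the total XID space is anywhere near used up. It might also explain why capping the online CPU count does not help, if the split is based on nr_cpu_ids (possible CPUs) rather than on how many are online; I still need to confirm that in the source.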