Re: CX314A WCE error: WR_FLUSH_ERR

Thanks, Doug Ledford and Tom. I've found that the QP is forcibly switched
into the Error state, which flushes the outstanding WQEs into the CQ with
WR_FLUSH_ERR status.

On 14:47 Wed 21 Aug, Doug Ledford wrote:
> On Wed, 2019-08-21 at 23:38 +0800, Liu, Changcheng wrote:
> > On 09:36 Wed 21 Aug, Tom Talpey wrote:
> > > On 8/21/2019 8:09 AM, Liu, Changcheng wrote:
> > > > Hi all,
> > > >     On one system, I frequently hit "IBV_WC_WR_FLUSH_ERR"
> > > > in the WCE (work completion element) polled from the completion
> > > > queue bound to the RQ (Receive Queue).
> > > >     Does anyone have an idea how to debug the "IBV_WC_WR_FLUSH_ERR"
> > > > problem?
> > > > 
> > > >     With a CX314A/40Gb NIC, I hit this error when using the RC
> > > > transport type with only Send operation (IBV_WR_SEND) WRs (work
> > > > requests) on the SQ (Send Queue).
> > > >     Every WR has only one SGE (scatter/gather element) and all the
> > > > SGEs on the RQ have the same size. The SGE size in an SQ WR is not
> > > > greater than the SGE size in an RQ WR.
> > > > 
> > > >    There's an explanation of IBV_WC_WR_FLUSH_ERR on page 114
> > > > of the "RDMA Aware Networks Programming User Manual":
> > > > http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
> > > >    But I still don't understand it well. How can I trigger this
> > > > error with a short demo program?
> > > >    "
> > > >      IBV_WC_WR_FLUSH_ERR
> > > >      This event is generated when an invalid remote error is
> > > > thrown when the responder detects an
> > > >      invalid request. It may be that the operation is not
> > > > supported by the request queue or there is
> > > >      insufficient buffer space to receive the request.
> > > >    "
> > > 
> > > The most common reason for a flushed work request is loss of
> > > the connection to the remote peer. This can be caused by any
> > > number of conditions.
> > Good direction. I'll debug it this way first.
> > > The second-most common is a programming error in the upper
> > > layer protocol. A shortage of posted receives on either peer,
> > > a protection error on some buffer, etc.
> > Do you mean a protection key such as the l_key/r_key isn't set correctly?
> > What kind of protection error could trigger IBV_WC_WR_FLUSH_ERR?
> 
> FLUSH_ERR is the error used whenever a queue pair goes into an error
> state and there are still WQEs posted to the queue pair.  All
> outstanding WQEs are returned with the state IBV_WC_WR_FLUSH_ERR.  This
> is how you make sure you don't lose WQEs when the QP hits an error
> state.  So, literally *anything* that can cause a QP to go into an ERROR
> state will result in all WQEs currently posted to the QP being sent back
> with this FLUSH_ERR.  FLUSH_ERR literally just means that the card is
> flushing out the QP's work queue because now that the QP is in an error
> state it can't process the WQEs and, presumably, the application needs
> to know which ones completed and which ones didn't so it knows what to
> requeue once the QP is no longer in an error state.
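
To make the flush semantics above concrete for myself, here is a minimal
Python sketch. It is a toy model, not libibverbs: ToyQP, complete_next,
enter_error and triage are hypothetical names invented for illustration,
and only the status strings mirror real ibv_wc_status names. It assumes
that the WQE which hits the error completes with its own error status and
that everything still outstanding behind it completes with
IBV_WC_WR_FLUSH_ERR, so the first completion that is neither success nor
flush points at the likely root cause.

```python
from collections import deque


class ToyQP:
    """Toy model of one work queue (hypothetical, not the verbs API)."""

    def __init__(self):
        self.state = "RTS"
        self.outstanding = deque()  # wr_ids posted but not yet completed
        self.cq = []                # (wr_id, status) in completion order

    def post(self, wr_id):
        self.outstanding.append(wr_id)

    def complete_next(self, status="IBV_WC_SUCCESS"):
        # Complete the oldest outstanding WQE with the given status.
        wr_id = self.outstanding.popleft()
        self.cq.append((wr_id, status))
        if status != "IBV_WC_SUCCESS":
            self.enter_error()

    def enter_error(self):
        # Entering the error state flushes everything still outstanding.
        self.state = "ERR"
        while self.outstanding:
            self.cq.append((self.outstanding.popleft(), "IBV_WC_WR_FLUSH_ERR"))


def triage(completions):
    """Split completions into finished work and work to requeue."""
    done, requeue, root_cause = [], [], None
    for wr_id, status in completions:
        if status == "IBV_WC_SUCCESS":
            done.append(wr_id)
            continue
        # Whether the failing WR itself is retried is an application choice;
        # this sketch requeues it together with the flushed ones.
        requeue.append(wr_id)
        if status != "IBV_WC_WR_FLUSH_ERR" and root_cause is None:
            root_cause = (wr_id, status)
    return done, requeue, root_cause


qp = ToyQP()
for wr_id in range(5):
    qp.post(wr_id)
qp.complete_next()                        # wr 0 succeeds
qp.complete_next()                        # wr 1 succeeds
qp.complete_next("IBV_WC_RETRY_EXC_ERR")  # wr 2 fails, wr 3 and 4 are flushed
done, requeue, root_cause = triage(qp.cq)
print(done, requeue, root_cause)
```

With this input the script prints [0, 1] [2, 3, 4] (2, 'IBV_WC_RETRY_EXC_ERR').
If root_cause comes back as None, every failed completion was a flush, which
suggests the QP was moved to the error state by something other than one of
these WQEs.
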
> 
> As Tom has already pointed out, all of these things will throw the queue
> pair into an error state and cause all posted WQEs to be flushed with
> the FLUSH_ERR condition:
> 
> 1) Loss of queue pair connection
> 2) Any memory permission violation (attempt to write to read only
> memory, attempt to RDMA read/write to an invalid rkey, etc)
> 3) Receipt of any post_send message without a waiting post_recv buffer
> to accept the message
> 4) Receipt of a post_send message that is too large to fit in the first
> available post_recv buffer
> 
> A common cause of this sort of thing is when you don't do proper flow
> control on the queue pair and the sending side floods the receiving side
> and runs it out of posted recv WQEs.  Although, in your case, you did
> say this was happening on the receive queue, so that implies this is
> happening on the receiving side, so if that is what's happening here,
> the process would have to be something like:
> 
> sender starts sending data (maybe without any flow control)
> 	receiver starts receiving data and refilling buffers
> 	...
> 	receiver runs totally dry of buffers and gets an incoming recv
> 	causing qp to go into error state
> 
> 	receiver then posts refill buffers to the RQ after the QP
> 	went into error state but before acknowledging the error state
> 	and shutting down the recv processing thread
> 
> 	all recv buffers posted as WQEs are flushed back to the process
> 	with FLUSH_ERR because they were posted to a QP in ERROR state
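
The sequence above can be walked through with a small Python sketch. Again
this is a toy model under stated assumptions, not the verbs API: ToyRecvQueue,
post_recv and incoming_send are hypothetical names. The "no buffer, so the QP
goes to error" step simply follows the description above; on a real RC QP the
responder first answers with an RNR NAK and the requester retries, which this
sketch skips. In real libibverbs code, one way to put a QP into the error
state for a test is ibv_modify_qp() with qp_state set to IBV_QPS_ERR; that
call is not shown here.

```python
class ToyRecvQueue:
    """Toy receive queue (hypothetical model, not libibverbs)."""

    def __init__(self):
        self.state = "RTS"
        self.posted = []  # wr_ids of receive buffers waiting for data
        self.cq = []      # (wr_id, status) in completion order

    def post_recv(self, wr_id):
        if self.state == "ERR":
            # Posting is still accepted, but the WR is flushed right away.
            self.cq.append((wr_id, "IBV_WC_WR_FLUSH_ERR"))
        else:
            self.posted.append(wr_id)

    def incoming_send(self):
        if self.posted:
            self.cq.append((self.posted.pop(0), "IBV_WC_SUCCESS"))
        else:
            # Simplification taken from the description above:
            # no buffer available, so the QP ends up in the error state.
            self.state = "ERR"


rq = ToyRecvQueue()
rq.post_recv(0)
rq.post_recv(1)
for _ in range(3):   # sender pushes three messages, only two buffers exist
    rq.incoming_send()
rq.post_recv(2)      # refill posted after the QP already went to error
rq.post_recv(3)
print(rq.state, rq.cq)
```

In this run the two original buffers complete successfully and the two refill
buffers (wr_id 2 and 3) come back with IBV_WC_WR_FLUSH_ERR, which matches
seeing the flush errors only on the CQ bound to the RQ.
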
> 
> > > If you're looking to actually trigger this error for testing,
> > > well, try one of the above. If you're trying to figure out
> > > why it's happening, that can take some digging, but not in
> > > the RDMA stack, typically.
> > Many thanks.
> > 
> > --Changcheng
> > > Tom.
> > > 
> 
> -- 
> Doug Ledford <dledford@xxxxxxxxxx>
>     GPG KeyID: B826A3330E572FDD
>     Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD




