On 09:36 Wed 21 Aug, Tom Talpey wrote:
> On 8/21/2019 8:09 AM, Liu, Changcheng wrote:
> > Hi all,
> > On one system, I frequently hit "IBV_WC_WR_FLUSH_ERR" in the WCE (work completion element) polled from the completion queue bound to the RQ (Receive Queue).
> > Does anyone have an idea how to debug the "IBV_WC_WR_FLUSH_ERR" problem?
> >
> > With a CX314A/40Gb NIC, I hit this error when using the RC transport type with only Send operation (IBV_WR_SEND) WRs (work requests) on the SQ (Send Queue).
> > Every WR has only one SGE (scatter/gather element) and all the SGEs on the RQ have the same size. The SGE size in an SQ WR is not greater than the SGE size in an RQ WR.
> >
> > There’s one explanation of IBV_WC_WR_FLUSH_ERR on page 114 of the "RDMA Aware Networks Programming User Manual" http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
> > But I still don't understand it well. How can I trigger this error with a short demo program?
> > "
> > IBV_WC_WR_FLUSH_ERR
> > This event is generated when an invalid remote error is thrown when the responder detects an
> > invalid request. It may be that the operation is not supported by the request queue or there is
> > insufficient buffer space to receive the request.
> > "
>
> The most common reason for a flushed work request is loss of
> the connection to the remote peer. This can be caused by any
> number of conditions.
Good direction. I'll debug it this way first.
>
> The second-most common is a programming error in the upper
> layer protocol. A shortage of posted receives on either peer,
> a protection error on some buffer, etc.
Do you mean that a protection key such as l_key/r_key isn't set correctly?
What kind of protection error could trigger IBV_WC_WR_FLUSH_ERR?
>
> If you're looking to actually trigger this error for testing,
> well, try one of the above. If you're trying to figure out
> why it's happening, that can take some digging, but not in
> the RDMA stack, typically.
Many thanks.
--Changcheng
>
> Tom.
>
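
One way to act on the advice above: a flush is usually a consequence rather than a cause. Once a QP enters the error state, its outstanding work requests complete with IBV_WC_WR_FLUSH_ERR, so the interesting completion is the first one that failed for some other reason. The sketch below only illustrates that triage step. The enum, struct and function names are local stand-ins invented for this example, not libibverbs definitions; a real program would fill the array from ibv_poll_cq() and compare against enum ibv_wc_status from <infiniband/verbs.h>.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the completion statuses discussed in this thread.
 * Names and values are local to this sketch (an assumption); real code
 * would use enum ibv_wc_status from <infiniband/verbs.h>. */
enum demo_wc_status {
    DEMO_WC_SUCCESS,
    DEMO_WC_LOC_PROT_ERR,      /* local protection error on a buffer */
    DEMO_WC_WR_FLUSH_ERR,      /* WR flushed because the QP is in error */
    DEMO_WC_RETRY_EXC_ERR,     /* transport retries exhausted */
    DEMO_WC_RNR_RETRY_EXC_ERR  /* peer had no receive posted */
};

/* Hypothetical, trimmed-down work completion record. */
struct demo_wc {
    unsigned long long wr_id;
    enum demo_wc_status status;
};

/* Return the index of the first completion that failed for a reason
 * other than a flush, or -1 if every failure in the batch is a flush. */
int first_root_cause(const struct demo_wc *wc, size_t n)
{
    assert(wc != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (wc[i].status != DEMO_WC_SUCCESS &&
            wc[i].status != DEMO_WC_WR_FLUSH_ERR)
            return (int)i;
    }
    return -1;
}
```

If this returns -1 for the completions polled from the receive-side CQ, the triggering error was probably reported elsewhere, for example on the send-side CQ or on the remote peer, which matches the suggestion to look at connection loss first.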