Re: bug report for rdma_rxe

On 4/22/22 16:04, Bob Pearson wrote:
> Local operations in the rdma_rxe driver are not obviously idempotent. But, the
> RC retry mechanism backs up the send queue to the point of the wqe that is
> currently being acknowledged and re-walks the sq. Each send or write operation is
> retried with the exception that the first one is truncated by the packets already
> having been acknowledged. Each read and atomic operation is resent except that
> read data already received in the first wqe is not requested. But all the
> local operations are replayed. The problem is local invalidate, which is destructive.
> For example,
> 
> sq:	some operation that times out
> 	bind mw to mr
> 	some other operation
> 	invalidate mw
> 	invalidate mr
> 
> can't be replayed because invalidating the mr makes the second bind fail.
> There are lots of other examples where things go wrong.
> 
> To make things worse, the send queue timer is never cleared and, for typical
> timeout values, goes off every few msec whether or not anything actually failed.
> 
> Bob

This looks like an unholy mess. The reason I was looking at it is that Lustre
on rxe doesn't work at the moment, and the problems were traced to retry flows (on a very
reliable network) caused by stray timeouts. We see local_invalidate_mr operations
getting retried multiple times, and not all of them succeed because the caller
is remapping the fast MR in the meantime and changing the rkey.
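
To make the replay problem concrete, here is a toy model of the send queue
quoted above. It is only a sketch and not rxe code: the State class, the
operation helpers and walk() are all made-up names, and the rules are my
simplifying assumptions (a bind needs a valid mr, an invalidate needs
something left to invalidate). It walks the queue once, then re-walks it from
the start the way a retry would.

```python
from dataclasses import dataclass


@dataclass
class State:
    """Hypothetical, simplified state of one mr and one mw."""
    mr_valid: bool = True
    mw_bound: bool = False


def send(state):
    # Sends and writes have no local side effects in this model.
    return True


def bind_mw(state):
    # Assumption: binding a mw fails once the mr has been invalidated.
    if not state.mr_valid:
        return False
    state.mw_bound = True
    return True


def invalidate_mw(state):
    # Assumption: invalidating an unbound mw is an error.
    if not state.mw_bound:
        return False
    state.mw_bound = False
    return True


def invalidate_mr(state):
    # Destructive: a second invalidate of the same mr fails.
    if not state.mr_valid:
        return False
    state.mr_valid = False
    return True


# The send queue from the example, in order.
SQ = [
    ("some operation that times out", send),
    ("bind mw to mr", bind_mw),
    ("some other operation", send),
    ("invalidate mw", invalidate_mw),
    ("invalidate mr", invalidate_mr),
]


def walk(sq, state, start=0):
    """Execute wqes from index start; return the index of the first
    failing wqe, or None if every wqe succeeds."""
    for index in range(start, len(sq)):
        _name, op = sq[index]
        if not op(state):
            return index
    return None


if __name__ == "__main__":
    state = State()
    print("first walk failed at:", walk(SQ, state))
    # A retry backs up to the first wqe and replays the local operations too.
    failed = walk(SQ, state)
    print("replay failed at:", failed, "->", SQ[failed][0])
```

In this model the first walk completes, and the replay stops at index 1, the
bind, because the mr was already invalidated by the first pass.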

Bob


