Re: bug report for rdma_rxe

Zhu Yanjun <zyjzyj2000@xxxxxxxxx> · Tue, 26 Apr 2022 11:10:06 +0800

On Tue, Apr 26, 2022 at 12:58 AM Bob Pearson <rpearsonhpe@xxxxxxxxx> wrote:
>
> On 4/24/22 19:04, Yanjun Zhu wrote:
> > On 2022/4/23 5:04, Bob Pearson wrote:
> >> Local operations in the rdma_rxe driver are not obviously idempotent. But, the
> >> RC retry mechanism backs up the send queue to the point of the wqe that is
> >> currently being acknowledged and re-walks the sq. Each send or write operation is
> >> retried with the exception that the first one is truncated by the packets already
> >> having been acknowledged. Each read and atomic operation is resent except that
> >> read data already received in the first wqe is not requested. But all the
> >> local operations are replayed. The problem is local invalidate which is destructive.
> >> For example
> >
> > Is there any example or just your analysis?
>
> I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
> highly reliable network so do not expect to see dropped or out of order packets. But they
> see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
> and verified that the timeout code in rxe has the behavior that when a new RC operation is
> sent the retry timer is modified to go off at jiffies + qp->timeout_jiffies but only if
> there is not a currently pending timer. Once set it is never cleared so it will fire
> typically a few msec later initiating a retry flow. If IO operations are frequent then
> there will be a timeout every few msec (about 20 times a second for typical timeout values.)
> o2iblnd uses fast reg MRs to write data to the target system, then uses local invalidate
> operations to invalidate the MR, increments the key portion of the rkey, resets
> the map, and then does a reg MR operation. Retry flows cause the local invalidate and reg MR
> operations to be re-executed over and over again. A single retry can cause half a dozen
> invalidate operations to be run with various rkeys, and they mostly fail because they don't
> match the current MR. This results in Lustre crashing.
>
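
A minimal sketch of the timer behavior described above, written as a toy Python model rather than taken from the rxe source. The class, the tick-based clock, and the parameter values are assumptions made for illustration; `timeout` stands in for qp->timeout_jiffies, and `clear_on_ack` is a hypothetical switch that models clearing the timer once nothing is outstanding.

```python
class RetryTimerModel:
    """Toy model of the retry timer as described in this thread (not rxe code)."""

    def __init__(self, timeout, clear_on_ack=False):
        self.timeout = timeout            # stands in for qp->timeout_jiffies
        self.clear_on_ack = clear_on_ack  # hypothetical fix, off by default
        self.expires = None               # None means no timer is pending
        self.outstanding = 0              # requests sent but not yet acknowledged
        self.spurious = 0                 # expiries with nothing outstanding
        self.genuine = 0                  # expiries with a request outstanding

    def send(self, now):
        self.outstanding += 1
        if self.expires is None:          # arm only if no timer is pending
            self.expires = now + self.timeout

    def ack(self, now):
        self.outstanding -= 1
        if self.clear_on_ack and self.outstanding == 0:
            self.expires = None           # cancel the timer when idle

    def tick(self, now):
        if self.expires is not None and now >= self.expires:
            self.expires = None           # timer fires and would start a retry flow
            if self.outstanding:
                self.genuine += 1
            else:
                self.spurious += 1


def run(clear_on_ack, steps=1000, timeout=50, period=10, latency=2):
    """Send one request every `period` ticks; each is acked `latency` ticks later."""
    qp = RetryTimerModel(timeout, clear_on_ack)
    for now in range(steps):
        qp.tick(now)
        if now % period == 0:
            qp.send(now)
        if now % period == latency:
            qp.ack(now)
    return qp.spurious, qp.genuine


if __name__ == "__main__":
    print("never cleared:", run(clear_on_ack=False))
    print("cleared on ack:", run(clear_on_ack=True))
```

With these parameters every request is acknowledged within two ticks, yet the first variant still reports spurious expiries because the timer is re-armed by the next send and never cancelled. The second variant reports none.
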
> Currently I am actually happy that the unneeded retries are happening because it makes
> testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
> after the operation is completed which would greatly reduce the number of retries. Also

Is this retry triggered by RDMA/RXE or by some application? And is the
original mr reused, or is a new one allocated?

If this retry is triggered by RDMA/RXE, and the original mr is freed
and a new one is allocated, the behavior is very similar to ODP,
although ODP is currently not supported in RXE.

Zhu Yanjun

> it will be important to figure out how the IBA intended for local invalidates and reg MRs to
> work. The way they are now they cannot be successfully retried. Also marking them as done
> and skipping them in the retry sequence does not work. (It breaks some of the blktests test
> cases.)
>
> > You know, your analysis is not always correct.
> > To prove your analysis, please show us a solid example.
> >
> > Zhu Yanjun
> >
> >>
> >> sq:    some operation that times out
> >>     bind mw to mr
> >>     some other operation
> >>     invalidate mw
> >>     invalidate mr
> >>
> >> can't be replayed because invalidating the mr makes the second bind fail.
> >> There are lots of other examples where things go wrong.
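
The sequence above can be sketched as a toy Python model to show why walking the send queue a second time fails. This is an illustration under assumptions, not rxe code: the `Resources` class, the operation functions, and `walk_sq` are hypothetical names, and the only rule modeled is that a bind needs a valid mr.

```python
class LocalOpError(Exception):
    """Raised when a local operation cannot be executed in the current state."""


class Resources:
    def __init__(self):
        self.mr_valid = True    # the mr starts out registered and valid
        self.mw_bound = False   # the mw starts out unbound


def noop(res):
    pass                        # placeholder for send/write/read operations


def bind_mw(res):
    if not res.mr_valid:        # binding requires a valid mr
        raise LocalOpError("bind mw: mr is not valid")
    res.mw_bound = True


def invalidate_mw(res):
    res.mw_bound = False


def invalidate_mr(res):
    res.mr_valid = False        # destructive: a later bind can no longer succeed


SQ = [
    ("some operation that times out", noop),
    ("bind mw to mr", bind_mw),
    ("some other operation", noop),
    ("invalidate mw", invalidate_mw),
    ("invalidate mr", invalidate_mr),
]


def walk_sq(sq, res):
    """Execute every wqe in order; return the name of the first one that fails."""
    for name, op in sq:
        try:
            op(res)
        except LocalOpError:
            return name
    return None


if __name__ == "__main__":
    res = Resources()
    print("first pass failed at:", walk_sq(SQ, res))
    print("replay failed at:", walk_sq(SQ, res))
```

The first pass completes, but the replay stops at the bind because the earlier "invalidate mr" has already taken effect and is not undone when the queue is backed up.
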
> >>
> >> To make things worse the send queue timer is never cleared and for typical
> >> timeout values goes off every few msec whether anything actually failed.
> >>
> >> Bob
> >
>