On Mon, Apr 25, 2022 at 11:58:55AM -0500, Bob Pearson wrote:
> On 4/24/22 19:04, Yanjun Zhu wrote:
> > On 2022/4/23 5:04, Bob Pearson wrote:
> >> Local operations in the rdma_rxe driver are not obviously idempotent. But, the
> >> RC retry mechanism backs up the send queue to the point of the wqe that is
> >> currently being acknowledged and re-walks the sq. Each send or write operation is
> >> retried with the exception that the first one is truncated by the packets already
> >> having been acknowledged. Each read and atomic operation is resent except that
> >> read data already received in the first wqe is not requested. But all the
> >> local operations are replayed. The problem is local invalidate which is destructive.
> >> For example
> >
> > Is there any example or just your analysis?
>
> I have a colleague at HPE who is testing Lustre/o2iblnd/rxe. They are testing over a
> highly reliable network so do not expect to see dropped or out of order packets. But they
> see multiple timeout flows. When working on rping a week ago I also saw lots of timeouts
> and verified that the timeout code in rxe has the behavior that when a new RC operation is
> sent the retry timer is modified to go off at jiffies + qp->timeout_jiffies but only if
> there is not a currently pending timer. Once set it is never cleared so it will fire
> typically a few msec later initiating a retry flow. If IO operations are frequent then
> there will be a timeout every few msec (about 20 times a second for typical timeout values.)
> o2iblnd uses fast reg MRs to write data to the target system and then local invalidate
> operations to invalidate the MR and then increments the key portion of the rkey and resets
> the map and then does a reg mr operation. Retry flows cause the local invalidate and reg MR
> operations to be re-executed over and over again. A single retry can cause a half a dozen
> invalidate operations to be run with various rkeys and they mostly fail because they don't
> match the current MR. This results in Lustre crashing.
>
> Currently I am actually happy that the unneeded retries are happening because it makes
> testing the retry code a lot easier. But eventually it would be good to clear or reset the timer
> after the operation is completed which would greatly reduce the number of retries. Also
> it will be important to figure out how the IBA intended for local invalidates and reg MRs to
> work. The way they are now they cannot be successfully retried. Also marking them as done
> and skipping them in the retry sequence does not work. (It breaks some of the blktests test
> cases.)

Local operations on a QP are not supposed to need retry because they
are not supposed to go on the network, so backing up the sq past its
current position should not re-execute any local operations until the
sq passes its actual head.

Or, stated differently, you have a head/tail pointer for local work
and a head/tail pointer for network work and the two track
independently within the defined ordering constraints.

Jason
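
The two-cursor idea described above can be sketched in C roughly as
follows. This is only an illustration under stated assumptions, not rxe
code: struct sq, struct wqe, req_index, local_index, sq_process() and
sq_retry() are all hypothetical names, the queue is a flat array rather
than a ring, and fences and the other ordering constraints are not
modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical wqe kinds: one that goes on the wire, one that is purely local. */
enum wqe_kind { WQE_SEND, WQE_LOCAL_INV };

struct wqe {
	enum wqe_kind kind;
	int local_runs;	/* how many times the local operation was executed */
	int net_sends;	/* how many times the wqe was (re)transmitted */
};

/* Hypothetical send queue with two independently tracked cursors. */
struct sq {
	struct wqe *ring;
	size_t count;		/* number of posted wqes */
	size_t req_index;	/* network cursor: next wqe to (re)transmit */
	size_t local_index;	/* local cursor: wqes below it already had
				 * their local work executed */
};

/* Walk the queue from the network cursor up to the last posted wqe. */
static void sq_process(struct sq *q)
{
	while (q->req_index < q->count) {
		struct wqe *w = &q->ring[q->req_index];

		if (w->kind == WQE_LOCAL_INV) {
			/* Execute local work only when the walk has reached
			 * the local cursor; during a replay it is skipped. */
			if (q->req_index >= q->local_index)
				w->local_runs++;
		} else {
			w->net_sends++;	/* network work is resent on retry */
		}

		q->req_index++;
		if (q->req_index > q->local_index)
			q->local_index = q->req_index;
	}
}

/* Retry: back up only the network cursor; the local cursor stays put. */
static void sq_retry(struct sq *q, size_t first_unacked)
{
	assert(first_unacked <= q->req_index);
	q->req_index = first_unacked;
}
```

With a queue of send, local invalidate, send, calling sq_process(),
then sq_retry(&q, 0), then sq_process() again transmits both sends
twice but runs the local invalidate only once, because local_index is
never moved backwards by the retry.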