Jason,

There has been a lot of chatter about FMRs in rxe recently, and I am also trying to help out at home with the folks who are trying to run Lustre on rxe. The last fix in this area split the state in an FMR into two, with separate rkeys and memory maps, so that apps can pipeline the preparation of IO with doing IO. However, I am convinced that the current design only works by accident when it works at all. The thing that really makes a hash of it is retries. Unfortunately, the documentation on all of this is almost nonexistent.

Lustre (actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue, with a mixture of local and remote writes accessing these MRs interleaved with REG_MR and LOCAL_INV work requests. When a packet gets dropped from a WQE deep in the send queue, the result is that nothing works at all. We have a workaround that fences all the local operations, which more or less works but will have bad performance. The maps used in FMRs have fairly short lifetimes, but definitely longer than we can support today.

I am trying to work out the semantics of everything.

IBA view of FMRs:

    verb: ib_alloc_mr(pd, max_num_sg) - creates an empty MR object
        roughly "Allocate L_Key"
    verb: ib_dereg_mr(mr) - destroys the MR object
    verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset) - builds a map for the MR
        roughly "(Re)Register Physical Memory Region"
    verb: ib_update_fast_reg_key(mr, newkey) - updates the key portion of the l/rkey
    send_wr: IB_WR_REG_MR(mr, key) - moves the MR from FREE to VALID and updates
        the key portion of the l/rkey to key
        roughly "Fast Register Physical Memory Region"
    send_wr: IB_WR_LOCAL_INV(invalidate_rkey) - invalidates a local MR; moves the MR to FREE
    send_wr: IB_WR_SEND_WITH_INV(invalidate_rkey) - invalidates a remote MR; moves the MR to FREE

To make this all recoverable in the face of errors, let there be more than one map present for an FMR, indexed by the key portion of the l/rkeys.
Alternative view of FMRs:

    verb: ib_alloc_mr(pd, max_num_sg) - creates an empty MR object with no maps,
        with l/rkey = [index, key], where index is fixed and key is some initial
        value.
    verb: ib_update_fast_reg_key(mr, newkey) - updates the key portion of the
        l/rkey.
    verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset) - creates a new map, either
        from newly allocated memory or by re-using an INVALID map. Maps are all
        the same size (max_num_sg). The key (map index) of this map is the
        current key from the l/rkey. The initial state of the map is FREE (and
        thus not usable until a REG_MR work request is executed).
    verb: ib_dereg_mr(mr) - frees all maps and the MR.
    send_wr: IB_WR_REG_MR(mr, key) - Finds mr->map[key] and changes its state to
        VALID. Associates the QP with the map, since it would be hard to manage
        multiple QPs trying to use the same MR at the same time with differing
        states. Fails if the map is not FREE; a map with that key must have been
        created previously by ib_map_mr_sg with the same key. Checks the current
        number of VALID maps, and if this exceeds a limit, pauses the send queue
        until there is room to register another MR.
    send_wr: IB_WR_LOCAL_INV -
        (execute) - Looks up the map with the same index and key, changes its
        state to FREE, and dissociates it from the QP. Leaves the map contents
        the same.
        (complete) - When the local invalidate operation has completed (after
        all previous send queue WQEs have completed), changes its state to
        INVALID and places the map resources on a free list or frees the memory.
    send_wr: IB_WR_SEND_WITH_INV - the same, except at the remote end.
    retry: - If a retry occurs for a send queue, back up the requester to the
        first incomplete PSN and change the state of all maps which were VALID
        at that PSN back to VALID. This will require maintaining a list of the
        valid maps at the boundary between completed and un-completed WQEs.
    Arrival of an RDMA packet - Look up the MR from the index and the map from
        the key, and if the state is VALID, carry out the memory copy.

This is an improvement over the current state. At the moment we have only two maps: one for making new ones and one for doing IO.
There is no room to back up, but at the moment the retry logic assumes that you can, which is false. This can be fixed easily by forcing all local operations to be fenced, which is what we are doing at the moment at HPE, but this can insert long delays between every new FMR instance. By allowing three maps and then fencing, we could back up over one broken IO operation without too much of a delay. Even with a clean network, the current design of the retransmit timer, which is never cleared and which can fire frequently, can make a mess of the MB-sized IOs used for storage.

Bob