On 25/05/2022 06:28, Bob Pearson wrote:
> Jason,
>
> There has been a lot of chatter on the FMRs in rxe recently and I am also trying to help out at home with
> the folks who are trying to run Lustre on rxe. The last fix for this was to split the state in an FMR into
> two with separate rkeys and memory maps so that apps can pipeline the preparation of IO and doing IO.
>
> However, I am convinced that the current design only works by accident when it works. The thing that really
> makes a hash of it is retries. Unfortunately the documentation on all this is almost nonexistent. Lustre
> (actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue
> with a mixture of local and remote writes accessing these MRs interleaved with REG_MRs and INVALIDATE_MR local
> work requests. When a packet gets dropped from a WQE deep in the send queue the result is nothing works at all.
>
> We have a workaround by fencing all the local operations which more or less works but will have bad performance.
> The maps used in FMRs have fairly short lifetimes but definitely longer than we can support today. I am
> trying to work out the semantics of everything.
>
> IBA view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)                 - creates empty MR object
>     roughly Allocate L_Key
>
> verb: ib_dereg_mr(mr)                             - destroys MR object
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)   - builds a map for MR
>     roughly (Re)Register Physical Memory Region
>
> verb: ib_update_fast_reg_key(mr, newkey)          - update key portion of l/rkey
>
> send_wr: IB_WR_REG_MR(mr, key)                    - moves MR from FREE to VALID and updates
>     roughly Fast Register Physical Memory Region    key portion of l/rkey to key
>
> send_wr: IB_WR_LOCAL_INV(invalidate_rkey)         - invalidate local MR, moves MR to FREE
>
> send_wr: IB_SEND_WITH_INV(invalidate_rkey)        - invalidate remote MR, moves MR to FREE
>
>
> To make this all recoverable in the face of errors let there be more than one map present for an
> FMR indexed by the key portion of the l/rkeys.
>
> Alternative view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)                 - create an empty MR object with no maps
>                                                     with l/rkey = [index, key] with index
>                                                     fixed and key some initial value.
>
> verb: ib_update_fast_reg_key(mr, newkey)          - update key portion of l/rkey
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)   - create a new map from allocated memory
>                                                     or by re-using an INVALID map. Maps are
>                                                     all the same size (max_num_sg). The
>                                                     key (index) of this map is the current
>                                                     key from l/rkey. The initial state of
>                                                     the map is FREE. (and thus not usable
>                                                     until a REG_MR work request is used.)
>
> verb: ib_dereg_mr(mr)                             - free all maps and the MR.
>
> send_wr: IB_WR_REG_MR(mr, key)                    - Find mr->map[key] and change its state
>                                                     to VALID. Associate QP with map since
>                                                     it will be hard to manage multiple QPs
>                                                     trying to use the same MR at the same
>                                                     time with differing states. Fail if the
>                                                     map is not FREE. A map with that key must
>                                                     have been created by ib_map_mr_sg with
>                                                     the same key previously. Check the current
>                                                     number of VALID maps and if this exceeds
>                                                     a limit pause the send queue until there
>                                                     is room to reg another MR.
>
> send_wr: IB_WR_LOCAL_INV
>     (execute)                                     - Lookup a map with the same index and key.
>                                                     Change its state to FREE and dissociate
>                                                     from QP. Leave map contents the same.
>     (complete)                                    - When the local invalidate operation is
>                                                     completed (after all previous send queue WQEs
>                                                     have completed) change its state to INVALID
>                                                     and place map resources on a free list or
>                                                     free memory.
>
> send_wr: IB_SEND_WITH_INV                         - same except at remote end.
>
> retry:                                            - if a retry occurs for a send queue, back up
>                                                     the requester to the first incomplete PSN.
>                                                     Change the state of all maps which were
>                                                     VALID at that PSN back to VALID. This will
>                                                     require maintaining a list of valid maps
>                                                     at the boundary of completed and un-completed
>                                                     WQEs.
>
> Arrival of RDMA packet                            - Lookup MR from index and map from key and if
>                                                     the state is VALID carry out the memory copy.
>
>
> This is an improvement over the current state. At the moment we have only two maps, one for making new
> ones and one for doing IO. There is no room to back up, but at the moment the retry logic assumes that
> you can, which is false. This can be fixed easily by forcing all local operations to be fenced,
> which is what we are doing at the moment at HPE. This can insert long delays between every new FMR instance.
> By allowing three maps and then fencing we can back up one broken IO operation without too much of a delay.

Hi Bob,

I think I understand almost all of your approach, except the *retry/back up* part, which I cannot
fully picture yet. It sounds good to me.

But I think the *retry* handling should be treated as a new feature, separate from our existing
FMR bug reports, which the following threads were all trying to fix:
https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@xxxxxxxxx/T/
https://lore.kernel.org/all/dfba7eb7-8467-59b5-2c2a-071ed1e4949f@xxxxxxxxx/T/
https://lore.kernel.org/lkml/94a5ea93-b8bb-3a01-9497-e2021f29598a@xxxxxxxxx/t/

I'm convinced that this approach can help with this bug. Shall we focus on fixing the known FMR bug
above first, and then improve the *retry* feature?

Thanks
Zhijian

>
> Even if you have a clean network the current design of the retransmit timer which is never cleared and which
> can fire frequently can make a mess of MB sized IOs used for storage.
>
> Bob
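
To help picture the *retry/back up* part, the map life cycle from the "Alternative view" above could be sketched in user-space C roughly as follows. This is only a sketch under assumptions, not rxe code: every name here (struct fmr, struct fmr_map, fmr_reg(), fmr_retry(), and so on) is hypothetical, MAX_MAPS = 3 is taken from the "three maps" remark, the "list of valid maps at the boundary" is reduced to a per-map flag, and the QP association and the VALID-map limit are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_MAPS 3	/* assumption: the "three maps" mentioned in the thread */

/* Map states named after the proposal: FREE, VALID, INVALID. */
enum map_state { MAP_INVALID, MAP_FREE, MAP_VALID };

struct fmr_map {
	unsigned int key;	/* key portion of the l/rkey */
	enum map_state state;
	bool valid_at_boundary;	/* VALID at the completed/un-completed WQE boundary */
};

struct fmr {
	struct fmr_map map[MAX_MAPS];
};

void fmr_init(struct fmr *mr)
{
	assert(mr != NULL);
	for (size_t i = 0; i < MAX_MAPS; i++)
		mr->map[i] = (struct fmr_map){ 0, MAP_INVALID, false };
}

/* Find the live (not INVALID) map for this key, or NULL. */
static struct fmr_map *fmr_find(struct fmr *mr, unsigned int key)
{
	for (size_t i = 0; i < MAX_MAPS; i++)
		if (mr->map[i].state != MAP_INVALID && mr->map[i].key == key)
			return &mr->map[i];
	return NULL;
}

/* ib_map_mr_sg analogue: re-use an INVALID slot; the new map starts FREE. */
int fmr_map_sg(struct fmr *mr, unsigned int key)
{
	if (fmr_find(mr, key))
		return -1;	/* a live map with this key already exists */
	for (size_t i = 0; i < MAX_MAPS; i++) {
		if (mr->map[i].state == MAP_INVALID) {
			mr->map[i] = (struct fmr_map){ key, MAP_FREE, false };
			return 0;
		}
	}
	return -1;		/* no INVALID slot left to re-use */
}

/* REG_MR (execute): FREE -> VALID, fail if the map is missing or not FREE. */
int fmr_reg(struct fmr *mr, unsigned int key)
{
	struct fmr_map *m = fmr_find(mr, key);

	if (!m || m->state != MAP_FREE)
		return -1;
	m->state = MAP_VALID;
	return 0;
}

/* REG_MR (complete): the map is now VALID at the completion boundary. */
void fmr_reg_complete(struct fmr *mr, unsigned int key)
{
	struct fmr_map *m = fmr_find(mr, key);

	if (m)
		m->valid_at_boundary = true;
}

/* LOCAL_INV (execute): VALID -> FREE, map contents are kept. */
int fmr_local_inv_execute(struct fmr *mr, unsigned int key)
{
	struct fmr_map *m = fmr_find(mr, key);

	if (!m || m->state != MAP_VALID)
		return -1;
	m->state = MAP_FREE;
	return 0;
}

/* LOCAL_INV (complete): -> INVALID, the slot can be re-used. */
void fmr_local_inv_complete(struct fmr *mr, unsigned int key)
{
	struct fmr_map *m = fmr_find(mr, key);

	if (m) {
		m->state = MAP_INVALID;
		m->valid_at_boundary = false;
	}
}

/* retry: maps that were VALID at the boundary become VALID again. */
void fmr_retry(struct fmr *mr)
{
	for (size_t i = 0; i < MAX_MAPS; i++)
		if (mr->map[i].state != MAP_INVALID && mr->map[i].valid_at_boundary)
			mr->map[i].state = MAP_VALID;
}

/* Arrival of an RDMA packet: only a VALID map may be used for the copy. */
struct fmr_map *fmr_lookup_valid(struct fmr *mr, unsigned int key)
{
	struct fmr_map *m = fmr_find(mr, key);

	return (m && m->state == MAP_VALID) ? m : NULL;
}
```

In this sketch, a map whose LOCAL_INV has executed but not yet completed is brought back to VALID by fmr_retry(), while a map whose invalidate has completed stays INVALID. What should happen on retry to maps registered after the boundary is not spelled out in the proposal, so the sketch leaves them untouched.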