Re: [RFC] Alternative design for fast register physical memory

On 25/05/2022 06:28, Bob Pearson wrote:
> Jason,
>
> There has been a lot of chatter on the FMRs in rxe recently and I am also trying to help out at home with
> the folks who are trying to run Lustre on rxe. The last fix for this was to split the state in an FMR into
> two with separate rkeys and memory maps so that apps can pipeline the preparation of IO and doing IO.
>
> However, I am convinced that the current design only works by accident when it works. The thing that really
> makes a hash of it is retries. Unfortunately the documentation on all this is almost non existent. Lustre
> (actually o2iblnd) makes heavy use of FMRs and typically has several different MRs in flight in the send queue
> with a mixture of local and remote writes accessing these MRs interleaved with REG_MRs and INVALIDATE_MR local
> work requests. When a packet gets dropped from a WQE deep in the send queue the result is nothing works at all.
>
> We have a workaround by fencing all the local operations which more or less works but will have bad performance.
> The maps used in FMRs have fairly short lifetimes but definitely longer than we can support today. I am
> trying to work out the semantics of everything.
>
> IBA view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)			- creates empty MR object
> 	roughly Allocate L_Key
>
> verb: ib_dereg_mr(mr)					- destroys MR object
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- builds a map for MR
> 	roughly (Re)Register Physical Memory Region
>
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>
> send wr: IB_WR_REG_MR(mr, key)				- moves MR from FREE to VALID and updates
> 	roughly Fast Register Physical Memory Region	  key portion of l/rkey to key
>
> send_wr: IB_WR_LOCAL_INV(invalidate_rkey)		- invalidate local MR moves MR to FREE
>
> send_wr: IB_SEND_WITH_INV(invalidate_rkey)		- invalidate remote MR moves MR to FREE
>
>
> To make this all recoverable in the face of errors let there be more than one map present for an
> FMR indexed by the key portion of the l/rkeys.
>
> Alternative view of FMRs:
>
> verb: ib_alloc_mr(pd, max_num_sg)			- create an empty MR object with no maps
> 							  with l/rkey = [index, key] with index
> 							  fixed and key some initial value.
>
> verb: ib_update_fast_reg_key(mr, newkey)		- update key portion of l/rkey
>
> verb: ib_map_mr_sg(mr, sg, sg_nents, sg_offset)		- create a new map from allocated memory
> 							  or by re-using an INVALID map. Maps are
> 							  all the same size (max_num_sg). The
> 							  key (index) of this map is the current
> 							  key from l/rkey. The initial state of
> 							  the map is FREE. (and thus not usable
> 							  until a REG_MR work request is used.)
>
> verb: ib_dereg_mr(mr)					- free all maps and the MR.
>
> send_wr: IB_WR_REG_MR(mr, key)				- Find mr->map[key] and change its state
> 							  to VALID. Associate QP with map since
> 							  it will be hard to manage multiple QPs
> 							  trying to use the same MR at the same
> 							  time with differing states. Fail if the
> 							  map is not FREE. A map with that key must
> 							  have been created by ib_map_mr_sg with
> 							  the same key previously. Check the current
> 							  number of VALID maps and if this exceeds
> 							  a limit pause the send queue until there
> 							  is room to reg another MR.
>
> send_wr: IB_WR_LOCAL_INV (execute)			- Lookup a map with the same index and key
> 							  Change its state to FREE and dissociate
> 							  from QP. Leave map contents the same.
> 			 (complete)			- When the local invalidate operation is
> 							  completed (after all previous send queue WQEs
> 							  have completed) change its state to INVALID
> 							  and place map resources on a free list or
> 							  free memory.
>
> send_wr: IB_SEND_WITH_INV				- same except at remote end.
>
> retry:							- if a retry occurs for a send queue. Back up
> 							  the requester to the first incomplete PSN.
> 							  Change the state of all maps which were
> 							  VALID at that PSN back to VALID. This will
> 							  require maintaining a list of valid maps
> 							  at the boundary of completed and un-completed
> 							  WQEs.
>
> Arrival of RDMA packet					  Lookup MR from index and map from key and if
> 							  the state is VALID carry out the memory copy.
>
>
> This is an improvement over the current state. At the moment we have only two maps one for making new
> ones and one for doing IO. There is no room to back up but at the moment the retry logic assumes that
> you can which is false. This can be fixed easily by forcing all local operations to be fenced
> which is what we are doing at the moment at HPE. This can insert long delays between every new FMR instance.
> By allowing three maps and then fencing we can back up one broken IO operation without too much of a delay.

Hi Bob

I think I understand almost all of your approach, except for the *retry/back up* part, which I cannot fully picture yet.
It sounds good to me.
But I think *retry* should be treated as a new feature, separate from the existing FMR bug reports that the threads below were all trying to fix:
https://lore.kernel.org/all/20220210073655.42281-1-guoqing.jiang@xxxxxxxxx/T/
https://lore.kernel.org/all/dfba7eb7-8467-59b5-2c2a-071ed1e4949f@xxxxxxxxx/T/
https://lore.kernel.org/lkml/94a5ea93-b8bb-3a01-9497-e2021f29598a@xxxxxxxxx/t/

I'm convinced that this approach can help with this bug. Shall we focus on fixing the known FMR bug above first, and then improve the *retry* feature?
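
For reference, here is a rough user-space sketch of the per-map state machine as I read it from the quoted proposal. It is only an illustration, not rxe code: struct fmr, the fmr_* helpers and the saved[] snapshot are hypothetical names, and the 24-bit index / 8-bit key split is assumed from how ib_update_fast_reg_key() rewrites only the low byte of the l/rkey. The limit on the number of VALID maps and the QP association are left out, and requiring a VALID map in the LOCAL_INV execute step is my own assumption.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_KEYS 256			/* assumed: key = low 8 bits of the l/rkey */

/* Map states named in the proposal. */
enum map_state { MAP_INVALID = 0, MAP_FREE, MAP_VALID };

/* Hypothetical MR with one map slot per key value. */
struct fmr {
	uint32_t rkey;			/* [index:24 | key:8] */
	enum map_state state[NUM_KEYS];
	enum map_state saved[NUM_KEYS];	/* states at the completed/un-completed WQE boundary */
};

static uint8_t key_of(uint32_t rkey) { return (uint8_t)(rkey & 0xff); }

/* ib_update_fast_reg_key(): replace only the key portion of the rkey. */
static void fmr_update_key(struct fmr *mr, uint8_t newkey)
{
	mr->rkey = (mr->rkey & 0xffffff00u) | newkey;
}

/* ib_map_mr_sg(): new map (or re-used INVALID one) under the current key, state FREE. */
static bool fmr_map(struct fmr *mr)
{
	uint8_t k = key_of(mr->rkey);

	if (mr->state[k] != MAP_INVALID)
		return false;
	mr->state[k] = MAP_FREE;
	return true;
}

/* IB_WR_REG_MR: FREE -> VALID, fail if the map is not FREE. */
static bool fmr_reg(struct fmr *mr, uint8_t key)
{
	if (mr->state[key] != MAP_FREE)
		return false;
	mr->state[key] = MAP_VALID;
	return true;
}

/* IB_WR_LOCAL_INV (execute): VALID -> FREE, map contents are kept. */
static bool fmr_inv_execute(struct fmr *mr, uint8_t key)
{
	if (mr->state[key] != MAP_VALID)
		return false;
	mr->state[key] = MAP_FREE;
	return true;
}

/* IB_WR_LOCAL_INV (complete): FREE -> INVALID, resources may now be recycled. */
static void fmr_inv_complete(struct fmr *mr, uint8_t key)
{
	if (mr->state[key] == MAP_FREE)
		mr->state[key] = MAP_INVALID;
}

/* Record the map states whenever the completion boundary advances. */
static void fmr_checkpoint(struct fmr *mr)
{
	memcpy(mr->saved, mr->state, sizeof(mr->saved));
}

/* retry: maps that were VALID at the boundary become VALID again. */
static void fmr_retry(struct fmr *mr)
{
	for (int k = 0; k < NUM_KEYS; k++)
		if (mr->saved[k] == MAP_VALID)
			mr->state[k] = MAP_VALID;
}

/* Arrival of RDMA packet: index must match and the map for that key must be VALID. */
static bool fmr_access_ok(const struct fmr *mr, uint32_t rkey)
{
	return (rkey >> 8) == (mr->rkey >> 8) &&
	       mr->state[key_of(rkey)] == MAP_VALID;
}
```

In this sketch a retry after the execute step of a LOCAL_INV still finds the map contents, because the map only becomes INVALID (and reusable) at completion. That seems to be the property that makes backing up possible.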

Thanks
Zhijian


>
> Even if you have a clean network the current design of the retransmit timer which is never cleared and which
> can fire frequently can make a mess of MB sized IOs used for storage.
>
> Bob



