On Mon, Feb 12, 2024 at 09:37:25AM -0500, Kevan Rehm wrote:
> > This was all fixed in the kernel, upgrade your kernel and forking
> > works much more reliably, but I'm not sure this case will work.
>
> I agree, that won’t help here.
>
> > It is a libfabric problem if it is expecting memory to be registered
> > for RDMA and be used by both processes in a fork. That cannot work.
> >
> > Don't do that, or make the memory MAP_SHARED so that the fork children
> > can access it.
>
> Libfabric agrees, it wants to use separate registered memory in the
> child, but there doesn’t seem to be a way to do this.

How can that be true? libfabric is the only entity that causes memory
to be registered :)

> > The bugs seem a bit confused, there is no issue with ibv_device
> > sharing. Only with actually sharing underlying registered memory. Ie
> > sharing a SRQ memory pool between the child and parent.
>
> Libfabric calls rdma_get_devices(), then walks the list looking for
> the entry for the correct domain (mlx5_1). It saves a pointer to
> the matching dev_list entry which is an ibv_context structure.
> Wrapped on that ibv_context is the mlx5 context which contains the
> registered pages that had dontfork set when the parent established
  ^^^^^^^^^^^^^^^^

It does not. Contexts don't have pages, your problem comes from
something else.

Jason
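
To illustrate the MAP_SHARED suggestion quoted above, here is a minimal
sketch in plain C with no RDMA or verbs calls involved. It assumes Linux,
and the function name child_sees_parent_write() is hypothetical (it is not
part of libfabric or rdma-core). It only shows the general fork behaviour:
a write the parent makes after fork() is visible to the child when the
buffer is MAP_SHARED, and is not visible when the buffer is MAP_PRIVATE,
because the private pages are copy-on-write.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Maps one anonymous page (shared or private), forks, lets the parent
 * write to the page after the fork, and reports what the child observed.
 * Returns 1 if the child saw the parent's write, 0 if it did not,
 * and -1 on any error.
 */
int child_sees_parent_write(int shared)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int flags = MAP_ANONYMOUS | (shared ? MAP_SHARED : MAP_PRIVATE);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    buf[0] = 0;

    int go[2]; /* pipe used to tell the child the parent has written */
    if (pipe(go) != 0) {
        munmap(buf, len);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        close(go[0]);
        close(go[1]);
        munmap(buf, len);
        return -1;
    }

    if (pid == 0) {
        char token;
        close(go[1]);
        /* Block until the parent signals that its write is done. */
        if (read(go[0], &token, 1) != 1)
            _exit(2);
        _exit(buf[0] == 'P' ? 1 : 0);
    }

    close(go[0]);
    buf[0] = 'P'; /* post-fork write by the parent */
    ssize_t sent = write(go[1], "x", 1);
    close(go[1]); /* on a failed write the child sees EOF and exits 2 */

    int status = 0;
    pid_t reaped = waitpid(pid, &status, 0);
    munmap(buf, len);

    if (sent != 1 || reaped != pid || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 1)
        return 1;
    if (WEXITSTATUS(status) == 0)
        return 0;
    return -1;
}
```

Calling child_sees_parent_write(1) exercises the MAP_SHARED case and
child_sees_parent_write(0) the MAP_PRIVATE case. This says nothing about
how libfabric or the mlx5 provider allocate their buffers; it is only the
underlying mmap/fork semantics that the advice relies on.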