On Sun, Feb 11, 2024 at 02:24:16PM -0500, Kevan Rehm wrote:
>
> >> An application started by pytorch does a fork, then the child
> >> process attempts to use libfabric to open a new DAOS infiniband
> >> endpoint. The original endpoint is owned and still in use by the
> >> parent process.
> >>
> >> When the parent process created the endpoint (fi_fabric,
> >> fi_domain, fi_endpoint calls), the mlx5 driver allocated memory
> >> pages for use in SRQ creation, and issued a madvise to say that
> >> the pages are DONTFORK. These pages are associated with the
> >> domain’s ibv_device which is cached in the driver. After the fork
> >> when the child process calls fi_domain for its new endpoint, it
> >> gets the ibv_device that was cached at the time it was created by
> >> the parent. The child process immediately segfaults when trying
> >> to create a SRQ, because the pages associated with that
> >> ibv_device are not in the child’s memory. There doesn’t appear
> >> to be any way for a child process to create a fresh endpoint
> >> because of the caching being done for ibv_devices.
> >
> > For anyone who is interested in this issue, please follow the links below:
> > https://github.com/ofiwg/libfabric/issues/9792
> > https://daosio.atlassian.net/browse/DAOS-15117
> >
> > Regarding the issue, I don't know if mlx5 actively used to run
> > libfabric, but the mentioned call to ibv_dontfork_range() existed from
> > prehistoric era.
>
> Yes, libfabric has used mlx5 for a long time.
>
> > Do you have any environment variables set related to rdma-core?
>
> IBV_FORK_SAFE is set to 1
>
> > Is it related to ibv_fork_init()? It must be called when fork() is called.
>
> Calling ibv_fork_init() doesn’t help, because it immediately checks
> mm_root, sees it is non-zero (from the parent process’s prior call),
> and returns doing nothing.
>
> There is now a simplified test case, see
> https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.

This was all fixed in the kernel. Upgrade your kernel and forking
works much more reliably, but I'm not sure this case will work.

It is a libfabric problem if it is expecting memory to be registered
for RDMA and be used by both processes in a fork. That cannot work.

Don't do that, or make the memory MAP_SHARED so that the fork children
can access it.

The bugs seem a bit confused: there is no issue with ibv_device
sharing, only with actually sharing the underlying registered memory,
i.e. sharing an SRQ memory pool between the child and parent.

"fork safe" does not magically make all scenarios work. It is targeted
at a specific use case where an RDMA-using process forks and the fork
does not continue to use RDMA.

Jason
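
To illustrate the distinction above, here is a minimal sketch (Linux only, Python 3.8 or later) that is not taken from rdma-core or libfabric. It uses plain anonymous mappings as a stand-in for driver-allocated queue memory; no RDMA registration is involved, and both function names are hypothetical. It shows the two behaviours discussed in this thread: a MAP_SHARED mapping stays usable in both processes after fork(), while a private mapping marked MADV_DONTFORK is absent from the child, so touching it there faults.

```python
import mmap
import os
import signal

PAGE = mmap.PAGESIZE


def shared_mapping_after_fork():
    """Child writes into a MAP_SHARED mapping; parent reads the bytes back."""
    buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
    pid = os.fork()
    if pid == 0:
        buf[:5] = b"child"      # same physical pages as the parent's mapping
        os._exit(0)
    os.waitpid(pid, 0)
    data = bytes(buf[:5])
    buf.close()
    return data


def dontfork_mapping_after_fork():
    """Child touches a MADV_DONTFORK mapping; return the signal that kills it."""
    buf = mmap.mmap(-1, PAGE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    buf[:6] = b"parent"
    buf.madvise(mmap.MADV_DONTFORK)   # range is not copied into children
    pid = os.fork()
    if pid == 0:
        _ = buf[0]              # address is unmapped in the child
        os._exit(0)             # only reached if the access did not fault
    _, status = os.waitpid(pid, 0)
    buf.close()
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


if __name__ == "__main__":
    print("shared mapping, parent reads:", shared_mapping_after_fork())
    print("DONTFORK mapping, child killed by signal:",
          dontfork_mapping_after_fork())
```

The second function has the same shape as the reported failure: the child inherits a pointer (here the mmap object, in the report the cached ibv_device state) whose backing pages were deliberately left out of its address space.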