>> An application started by PyTorch does a fork, then the child process
>> attempts to use libfabric to open a new DAOS InfiniBand endpoint. The
>> original endpoint is owned and still in use by the parent process.
>>
>> When the parent process created the endpoint (fi_fabric, fi_domain,
>> fi_endpoint calls), the mlx5 driver allocated memory pages for use in
>> SRQ creation and issued a madvise to mark the pages DONTFORK. These
>> pages are associated with the domain's ibv_device, which is cached in
>> the driver. After the fork, when the child process calls fi_domain for
>> its new endpoint, it gets the ibv_device that was cached at the time
>> the parent created it. The child process immediately segfaults when
>> trying to create an SRQ, because the pages associated with that
>> ibv_device are not in the child's memory. There doesn't appear to be
>> any way for a child process to create a fresh endpoint because of the
>> caching being done for ibv_devices.
>>
> For anyone who is interested in this issue, please follow the links
> below:
> https://github.com/ofiwg/libfabric/issues/9792
> https://daosio.atlassian.net/browse/DAOS-15117
>
> Regarding the issue, I don't know if mlx5 is actively used to run
> libfabric, but the mentioned call to ibv_dontfork_range() has existed
> since the prehistoric era.

Yes, libfabric has used mlx5 for a long time.

> Do you have any environment variables set related to rdma-core?

IBV_FORK_SAFE is set to 1.

> Is it related to ibv_fork_init()? It must be called when fork() is
> used.

Calling ibv_fork_init() doesn't help, because it immediately checks
mm_root, sees it is non-zero (from the parent process's prior call), and
returns doing nothing.

There is now a simplified test case; see
https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.
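
In rough outline, the failing pattern reduces to the sketch below. To be
clear, this is a hypothetical minimal reproducer, not the actual test
case from the issue; the "verbs" provider name, the API version, and
the error handling are my assumptions.

/* Sketch of the fork/SRQ crash pattern described above.
 * build: cc repro.c -lfabric */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

static void open_endpoint(const char *who)
{
	struct fi_info *hints, *info;
	struct fid_fabric *fabric;
	struct fid_domain *domain;
	struct fid_ep *ep;
	int ret;

	hints = fi_allocinfo();
	hints->ep_attr->type = FI_EP_RDM;
	hints->fabric_attr->prov_name = strdup("verbs"); /* assumption */

	ret = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info);
	if (ret) { fprintf(stderr, "%s: fi_getinfo: %d\n", who, ret); exit(1); }

	ret = fi_fabric(info->fabric_attr, &fabric, NULL);
	if (ret) { fprintf(stderr, "%s: fi_fabric: %d\n", who, ret); exit(1); }

	/* In the child, this hands back the ibv_device cached by the
	 * parent's earlier fi_domain call. */
	ret = fi_domain(fabric, info, &domain, NULL);
	if (ret) { fprintf(stderr, "%s: fi_domain: %d\n", who, ret); exit(1); }

	/* In the child, this segfaults inside mlx5 during SRQ creation,
	 * because the DONTFORK pages tied to the cached ibv_device are
	 * not present in the child's address space. */
	ret = fi_endpoint(domain, info, &ep, NULL);
	if (ret) { fprintf(stderr, "%s: fi_endpoint: %d\n", who, ret); exit(1); }

	fi_freeinfo(hints);
}

int main(void)
{
	open_endpoint("parent");	/* parent opens and keeps its endpoint */

	pid_t pid = fork();
	if (pid == 0) {
		open_endpoint("child");	/* tries a fresh endpoint; crashes */
		_exit(0);
	}
	waitpid(pid, NULL, 0);
	return 0;
}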
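
For context on why the child's ibv_fork_init() call is a no-op, here is
a small self-contained demo that mimics (does not copy) the early-return
logic in rdma-core's libibverbs: mm_root is set by the parent's first
call, survives the fork in the child's copied address space, and makes
the child's call return without re-registering anything.

#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>

static void *mm_root;	/* stands in for libibverbs' region-tree root */
static int too_late;	/* set once memory is registered w/o fork support */

static int fake_fork_init(void)
{
	if (mm_root)
		return 0;	/* already initialized: silently do nothing */
	if (too_late)
		return EINVAL;
	mm_root = &mm_root;	/* placeholder for building the tree; the
				 * real code would start its
				 * madvise(MADV_DONTFORK) tracking here */
	return 0;
}

int main(void)
{
	fake_fork_init();		/* parent initializes */
	if (fork() == 0) {
		/* child: mm_root is non-NULL in the copied address
		 * space, so this returns 0 having done nothing */
		int ret = fake_fork_init();
		printf("child: fork_init returned %d (no-op)\n", ret);
		_exit(0);
	}
	wait(NULL);
	return 0;
}

Because mm_root is inherited through the copied address space, there is
no path for the child to rebuild the dontfork tracking for a fresh set
of verbs resources, which matches the behavior described above.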