> On Feb 12, 2024, at 8:33 AM, Jason Gunthorpe <jgg@xxxxxxxx> wrote:
>
> On Sun, Feb 11, 2024 at 02:24:16PM -0500, Kevan Rehm wrote:
>>
>>>> An application started by pytorch does a fork, then the child
>>>> process attempts to use libfabric to open a new DAOS infiniband
>>>> endpoint. The original endpoint is owned and still in use by the
>>>> parent process.
>>>>
>>>> When the parent process created the endpoint (fi_fabric,
>>>> fi_domain, fi_endpoint calls), the mlx5 driver allocated memory
>>>> pages for use in SRQ creation, and issued a madvise to say that
>>>> the pages are DONTFORK. These pages are associated with the
>>>> domain’s ibv_device, which is cached in the driver. After the fork,
>>>> when the child process calls fi_domain for its new endpoint, it
>>>> gets the ibv_device that was cached at the time it was created by
>>>> the parent. The child process immediately segfaults when trying
>>>> to create an SRQ, because the pages associated with that
>>>> ibv_device are not in the child’s memory. There doesn’t appear
>>>> to be any way for a child process to create a fresh endpoint
>>>> because of the caching being done for ibv_devices.
>>
>>> For anyone who is interested in this issue, please follow the links below:
>>> https://github.com/ofiwg/libfabric/issues/9792
>>> https://daosio.atlassian.net/browse/DAOS-15117
>>>
>>> Regarding the issue, I don't know if mlx5 is actively used to run
>>> libfabric, but the mentioned call to ibv_dontfork_range() has existed
>>> since the prehistoric era.
>>
>> Yes, libfabric has used mlx5 for a long time.
>>
>>> Do you have any environment variables set related to rdma-core?
>>>
>> IBV_FORK_SAFE is set to 1.
>>
>>> Is it related to ibv_fork_init()? It must be called when fork() is called.
>>
>> Calling ibv_fork_init() doesn’t help, because it immediately checks
>> mm_root, sees it is non-zero (from the parent process’s prior call),
>> and returns without doing anything.
>> There is now a simplified test case; see
>> https://github.com/ofiwg/libfabric/issues/9792 for ongoing analysis.
>
> This was all fixed in the kernel, upgrade your kernel and forking
> works much more reliably, but I'm not sure this case will work.

I agree, that won’t help here.

> It is a libfabric problem if it is expecting memory to be registered
> for RDMA and be used by both processes in a fork. That cannot work.
>
> Don't do that, or make the memory MAP_SHARED so that the fork children
> can access it.

Libfabric agrees: it wants to use separate registered memory in the
child, but there doesn’t seem to be a way to do this.

> The bugs seem a bit confused, there is no issue with ibv_device
> sharing. Only with actually sharing underlying registered memory. Ie
> sharing a SRQ memory pool between the child and parent.

Libfabric calls rdma_get_devices(), then walks the list looking for the
entry for the correct domain (mlx5_1). It saves a pointer to the
matching dev_list entry, which is an ibv_context structure. Wrapped
around that ibv_context is the mlx5 context, which contains the
registered pages that had dontfork set when the parent established its
connection. When the child process calls rdma_get_devices(), wanting to
create a fresh connection to the same mlx5_1 domain, it instead gets
back the same ibv_context that the parent got, not a fresh one, and so
creation of an SRQ will segfault.

How can libfabric force verbs to return a fresh ibv_context for mlx5_1
instead of the one returned to the parent process?
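For reference, the lookup is roughly as follows (a simplified sketch,
not the actual libfabric source; "mlx5_1" is just the device name from
this report):

    /* Simplified sketch of the device lookup described above; not
     * the actual libfabric source. */
    #include <string.h>
    #include <rdma/rdma_cma.h>

    static struct ibv_context *lookup_verbs_ctx(const char *dev_name)
    {
            int i, num_devices;
            struct ibv_context **dev_list = rdma_get_devices(&num_devices);
            struct ibv_context *ctx = NULL;

            if (!dev_list)
                    return NULL;

            for (i = 0; i < num_devices; i++) {
                    if (!strcmp(ibv_get_device_name(dev_list[i]->device),
                                dev_name)) {
                            /* librdmacm hands back its cached ibv_context,
                             * so a forked child sees the same context (and
                             * the same dontfork'd pages) as the parent. */
                            ctx = dev_list[i];
                            break;
                    }
            }
            rdma_free_devices(dev_list);  /* frees the array, not the contexts */
            return ctx;
    }

    /* In both parent and child, lookup_verbs_ctx("mlx5_1") returns the
     * same cached context rather than a freshly opened one. */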
>
> "fork safe" does not magically make all scenarios work, it is
> targeted at a specific use case where an rdma-using process forks and
> the fork does not continue to use rdma.
>
> Jason
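For context, the use case described above looks roughly like this
(an illustrative sketch only, not code from this thread; device choice
and error handling are simplified):

    /* Parent keeps using verbs after fork(); the child never touches
     * the parent's RDMA resources and simply execs or exits. */
    #include <unistd.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
            struct ibv_device **devs;
            struct ibv_context *ctx;
            pid_t pid;

            if (ibv_fork_init())            /* roughly what IBV_FORK_SAFE=1 does */
                    return 1;

            devs = ibv_get_device_list(NULL);
            if (!devs || !devs[0])
                    return 1;

            ctx = ibv_open_device(devs[0]); /* first device, for illustration */
            if (!ctx)
                    return 1;

            pid = fork();
            if (pid == 0) {
                    /* Child: does not reuse the parent's context, MRs, or
                     * queues; it execs (or exits) instead. */
                    execlp("true", "true", (char *)NULL);
                    _exit(127);
            }

            /* Parent: continues RDMA work on ctx as before the fork. */
            ibv_close_device(ctx);
            ibv_free_device_list(devs);
            return 0;
    }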