Re: Segfault in mlx5 driver on infiniband after application fork

On 2/8/2024 4:52 PM, Leon Romanovsky wrote:

On Wed, Feb 07, 2024 at 07:17:01PM +0000, Rehm, Kevan wrote:
Greetings,

I don’t see a way to open a ticket at rdma-core; it was suggested that I send this email instead.

I have been chasing a problem in rdma-core-47.1. Originally, I opened a ticket in libfabric, but it was pointed out that mlx5 is not part of libfabric. The full description of the problem, plus debug notes, is documented in issue 9792 in the libfabric GitHub repository; please have a look there rather than having me repeat all of the background information in this email.

An application started by pytorch forks, then the child process attempts to use libfabric to open a new DAOS infiniband endpoint. The original endpoint is owned by, and still in use by, the parent process.

When the parent process created the endpoint (fi_fabric, fi_domain, fi_endpoint calls), the mlx5 driver allocated memory pages for use in SRQ creation, and issued a madvise to mark the pages as DONTFORK. These pages are associated with the domain’s ibv_device, which is cached in the driver. After the fork, when the child process calls fi_domain for its new endpoint, it gets the ibv_device that was cached when the parent created its endpoint. The child process immediately segfaults when trying to create an SRQ, because the pages associated with that ibv_device are not mapped in the child’s address space. There doesn’t appear to be any way for a child process to create a fresh endpoint, because of the caching being done for ibv_devices.
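
The failure mode described above can be reproduced without any RDMA hardware. The following is a minimal sketch, assuming Linux and plain anonymous memory rather than the driver's actual SRQ buffers; the helper name child_signal_on_access is hypothetical and is not part of rdma-core or libfabric. It marks a page MADV_DONTFORK in the parent, forks, and reports how the child died when it touched the inherited pointer.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not an rdma-core API).
 * Maps one page, optionally marks it MADV_DONTFORK, forks, and lets the
 * child read the page through the pointer it inherited from the parent.
 *
 * Returns the signal number that killed the child, 0 if the child read
 * the page successfully, or -1 if setup failed.
 */
static int child_signal_on_access(int mark_dontfork)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    memset(buf, 0x5a, (size_t)page);

    /* Same advice the original post says the driver applies to its pages. */
    if (mark_dontfork && madvise(buf, (size_t)page, MADV_DONTFORK) != 0) {
        munmap(buf, (size_t)page);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(buf, (size_t)page);
        return -1;
    }
    if (pid == 0) {
        /* Child: the pointer value is inherited, but with MADV_DONTFORK
         * the mapping behind it is not, so this read faults. */
        volatile char c = buf[0];
        _exit(c == 0x5a ? 0 : 1);
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(buf, (size_t)page);
    if (waited < 0)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling child_signal_on_access(1) exercises the DONTFORK case and child_signal_on_access(0) the ordinary copy-on-write case. The sketch only illustrates why a cached pointer into a DONTFORK region is unusable in the child; it does not model the ibv_device caching itself.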

Is this the proper way to “open a ticket” against rdma-core?

It is the right place, but I wouldn't call it the "proper way".
For anyone who is interested in this issue, please follow the links below:
https://github.com/ofiwg/libfabric/issues/9792
https://daosio.atlassian.net/browse/DAOS-15117

Regarding the issue, I don't know if mlx5 is actively used to run
libfabric, but the mentioned call to ibv_dontfork_range() has existed
since the prehistoric era.

Do you have any environment variables set related to rdma-core?


Is it related to ibv_fork_init()? It must be called when fork() is called.




