rdma-core: ibv_dontfork_range should not round up to page boundaries

# The problem

If RDMAV_FORK_SAFE or IBV_FORK_SAFE is set, rdma-core will call
`ibv_dontfork_range` to mark regions of memory that will be used for
RDMA as `MADV_DONTFORK` to prevent CoW from relocating them.

`ibv_dontfork_range` calls `ibv_madvise_range`, which automatically
widens the provided memory range to page boundaries, rounding the
start down and the end up (libibverbs/memory.c:L638-L640):

        start = (uintptr_t) base & ~(range_page_size - 1);
        end   = ((uintptr_t) (base + size + range_page_size - 1) &
                 ~(range_page_size - 1)) - 1;

This behavior avoids EINVAL from the kernel, but it can also mark
unrelated data that happens to share a page with the registered
region as `MADV_DONTFORK`.
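To make the effect concrete, here is a minimal sketch of the same arithmetic as a standalone function. The names `struct page_range` and `round_out_to_pages` are mine for illustration and do not exist in rdma-core; the sketch also assumes the page size is a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type, not part of rdma-core: an inclusive byte range. */
struct page_range {
    uintptr_t start; /* first byte covered */
    uintptr_t end;   /* last byte covered */
};

/*
 * Hypothetical helper mirroring the rounding quoted above:
 * the start is rounded down and the end is rounded up to a page boundary.
 */
struct page_range round_out_to_pages(uintptr_t base, size_t size,
                                     uintptr_t page_size)
{
    struct page_range r;

    /* Assumption: page_size is a nonzero power of two. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    r.start = base & ~(page_size - 1);
    r.end = ((base + size + page_size - 1) & ~(page_size - 1)) - 1;
    return r;
}
```

For example, a 64-byte region at 0x1010 with 4 KiB pages comes back as 0x1000 through 0x1fff, so 4032 bytes that were never part of the registration are covered as well.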

In particular, we ran into a case where `aws-ofi-nccl` was
registering a region inside of a (sub-page-size) `malloc`'d struct.
With some probability, that struct would end up on a page that also
contains the glibc `struct malloc_state` managing that heap arena.
When this happens, the child does not inherit that page across `fork`
(which is what `MADV_DONTFORK` asks for), so it ends up with a
corrupted heap, and we would see post-fork segfaults from the child
inside `__malloc_fork_unlock_child`:

        #0  __malloc_fork_unlock_child () at arena.c:193
        #1  0x00007fe2a996fab5 in __libc_fork () at ../sysdeps/nptl/fork.c:188
        #2  0x00007fe2aa6e3941 in subprocess_fork_exec (self=<optimized out>, args=<optimized out>)
            at /usr/local/src/conda/python-3.8.10/Modules/_posixsubprocess.c:693
        ...

Googling for [__malloc_fork_unlock_child segfault] finds a handful of
reports -- most or all of which also implicate RDMA setups -- that I
suspect of having the same root cause.

# The proposed behavior change

The proximate bug here is arguably in the libibverbs clients that are
making the problematic registrations, but I'd like to see libibverbs be
more helpful here by rejecting non-page-aligned regions, at least in
fork-safe mode. Marking memory we don't control as `MADV_DONTFORK` is
*always* incorrect behavior, even if most of the time it may not have
immediate consequences.
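As a rough sketch of what such a check could look like, under the assumption that a rejected range is reported as `EINVAL`: `check_page_aligned` is a hypothetical name, not an existing rdma-core function, and where it would be called from is left open.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check, not an existing rdma-core function.
 * Returns 0 if [base, base + size) starts and ends on page boundaries,
 * and EINVAL otherwise.
 */
int check_page_aligned(const void *base, size_t size, size_t page_size)
{
    uintptr_t mask;

    /* Assumption: page_size must be a nonzero power of two. */
    if (page_size == 0 || (page_size & (page_size - 1)) != 0)
        return EINVAL;

    mask = (uintptr_t)page_size - 1;

    /* The start must sit exactly on a page boundary. */
    if (((uintptr_t)base & mask) != 0)
        return EINVAL;

    /* The length must be a whole number of pages. */
    if (((uintptr_t)size & mask) != 0)
        return EINVAL;

    return 0;
}
```

With 4 KiB pages, a region of 0x2000 bytes at 0x1000 passes, while the 64-byte sub-page struct from our case is rejected rather than silently widened.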

I expect this change could pose compatibility problems for existing
libraries. Potentially it could be rolled out as a warning initially,
which would help surface the problem and correct downstreams, as well
as making it easier for administrators to debug this problem.

# Details of our environment

The code inside of aws-ofi-nccl that performs the problematic
registration (by way of libfabric) is here:

https://github.com/aws/aws-ofi-nccl/blob/f16565b2560d21f038a171007d5800ddd9ba1206/src/nccl_ofi_net.c#L1736-L1765

Following our report to AWS, they've fixed the bug on their end here:
https://github.com/aws/aws-ofi-nccl/commit/caa40416bae9562a615d730c8a706d38fba1a9b9
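For other clients that hit this, one generic way to stay clear of the problem is to give each registered buffer whole pages of its own. The sketch below is not necessarily what the commit above does; `alloc_whole_pages` is a hypothetical helper, and it assumes the relevant page size is the one reported by `sysconf(_SC_PAGESIZE)`, which does not account for huge pages.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate a zeroed buffer that starts on a page
 * boundary and spans a whole number of pages, so that rounding the
 * range out to page boundaries covers nothing but this buffer.
 * Stores the rounded size in *alloc_size if it is non-NULL.
 * Returns NULL on failure; release the buffer with free().
 */
void *alloc_whole_pages(size_t size, size_t *alloc_size)
{
    long page = sysconf(_SC_PAGESIZE);
    void *buf = NULL;
    size_t rounded;

    if (page <= 0 || size == 0)
        return NULL;

    /* Round the requested size up to a multiple of the page size. */
    rounded = (size + (size_t)page - 1) & ~((size_t)page - 1);
    if (rounded < size)
        return NULL; /* size was so large that rounding overflowed */

    if (posix_memalign(&buf, (size_t)page, rounded) != 0)
        return NULL;

    memset(buf, 0, rounded);
    if (alloc_size != NULL)
        *alloc_size = rounded;
    return buf;
}
```

The caller would then register the full `*alloc_size` bytes rather than a smaller struct embedded in a shared heap page.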

Thanks,
- Nelson Elhage


