# The problem

If `RDMAV_FORK_SAFE` or `IBV_FORK_SAFE` is set, rdma-core will call `ibv_dontfork_range` to mark regions of memory that will be used for RDMA as `MADV_DONTFORK`, to prevent CoW from relocating them.

`ibv_dontfork_range` calls `ibv_madvise_range`, which automatically expands the provided memory range to page boundaries, rounding the start down and the end up (libibverbs/memory.c:L638-L640):

    start = (uintptr_t) base & ~(range_page_size - 1);
    end   = ((uintptr_t) (base + size + range_page_size - 1) &
             ~(range_page_size - 1)) - 1;

This behavior avoids `EINVAL` from the kernel, but it can also mark unrelated data that happens to share a page with the registered region as `MADV_DONTFORK`.

In particular, we ran into a case where `aws-ofi-nccl` was registering a region inside of a (sub-page-size) `malloc`'d struct. With some probability, that struct would end up on a page that also contains the glibc `struct malloc_state` managing that heap arena. When this happens, `fork` will result in a corrupted heap, and we would see post-fork segfaults from the child inside `__malloc_fork_unlock_child`:

    #0  __malloc_fork_unlock_child () at arena.c:193
    #1  0x00007fe2a996fab5 in __libc_fork () at ../sysdeps/nptl/fork.c:188
    #2  0x00007fe2aa6e3941 in subprocess_fork_exec (self=<optimized out>, args=<optimized out>) at /usr/local/src/conda/python-3.8.10/Modules/_posixsubprocess.c:693
    ...

Googling for `__malloc_fork_unlock_child segfault` finds a handful of reports, most or all of which also implicate RDMA setups, that I suspect of having the same root cause.

# The proposed behavior change

The proximate bug here is arguably in the libibverbs clients that are making the problematic registrations, but I'd like to see libibverbs be more helpful here by rejecting non-page-aligned regions, at least in fork-safe mode. Marking memory we don't control as `MADV_DONTFORK` is *always* incorrect behavior, even if most of the time it may not have immediate consequences.
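
To make the proposal concrete, here is a minimal sketch of what such a check might look like. It reuses the rounding arithmetic quoted above and then tests whether the rounding changed anything. The names `struct page_range`, `expand_to_pages`, and `range_is_page_aligned` are hypothetical and are not part of libibverbs; the page size is passed in explicitly and is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: a page-rounded range with an inclusive end. */
struct page_range {
    uintptr_t start;
    uintptr_t end; /* address of the last byte covered */
};

/* Expand [base, base + size) outward to page boundaries, using the same
 * arithmetic as the snippet quoted from libibverbs/memory.c.
 * Assumption: page_size is a non-zero power of two. */
static struct page_range expand_to_pages(uintptr_t base, size_t size,
                                         uintptr_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    struct page_range r;
    r.start = base & ~(page_size - 1);
    r.end = ((base + size + page_size - 1) & ~(page_size - 1)) - 1;
    return r;
}

/* Hypothetical check: true only when the caller's range already covers
 * whole pages, i.e. rounding would not pull in any neighbouring bytes.
 * Empty ranges are treated as not aligned. */
static bool range_is_page_aligned(uintptr_t base, size_t size,
                                  uintptr_t page_size)
{
    struct page_range r = expand_to_pages(base, size, page_size);
    return size != 0 && r.start == base && r.end == base + size - 1;
}
```

With 4 KiB pages, a 64-byte region at `0x1010` expands to `0x1000` through `0x1fff`, so this check would flag it, while a 4096-byte region at `0x1000` passes unchanged. Whether a failed check should produce an error or only a warning is a separate policy question.
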
I expect this change could pose compatibility problems for existing libraries. It could initially be rolled out as a warning, which would help surface the problem and get downstream projects corrected, and would also make it easier for administrators to debug.

# Details of our environment

The code inside aws-ofi-nccl that performs the problematic registration (by way of libfabric) is here:
https://github.com/aws/aws-ofi-nccl/blob/f16565b2560d21f038a171007d5800ddd9ba1206/src/nccl_ofi_net.c#L1736-L1765

Following our report to AWS, they've fixed the bug on their end here:
https://github.com/aws/aws-ofi-nccl/commit/caa40416bae9562a615d730c8a706d38fba1a9b9

Thanks,
- Nelson Elhage