On 03/02/2021 16:00, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 02:43:58PM +0200, Gal Pressman wrote:
>>> On Tue, Feb 02, 2021 at 12:05:36PM -0500, Peter Xu wrote:
>>>
>>>>> Gal, you could also MADV_DONTFORK this range if you are explicitly
>>>>> allocating them via special mmap.
>>>>
>>>> Yeah, I wanted to mention this one too but I just forgot when replying: the issue
>>>> thread previously pasted smells like some people would like to drop
>>>> MADV_DONTFORK, but if it can still be applied I don't know why
>>>> not..
>>>
>>> I want to drop the MADV_DONTFORK for dynamic data memory allocated by
>>> the application layers (eg with malloc) without knowledge of how they
>>> will be used.
>>>
>>> This case is a buffer internal to the communication system that we
>>> know at allocation time how it will be used; so an explicit,
>>> deliberate MADV_DONTFORK is fine.
>>
>> We are referring to libfabric's bounce buffers, correct?
>> Libfabric could be considered the "app" here; it's not clear why these
>> buffers should be DONTFORK'd before ibv_reg_mr() but others aren't.
>
> I assumed they were internal to the EFA code itself.

The hugepages allocation is part of libfabric's generic bufpool implementation:
https://github.com/ofiwg/libfabric/blob/cde8665ca5ec2fb957260490d0c8700d8ac69863/include/linux/osd.h#L64

I guess we could madvise them at the libfabric provider's layer (rough sketch at the end of this mail).

>> Anyway, it should be simple enough to madvise them after allocation, although I
>> think it's part of libfabric's generic code (which isn't necessarily used on
>> top of rdma-core).
>
> Ah, so that is a reasonable justification for wanting to fix this in
> the kernel..
>
> Let's give Peter some time first.
>
> The other direction to validate this approach is to remove the
> MAP_HUGETLB flags and rely on THP instead, and/or mark them as
> MAP_SHARED.
>
> I'm not sure generic code should be using MAP_HUGETLB..

It's using MAP_HUGETLB but has a fallback in case it fails:

	ret = ofi_alloc_hugepage_buf((void **) &buf_region->alloc_region,
				     pool->alloc_size);
	/* If we can't allocate huge pages, fall back to normal
	 * allocations for all future attempts.
	 */
	if (ret) {
		pool->attr.flags &= ~OFI_BUFPOOL_HUGEPAGES;
		goto retry;
	}

	buf_region->flags = OFI_BUFPOOL_HUGEPAGES;

> This would be enough to confirm that everything else is working as
> expected.

Agree.
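
FWIW, a rough, untested sketch of what the provider-level madvise could look like.
The helper name (alloc_dontfork_buf) and the exact hook into the bufpool are made
up for illustration; the real code would also have to round the size up to the
huge page size, as ofi_alloc_hugepage_buf already does:

	#define _GNU_SOURCE	/* MADV_DONTFORK on older glibc */
	#include <stddef.h>
	#include <sys/mman.h>

	/* Allocate a bounce buffer explicitly and mark it MADV_DONTFORK
	 * right away, before ibv_reg_mr(), so a fork() in the application
	 * never touches these pages.
	 */
	static void *alloc_dontfork_buf(size_t size)
	{
		void *buf;

		/* Try explicit huge pages first, as the bufpool does today.
		 * size is assumed to already be huge-page aligned here.
		 */
		buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (buf == MAP_FAILED) {
			/* Fall back to normal pages (THP may still apply) */
			buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (buf == MAP_FAILED)
				return NULL;
		}

		/* The explicit, deliberate MADV_DONTFORK discussed above */
		if (madvise(buf, size, MADV_DONTFORK)) {
			munmap(buf, size);
			return NULL;
		}

		return buf;
	}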