On Fri, Feb 07, 2025 at 11:35:40AM -0800, Jörn Engel wrote:
> On Fri, Feb 07, 2025 at 01:12:33PM +0000, Lorenzo Stoakes wrote:
> >
> > So TL;DR is - aggregate operations failing means any or all of the
> > operation failed, you can no longer rely on the mapping state being
> > what you expected.
>
> Coming back to the "what should the interface be?" question, I can see
> three reasonable answers:
> 1. Failure should result in no change. We have a bug and will fix it.
> 2. Failure should result in no change. But fixing things is exceedingly
>    hard and we may have to live with current reality for a long time.
> 3. Failure should result in undefined behavior.
>
> I think you convincingly argue against the first answer. It might still
> be useful to also argue against the third answer.

To be clear, you won't get any kind of undefined behaviour (what that
means wrt the kernel is not entirely clear - but if it means the
compiler sort of 'anything might happen', then no), nor incomplete
state. You simply cannot differentiate, without at least further
investigation, between partial success/failure of an aggregate
operation and total failure of it, based on the error code alone. The
first sketch at the end of this mail illustrates the failure mode.

> For background, I wrote a somewhat weird memory allocator in 2017,
> called "big_allocate". The underlying problem is that your favorite
> malloc tends to do a reasonable job for small to medium objects, but
> eventually gives up and calls mmap()/munmap() for large objects. With
> a heavily threaded process, the combination of mmap_sem and TLB
> shootdown via IPI is a big performance-killer. The solution is a
> specialized allocator for large objects instead of mmap()/munmap().
>
> The original (and still current) design of big_allocate has a mapping
> structure somewhat similar to "struct page" in the kernel. It relies
> on having a large chunk of virtual memory space that it directly
> controls, so that it can have a simple 1:1 mapping between virtual
> memory and "struct page".
>
> To get a large chunk of virtual memory space, big_allocate does a
> MAP_NONE mmap(). It then later does the MAP_RW mmap() to allocate
> memory. Often combined with MAP_HUGETLB, for obvious performance
> reasons. (Side note: I wish MAP_RW existed in the headers.)
>
> If memory serves, big_allocate resulted in a 2-3% macrobenchmark
> improvement.
>
> Current big_allocate has a number of ugly warts I rather dislike. One
> of those warts is that you now have existing users that rely on mmap()
> over existing MAP_NONE mappings working. At least with the special set
> of conditions we care about.

I guess you mean PROT_NONE? :)

For the case in this thread you would have to have mapped a hugetlb
area over the PROT_NONE one without MAP_NORESERVE and with
insufficiently reserved hugetlb pages, a combination which should be
expected to fail sometimes.

If you instead perform an mprotect() to make the range R/W, you end up
with a 'one and done' operation (second sketch below).

I'd also suggest that hugetlb doesn't seem a good fit for a malloc
library to me, as you rely on reserved pages. Wouldn't it make more
sense to allocate memory that gets THP pages instead? You could
MADV_COLLAPSE to try to make sure of that, though if the range is
aligned correctly we should automagically give you those anyway (third
sketch below).

> I have some plans to rewrite big_allocate with a different design. But
> for now we have existing code that may make your life harder than you
> wished for.
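For concreteness, here are the sketches referred to above. First, the
reserve-then-commit pattern as I understand it from your description -
sizes, flags and error handling are illustrative, not big_allocate's
actual code:

#include <stdio.h>
#include <sys/mman.h>

#define RESERVE_SIZE	(64UL << 30)	/* reserve 64 GiB of address space */
#define CHUNK_SIZE	(2UL << 20)	/* commit in 2 MiB chunks */

int main(void)
{
	/* Reservation: no access, no backing store, address space only. */
	void *base = mmap(NULL, RESERVE_SIZE, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED) {
		perror("reserve");
		return 1;
	}

	/* Commit: map R/W memory over part of the reservation. */
	void *p = mmap(base, CHUNK_SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED) {
		/*
		 * This is the aggregate-failure case: the old PROT_NONE
		 * mapping under [base, base + CHUNK_SIZE) may or may not
		 * still be in place - the error code alone cannot tell
		 * you which, you'd have to inspect the mappings.
		 */
		perror("commit");
		return 1;
	}
	return 0;
}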
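Second, the mprotect() variant, continuing from the sketch above. The
mapping created at reserve time is never replaced, only its protection
changes, so nothing can end up unmapped on failure:

/* Instead of mmap(MAP_FIXED) over the PROT_NONE reservation: */
if (mprotect(base, CHUNK_SIZE, PROT_READ | PROT_WRITE) == -1) {
	/*
	 * Nothing is unmapped on failure. At worst the protection of
	 * some pages in the range may have changed, but the mapping
	 * itself survives and the operation can simply be retried.
	 */
	perror("mprotect");
}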
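Third, the THP route. This is a sketch with a hypothetical helper, and
note the caveats: MADV_COLLAPSE needs Linux 6.1+, whether THP applies
at all depends on the system's THP configuration, and mmap(NULL, ...)
does not guarantee PMD alignment (a real allocator would over-allocate
and trim to get an aligned range):

#include <stddef.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* PMD size on x86-64; arch-dependent */

static void *alloc_thp(size_t size)	/* hypothetical helper */
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	/*
	 * Hint that we want THP backing; a PMD-aligned, PMD-sized
	 * range will often get huge pages even without this. After
	 * populating, madvise(p, size, MADV_COLLAPSE) can request a
	 * synchronous collapse on Linux >= 6.1.
	 */
	madvise(p, size, MADV_HUGEPAGE);
	return p;
}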