Re: [PATCH RFC v2 0/4] mm: Introduce MAP_BELOW_HINT

Steven Price <steven.price@xxxxxxx> · Thu, 24 Oct 2024 11:52:40 +0100

On 23/10/2024 19:10, Liam R. Howlett wrote:
> * Steven Price <steven.price@xxxxxxx> [241023 05:31]:
>>>>   * Box64 seems to have a custom allocator based on reading 
>>>>     /proc/self/maps to allocate a block of VA space with a low enough 
>>>>     address [1]
>>>>
>>>>   * PHP has code reading /proc/self/maps - I think this is to find a 
>>>>     segment which is close enough to the text segment [2]
>>>>
>>>>   * FEX-Emu mmap()s the upper 128TB of VA on Arm to avoid full 48 bit
>>>>     addresses [3][4]
>>>
>>> Can't the limited number of applications that need to restrict the upper
>>> bound use an LD_PRELOAD compatible library to do this?
>>
>> I'm not entirely sure what point you are making here. Yes an LD_PRELOAD
>> approach could be used instead of a personality type as a 'hack' to
>> preallocate the upper address space. The obvious disadvantage is that
>> you can't (easily) layer LD_PRELOAD so it won't work in the general case.
> 
> My point is that riscv could work around the limited number of
> applications that require this.  It's not really viable for you.

Ah ok - thanks for the clarification.

>>
>>>>
>>>>   * pmdk has some funky code to find the lowest address that meets 
>>>>     certain requirements - this does look like an ASLR alternative and 
>>>>     probably couldn't directly use MAP_BELOW_HINT, although maybe this 
>>>>     suggests we need a mechanism to map without a VA-range? [5]
>>>>
>>>>   * MIT-Scheme parses /proc/self/maps to find the lowest mapping within 
>>>>     a range [6]
>>>>
>>>>   * LuaJIT uses an approach to 'probe' to find a suitable low address 
>>>>     for allocation [7]
>>>>
>>>
>>> Although I did not take a deep dive into each example above, there are
>>> some very odd things being done, we will never cover all the use cases
>>> with an exact API match.  What we have today can be made to work for
>>> these users as they have figured ways to do it.
>>>
>>> Are they pretty? no.  Are they common? no.  I'm not sure it's worth
>>> plumbing in new MM code for these users.
>>
>> My issue with the existing 'solutions' is that they all seem to be fragile:
>>
>>  * Using /proc/self/maps is inherently racy if there could be any other
>> code running in the process at the same time.
> 
> Yes, it is not thread safe.  Parsing text is also undesirable.
> 
>>
>>  * Attempting to map the upper part of the address space only works if
>> done early enough - once an allocation arrives there, there's very
>> little you can robustly do (because the stray allocation might be freed).
>>
>>  * LuaJIT's probing mechanism is probably robust, but it's inefficient -
>> LuaJIT has a fallback of linear probing, followed by no hint (ASLR),
>> followed by pseudo-random probing. I don't know the history of the code
>> but it looks like it's probably been tweaked to try to avoid performance
>> issues.
>>
>>>> The biggest benefit I see of MAP_BELOW_HINT is that it would allow a
>>>> library to get low addresses without causing any problems for the rest
>>>> of the application. The use case I'm looking at is in a library and 
>>>> therefore a personality mode wouldn't be appropriate (because I don't 
>>>> want to affect the rest of the application). Reading /proc/self/maps
>>>> is also problematic because other threads could be allocating/freeing
>>>> at the same time.
>>>
>>> As long as you don't exhaust the lower limit you are trying to allocate
>>> within - which is exactly the issue riscv is hitting.
>>
>> Obviously if you actually exhaust the lower limit then any
>> MAP_BELOW_HINT API would also fail - there's really not much that can be
>> done in that case.
> 
> Today we reverse the search, so you end up in the higher address
> (bottom-up vs top-down) - although the direction is arch dependent.
> 
> If the allocation is too high/low then you could detect, free, and
> handle the failure.

Agreed, that's fine.
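
For illustration, the "detect, free, and handle the failure" approach could be sketched roughly as below. This is only a sketch under stated assumptions: the helper name mmap_below_or_fail() and its `limit` parameter are hypothetical, and it relies on nothing beyond plain mmap()/munmap().

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `len` bytes (optionally near `hint`) and accept
 * the result only if the whole mapping ends at or below `limit`.
 * Otherwise the mapping is released and MAP_FAILED is returned.
 */
void *mmap_below_or_fail(void *hint, size_t len, uintptr_t limit)
{
	void *addr = mmap(hint, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (addr == MAP_FAILED)
		return MAP_FAILED;

	/* Overflow-safe form of: addr + len > limit */
	if ((uintptr_t)addr > limit || len > limit - (uintptr_t)addr) {
		munmap(addr, len);	/* too high: free it and report failure */
		errno = ENOMEM;
		return MAP_FAILED;
	}

	return addr;
}
```

The caller still has to decide what to do on failure (retry with another hint, fall back to a slower path, or give up), which is where the robustness question comes in.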

>>
>>> I understand that you are providing examples to prove that this is
>>> needed, but I feel like you are better demonstrating the flexibility
>>> exists to implement solutions in different ways using today's API.
>>
>> My intention is to show that today's API doesn't provide a robust way of
>> doing this. Although I'm quite happy if you can point me at a robust way
>> with the current API. As I mentioned my goal is to be able to map memory
>> in a (multithreaded) library with a (ideally configurable) number of VA
>> bits. I don't particularly want to restrict the whole process, just
>> specific allocations.
> 
> If you don't need to restrict everything, won't the hint work for your
> usecase?  I must be missing something from your requirements.

The hint only works if the hint address is actually free. Otherwise
mmap() falls back to behaving as if no hint address had been specified.

E.g.

	for (int i = 0; i < 2; i++) {
		void *addr = mmap((void*)(1UL << 32), PAGE_SIZE, PROT_NONE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
		printf("%p\n", addr);
	}

Prints something like:

0x100000000
0x7f20d21e0000

The hint is ignored for the second mmap() because there's already a VMA
at the hint address.
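
A self-contained version of the fragment above might look like the following. It is a sketch only: the helper names are hypothetical, the page size is queried with sysconf() rather than assuming a PAGE_SIZE macro, and a 64-bit system is assumed so that a hint of 1 << 32 is representable.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: request one PROT_NONE page at `hint` (a hint only,
 * no MAP_FIXED), so the kernel is free to place it elsewhere. */
void *map_page_at_hint(uintptr_t hint)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

	return mmap((void *)hint, page_size, PROT_NONE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}

/* Repeat the same hint `count` times and print where each mapping landed.
 * The mappings are deliberately left in place so that later requests
 * collide with earlier ones. */
void show_hint_fallback(uintptr_t hint, int count)
{
	for (int i = 0; i < count; i++) {
		void *addr = map_page_at_hint(hint);

		if (addr == MAP_FAILED) {
			perror("mmap");
			return;
		}
		printf("%p\n", addr);
	}
}
```

Calling show_hint_fallback((uintptr_t)1 << 32, 2) from main() reproduces the experiment: the second request cannot land at the hint because the first mapping is still there.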

So the question is how to generate a hint value that is (or has a high
likelihood of being) empty? This AFAICT is the LuaJIT approach, but
their approach is to pick random values in the hope of getting a free
address (and then working linearly up for subsequent allocations). Which
doesn't meet my idea of "robust".
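
To show what hint probing looks like in practice, here is a minimal linear-probing sketch. It is not LuaJIT's actual code: probe_below() and its parameters are hypothetical, and it simply walks hints upwards and discards any mapping the kernel placed outside the wanted range.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: try hints start, start + step, ... and return the
 * first mapping of `len` bytes that ends at or below `limit`.
 */
void *probe_below(uintptr_t start, uintptr_t limit, size_t len, size_t step)
{
	uintptr_t hint = start;

	while (hint < limit && len <= limit - hint) {
		void *addr = mmap((void *)hint, len, PROT_NONE,
				  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (addr == MAP_FAILED)
			return MAP_FAILED;

		/* Accept only if the kernel placed it inside the range. */
		if ((uintptr_t)addr <= limit - len)
			return addr;

		munmap(addr, len);	/* hint was not honoured, try the next one */

		if (step == 0 || step > limit - hint)
			break;		/* next hint would leave the range */
		hint += step;
	}

	errno = ENOMEM;
	return MAP_FAILED;
}
```

Each failed probe costs an mmap()/munmap() pair, and a densely populated range can need many probes, which is the inefficiency mentioned earlier.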

>>
>> I had typed up a series similar to this one as a MAP_BELOW flag would
>> fit my use-case well.
>>
>>> I think it would be best to use the existing methods and work around the
>>> issue that was created in riscv while future changes could mirror amd64
>>> and arm64.
>>
>> The riscv issue is a different issue to the one I'm trying to solve. I
>> agree MAP_BELOW_HINT isn't a great fix for that because we already have
>> differences between amd64 and arm64 and obviously no software currently
>> out there uses this new flag.
>>
>> However, if we had introduced this flag in the past (e.g. if MAP_32BIT
>> had been implemented more generically, across architectures and with a
>> hint value, like this new flag) then we probably wouldn't be in this
>> situation. Applications that want to restrict the VA space would be able
>> to opt-in and be portable across architectures.
> 
> I don't think that's true.  Some of the applications want all of the
> allocations below a certain threshold and by the time they are adding
> flags to allocations, it's too late.  What you are looking for is a
> counterpart to mmap_min_addr, but for higher addresses?  This would have
> to be set before any of the allocations occur for a specific binary (ie:
> existing libraries need to be below that threshold too), I think?

Well that's not what *I* am looking for. A mmap_max_addr might be useful
for others for the purpose of restricting all allocations.

I think there are roughly three classes of application:

 1. Applications which do nothing special with pointers. This is most
applications and they could benefit from any future expansions to the VA
size without any modification. E.g. if 64 bit VA addresses were somehow
available they could deal with them today (without recompilation).

 2. Applications which need VA addresses to meet certain requirements.
They might be emulating another architecture (e.g. FEX) and want
pointers that can be exposed to the emulation. They might be aware of
restrictions in JIT code (e.g. PHP). Or they might want to store
pointers in 'weird' ways which involve fewer bits - AFAICT that's the
LuaJIT situation. These applications are usually well aware that they
are doing something "unusual" and would likely use a Linux API if it
existed.

 3. Applications which abuse the top bits of a VA because they've read
the architecture documentation and they "know" that the VA space is limited.

Class 3 would benefit from mmap_max_addr - either because the
architecture has been extended (although that's been worked around by
requiring the hint value to allocate into the top address space) or
because they get ported to another architecture (which I believe is the
RiscV issue). There is some argument these applications are buggy but
"we don't break userspace" so we deal with them in kernel until they get
ported and then ideally the bugs are fixed.

Class 1 is the applications we know and love, they don't need anything
special.

Class 2 is the case I care about. The application knows it wants special
addresses, and in the cases I've detailed there has been significant
code written to try to achieve this. But the kernel isn't currently
providing a good mechanism to do this.

>>
>> Another potential option is a mmap3() which actually allows the caller
>> to place constraints on the VA space (e.g. minimum, maximum and
>> alignment). There's plenty of code out there that has to over-allocate
>> and munmap() the unneeded part for alignment reasons. But I don't have a
>> specific need for that, and I'm guessing you wouldn't be in favour.
> 
> You'd probably want control of the direction of the search too.

Very true, and one of the reasons I don't want to do a mmap3() is that
I'm pretty sure I'd miss something.
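
For reference, the over-allocate-and-munmap() pattern for alignment mentioned in the quoted text typically looks something like this sketch. The helper name mmap_aligned() is hypothetical, and it assumes `align` is a power of two.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: return `len` bytes (rounded up to whole pages)
 * aligned to `align`, by mapping too much and trimming both ends. */
void *mmap_aligned(size_t len, size_t align)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	if (len == 0 || align == 0 || (align & (align - 1)) != 0 ||
	    len > SIZE_MAX / 2 || align > SIZE_MAX / 4) {
		errno = EINVAL;
		return MAP_FAILED;
	}
	if (align < page)
		align = page;			/* mappings are page aligned anyway */
	len = (len + page - 1) & ~(page - 1);	/* whole pages only */

	size_t total = len + align;		/* enough slack to find an aligned start */
	void *raw = mmap(NULL, total, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return MAP_FAILED;

	uintptr_t base = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
	size_t head = base - (uintptr_t)raw;	/* unused bytes before the block */
	size_t tail = total - head - len;	/* unused bytes after the block */

	if (head)
		munmap(raw, head);
	if (tail)
		munmap((void *)(base + len), tail);

	return (void *)base;
}
```

Every aligned allocation costs up to three system calls and briefly reserves more VA than it needs, which is the overhead a constraint-aware interface would avoid.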

> I think mmap3() would be difficult to have accepted as well.

And that's the other major reason ;)

Thanks,

Steve

> ...
> 
> Thanks,
> Liam
>