Re: JITs and 52-bit VA

On Wed, Jun 22, 2016 at 7:53 AM, Christopher Covington
<cov@xxxxxxxxxxxxxx> wrote:
> +Andy, Cyrill, Dmitry who have been discussing variable TASK_SIZE on x86
> on linux-mm
>
> http://marc.info/?l=linux-mm&m=146290118818484&w=2
>
>>>> On 04/28/2016 09:00 AM, Maxim Kuvyrkov wrote:
>>>>> This is a summary of discussions we had on IRC between kernel and
>>>>> toolchain engineers regarding support for JITs and 52-bit virtual
>>>>> address space (mostly in the context of LuaJIT, but this concerns other
>>>>> JITs too).
>>>>>
>>>>> The summary is that we need to consider ways of reducing the size of
>>>>> VA for a given process or container on a Linux system.
>>>>>
>>>>> The high-level problem is that JITs tend to use upper bits of
>>>>> addresses to encode various pieces of data, and that the number of
>>>>> available bits is shrinking due to VA size increasing. With the usual
>>>>> 42-bit VA (which is what most JITs assume) they have 22 bits to encode
>>>>> various performance-critical data. With 48-bit VA (e.g., ThunderX world)
>>>>> things start to get complicated, and JITs need to be non-trivially
>>>>> patched at the source level to continue working with fewer bits
>>>>> available for their performance-critical storage. With the upcoming
>>>>> 52-bit VA, things might get dire enough for some JITs to declare
>>>>> such configurations unsupported.
>>>>>
>>>>> On the other hand, most JITs are not expected to require terabytes
>>>>> of RAM and huge VAs for their applications. Most JIT applications will
>>>>> happily live in a 42-bit world with the mere 4 terabytes of RAM it
>>>>> provides. Therefore, what JITs need in the modern world is a way to make
>>>>> mmap() return addresses below a certain threshold, and error out with
>>>>> ENOMEM when "lower" memory is exhausted. This is very similar to
>>>>> ADDR_LIMIT_32BIT personality, but extended to common VA sizes on 64-bit
>>>>> systems: 39-bit, 42-bit, 48-bit, 52-bit, etc.
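
Absent such an interface, about the best a JIT can do today is pass a
hint address and verify the result, along these lines (a sketch; the
limit and hint values are illustrative):

#include <errno.h>
#include <stdint.h>
#include <sys/mman.h>

#define JIT_VA_LIMIT  (1ULL << 42)    /* illustrative 42-bit ceiling */

/* Best-effort allocation below JIT_VA_LIMIT: pass a hint below the
 * limit and reject whatever the kernel places above it.  Only a
 * heuristic -- a real ADDR_LIMIT-style knob would be reliable. */
static void *mmap_below(size_t len)
{
        void *hint = (void *)(uintptr_t)(JIT_VA_LIMIT / 2);
        void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        if ((uintptr_t)p + len > JIT_VA_LIMIT) {
                munmap(p, len);
                errno = ENOMEM;
                return NULL;
        }
        return p;
}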
>>>>>
>>>>> Since we do not want to penalize the whole system (by using an
>>>>> artificially small VA), it would be best to have a way to enable the
>>>>> VA limit on a per-process basis (similar to the ADDR_LIMIT_32BIT
>>>>> personality). If that's not possible -- then on a per-container /
>>>>> cgroup basis. If that's not possible -- then at the system level
>>>>> (similar to vm.mmap_min_addr, but from the other end).
>>>>>
>>>>> Dear kernel people, what can be done to address the JITs' need to
>>>>> reduce the effective VA size?
>
>>> On 04/28/2016 09:17 AM, Arnd Bergmann wrote:
>>>> Thanks for the summary, now it all makes much more sense.
>>>>
>>>> One simple (from the kernel's perspective, not from the JIT) approach
>>>> might be to always use MAP_FIXED whenever an allocation is made for
>>>> memory that needs these special pointers, and then manage the available
>>>> address space explicitly. Would that work, or do you require everything
>>>> including the binary itself to be below that limit?
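
One possible reading of that suggestion, as a sketch: reserve a large
region once, then carve it up with MAP_FIXED (the pool size and hint
are illustrative):

#include <stdint.h>
#include <sys/mman.h>

#define POOL_SIZE  (1ULL << 32)       /* illustrative 4 GiB pool */

static char *pool_base, *pool_next;

/* Reserve the whole region once (PROT_NONE, so it costs only VA),
 * steering it low with a hint as in the earlier sketch. */
int pool_init(void)
{
        pool_base = mmap((void *)(1UL << 40), POOL_SIZE, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (pool_base == MAP_FAILED)
                return -1;
        pool_next = pool_base;
        return 0;
}

/* Carve pieces out of the reservation.  MAP_FIXED is safe here
 * because it only ever targets addresses we already own. */
void *pool_alloc(size_t len)          /* len must be page-aligned */
{
        void *p;

        if (pool_next + len > pool_base + POOL_SIZE)
                return NULL;          /* "lower" memory exhausted */
        p = mmap(pool_next, len, PROT_READ | PROT_WRITE,
                 MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return NULL;
        pool_next += len;
        return p;
}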
>>>>
>>>> Regarding which memory sizes are needed, my impression from your
>>>> explanation is that a single personality flag (e.g. ADDR_LIMIT_42BIT)
>>>> would be sufficient for the usecase, and you don't actually need to
>>>> tie this to the architecture-provided virtual addressing limits
>>>> at all. If it's only one such flag, we can probably find a way to fit
>>>> it into the personality flags, though ironically we are actually
>>>> running out of bits in there as well.
>
>> On 04/28/2016 09:24 AM, Peter Maydell wrote:
>>> The trouble IME with this idea is that in practice you're
>>> linking with glibc, which means glibc is managing (and using)
>>> the address space, not the JIT. So MAP_FIXED is pretty awkward
>>> to use.
>
> On 04/28/2016 03:27 PM, Steve Capper wrote:
>> One can find holes in the VA space by examining /proc/self/maps, so
>> suitable addresses for MAP_FIXED can be deduced.
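
A minimal sketch of that scan (which also shows why Peter's concern
applies -- the result is inherently racy):

#include <stdio.h>

/* Scan /proc/self/maps for the first gap of at least `len` bytes
 * below `limit`, usable as a MAP_FIXED target.  Note the race:
 * glibc (or another thread) may grab the hole between this scan
 * and the subsequent mmap() call. */
static unsigned long find_hole(unsigned long len, unsigned long limit)
{
        unsigned long start, end, prev_end = 0x10000; /* mmap_min_addr */
        FILE *f = fopen("/proc/self/maps", "r");

        if (!f)
                return 0;
        while (fscanf(f, "%lx-%lx%*[^\n]", &start, &end) == 2) {
                if (start >= limit)
                        break;
                if (start - prev_end >= len) {
                        fclose(f);
                        return prev_end;
                }
                prev_end = end;
        }
        fclose(f);
        return 0;                     /* no hole found */
}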
>>
>> The other problem is, as Arnd alluded to, if a JIT'ed object then needs
>> to refer to something allocated outside of the JIT. This could be
>> remedied by another level of indirection/trampoline.
>>
>> Taking two steps back though, I would view VA space squeezing as a
>> stop-gap before removing tags from the upper bits of a pointer
>> altogether (tagging the bottom bits, by controlling alignment, is
>> perfectly safe). The larger the VA space, the more scope mechanisms
>> such as Address Space Layout Randomisation have to improve security.
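
For comparison with the high-bit sketch above, bottom-bit tagging via
alignment looks like this (tag width illustrative):

#include <stdint.h>
#include <stdlib.h>

/* Force 16-byte alignment so the low 4 bits of every pointer are
 * zero and can carry a tag.  Unlike high-bit tags, this keeps
 * working no matter how large the VA grows. */
#define TAG_BITS  4
#define TAG_MASK  ((uintptr_t)((1u << TAG_BITS) - 1))

static inline uintptr_t tag_ptr(void *p, unsigned tag)
{
        return (uintptr_t)p | (tag & TAG_MASK);
}

static inline void *untag_ptr(uintptr_t v)
{
        return (void *)(v & ~TAG_MASK);
}

/* Usage: void *obj = aligned_alloc(1u << TAG_BITS, size);
 * (size rounded up to a multiple of the alignment) */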
>
> I was working on an (AArch64-specific) auxiliary vector entry to export
> TASK_SIZE to userspace at exec time. The goal was to allow for more
> elegant, robust, and efficient replacements for the following changes:
>
> https://hg.mozilla.org/integration/mozilla-inbound/rev/dfaafbaaa291
>
> https://github.com/xemul/criu/commit/c0c0546c31e6df4932669f4740197bb830a24c8d
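
(For reference, userspace consumption of such an entry would be
trivial; AT_TASK_SIZE below is the hypothetical, never-merged tag,
and 47 is a placeholder value -- getauxval() itself is standard
glibc:)

#include <stdio.h>
#include <sys/auxv.h>

#define AT_TASK_SIZE  47              /* hypothetical aux vector tag */

int main(void)
{
        unsigned long ts = getauxval(AT_TASK_SIZE);

        if (ts)
                printf("TASK_SIZE = %#lx\n", ts);
        return 0;
}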
>
> However based on the above discussion, it appears that some sort of
> prctl(PR_GET_TASK_SIZE, ...) and prctl(PR_SET_TASK_SIZE, ...) may be
> preferable for AArch64. (And perhaps other justifications for the new
> calls would influence the x86 decisions.) What do folks think?

I would advocate a slightly different approach:

 - Keep TASK_SIZE either unconditionally matching the hardware, or
keep it as the actual logical split between user and kernel
addresses.  Don't let it change at runtime under any circumstances.
The reason is that plenty of bugs and overcomplications have resulted
from letting it vary.  For example, if
(addr < TASK_SIZE) really ought to be the correct check (assuming
USER_DS, anyway) for whether dereferencing addr will access user
memory, at least on architectures with a global address space (which
is most of them, I think).

 - If needed, introduce a clean concept of the maximum address that
mmap will return, but don't call it TASK_SIZE.  So, if a user program
wants to limit itself to less than the full hardware VA space (or less
than 63 bits, for that matter), it can.

As an example, a 32-bit x86 program really could have something mapped
above the 32-bit boundary.  It just wouldn't be useful, but the kernel
should still understand that it's *user* memory.

So you'd have PR_SET_MMAP_LIMIT and PR_GET_MMAP_LIMIT or similar instead.
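
A userspace sketch of what I have in mind (both constants are
hypothetical -- nothing like them exists in any kernel today, and the
values are placeholders):

#include <sys/prctl.h>

#define PR_SET_MMAP_LIMIT  60         /* hypothetical */
#define PR_GET_MMAP_LIMIT  61         /* hypothetical */

static int jit_restrict_va(void)
{
        /* Ask the kernel to stop handing out mmap() addresses at
         * 2^42 for this process only; TASK_SIZE itself -- and hence
         * the user/kernel split -- is untouched. */
        return prctl(PR_SET_MMAP_LIMIT, 1UL << 42, 0, 0, 0);
}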

Also, before getting *too* excited about this kind of VA limit, keep
in mind that SPARC has invented this thing called "Application Data
Integrity".  It reuses some of the high address bits in hardware for
other purposes.  I wouldn't be totally shocked if other architectures
followed suit.  (Although no one should copy SPARC's tagging scheme,
please: it's awful.  These things should be controlled at the MMU
level, not at the cache tag level.  Otherwise aliased mappings get
very confused.)

--Andy



