Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

"Andy Lutomirski" <luto@xxxxxxxxxx> · Tue, 20 Jun 2023 10:24:29 -0700

On Mon, Jun 19, 2023, at 1:18 PM, Nadav Amit wrote:
>> On Jun 19, 2023, at 10:09 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> 
>> But jit_text_alloc() can't do this, because the order of operations doesn't match.  With jit_text_alloc(), the executable mapping shows up before the text is populated, so there is no atomic change from not-there to populated-and-executable.  Which means that there is an opportunity for CPUs, speculatively or otherwise, to start filling various caches with intermediate states of the text, which means that various architectures (even x86!) may need serialization.
>> 
>> For eBPF- and module- like use cases, where JITting/code gen is quite coarse-grained, perhaps something vaguely like:
>> 
>> jit_text_alloc() -> returns a handle and an executable virtual address, but does *not* map it there
>> jit_text_write() -> write to that handle
>> jit_text_map() -> map it and synchronize if needed (no sync needed on x86, I think)
>
> Andy, would you mind explaining why you think a sync is not needed? I 
> mean I have a “feeling” that perhaps TSO can guarantee something based 
> on the order of write and page-table update. Is that the argument?

Sorry, when I say "no sync" I mean no cross-CPU synchronization.  I'm assuming the underlying sequence of events is:

allocate physical pages (jit_text_alloc)

write to them (with MOV, memcpy, whatever), via the direct map or via a temporary mm

do an appropriate *local* barrier (which, on x86, is probably implied by TSO, as the subsequent pagetable change is at least a release; also, any previous temporary mm stuff would have done MOV CR3 afterwards, which is a full "serializing" barrier)

optionally zap the direct map via IPI, assuming the pages are direct mapped (but this could be avoided with a smart enough allocator and temporary_mm above)

install the final RX PTE (jit_text_map), which does a MOV or maybe a LOCK CMPXCHG16B.  Note that the virtual address in question was not readable or executable before this, and all CPUs have serialized since the last time it was executable.

either jump to the new text locally, or:

1. Do a store-release to tell other CPUs that the text is mapped
2. Other CPU does a load-acquire to detect that the text is mapped and jumps to the text

This is all approximately the same thing that plain old mmap(..., PROT_EXEC, ...) does.
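
To make the ordering concrete, here is a rough userspace analogy of the sequence above.  It is a sketch under assumptions, not kernel code and not the proposed jit_text_*() API: populate_map_publish(), consumer(), published_text and the "jit-sketch" name are made up for illustration, memfd_create() stands in for "a handle to pages that are not mapped anywhere yet", and the payload bytes are x86-64 only (other architectures just check the bytes through the RX mapping).  The acquire/release pair only models the data-side handshake; architectures without coherent instruction fetch may need extra cache maintenance that this does not show.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* x86-64 machine code for "mov eax, 42; ret" (illustrative payload). */
static const unsigned char stub[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

/* Address of the mapped text; NULL means "not mapped yet". */
static _Atomic(void *) published_text;

static void *consumer(void *arg)
{
	int *result = arg;
	void *text;

	/* Load-acquire: wait until the producer says the text is mapped. */
	while (!(text = atomic_load_explicit(&published_text,
					     memory_order_acquire)))
		sched_yield();

	/* The RX mapping must already show the fully populated bytes. */
	if (memcmp(text, stub, sizeof(stub)) != 0)
		return NULL;
#if defined(__x86_64__)
	/* Only jump to the bytes on the architecture they encode. */
	if (((int (*)(void))text)() != 42)
		return NULL;
#endif
	*result = 0;
	return NULL;
}

/* Returns 0 if the consumer saw (and, on x86-64, ran) the text. */
int populate_map_publish(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	pthread_t thread;
	void *text;
	int result = -1;
	int fd;

	atomic_store_explicit(&published_text, NULL, memory_order_relaxed);

	/* "alloc": a handle to memory that is not mapped anywhere. */
	fd = memfd_create("jit-sketch", 0);
	if (fd < 0)
		return -1;

	/* "write": populate through the handle, not through the mapping. */
	if (ftruncate(fd, (off_t)len) != 0 ||
	    pwrite(fd, stub, sizeof(stub), 0) != (ssize_t)sizeof(stub)) {
		close(fd);
		return -1;
	}

	/* "map": the address becomes readable and executable only now. */
	text = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
	close(fd);
	if (text == MAP_FAILED)
		return -1;

	if (pthread_create(&thread, NULL, consumer, &result) != 0) {
		munmap(text, len);
		return -1;
	}

	/* Store-release: tell the other thread that the text is mapped. */
	atomic_store_explicit(&published_text, text, memory_order_release);

	pthread_join(thread, NULL);
	munmap(text, len);
	return result;
}
```

Calling populate_map_publish() from main() should return 0 when memfd_create() and the PROT_EXEC mapping are both permitted.  The point is that the address never exists as an executable mapping with half-written contents: the bytes go in through the handle first, and the RX mapping appears afterwards.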

>
> On this regard, one thing that I clearly do not understand is why 
> *today* it is ok for users of bpf_arch_text_copy() not to call 
> text_poke_sync(). Am I missing something?

I cannot explain this, because I suspect the current code is wrong.  But it's only wrong across CPUs, because bpf_arch_text_copy() goes through text_poke_copy(), which calls unuse_temporary_mm(), which is serializing.  And it's plausible that most eBPF use cases don't actually cause the loaded program to get used on a different CPU without first serializing on the CPU that ends up using it.  (Context switches and interrupts are serializing.)

FRED could make interrupts non-serializing. I sincerely hope that FRED doesn't cause this all to fall apart.

--Andy