Excerpts from Andy Lutomirski's message of December 29, 2020 10:56 am: > On Mon, Dec 28, 2020 at 4:36 PM Nicholas Piggin <npiggin@xxxxxxxxx> wrote: >> >> Excerpts from Andy Lutomirski's message of December 29, 2020 7:06 am: >> > On Mon, Dec 28, 2020 at 12:32 PM Mathieu Desnoyers >> > <mathieu.desnoyers@xxxxxxxxxxxx> wrote: >> >> >> >> ----- On Dec 28, 2020, at 2:44 PM, Andy Lutomirski luto@xxxxxxxxxx wrote: >> >> >> >> > On Mon, Dec 28, 2020 at 11:09 AM Russell King - ARM Linux admin >> >> > <linux@xxxxxxxxxxxxxxx> wrote: >> >> >> >> >> >> On Mon, Dec 28, 2020 at 07:29:34PM +0100, Jann Horn wrote: >> >> >> > After chatting with rmk about this (but without claiming that any of >> >> >> > this is his opinion), based on the manpage, I think membarrier() >> >> >> > currently doesn't really claim to be synchronizing caches? It just >> >> >> > serializes cores. So arguably if userspace wants to use membarrier() >> >> >> > to synchronize code changes, userspace should first do the code >> >> >> > change, then flush icache as appropriate for the architecture, and >> >> >> > then do the membarrier() to ensure that the old code is unused? >> >> >> >> ^ exactly, yes. >> >> >> >> >> > >> >> >> > For 32-bit arm, rmk pointed out that that would be the cacheflush() >> >> >> > syscall. That might cause you to end up with two IPIs instead of one >> >> >> > in total, but we probably don't care _that_ much about extra IPIs on >> >> >> > 32-bit arm? >> >> >> >> This was the original thinking, yes. The cacheflush IPI will flush specific >> >> regions of code, and the membarrier IPI issues context synchronizing >> >> instructions. >> >> APIs should be written in terms of the service they provide to >> userspace, and in highest level terms as possible, rather than directing >> hardware to do some low level operation. Unfortunately we're stuck with >> this for now. We could deprecate it and replace it though. >> >> If userspace wants to modify code and ensure that after the system call >> returns then no other thread will be executing the previous code, then >> there should be an API for that. It could actually combine the two IPIs >> into one for architectures that require both too. > > I agree. The membarrier API for SYNC_CORE is pretty nasty. I would > much prefer a real API for JITs to use. > >> >> >> >> >> Architectures with coherent i/d caches don't need the cacheflush step. >> > >> > There are different levels of coherency -- VIVT architectures may have >> > differing requirements compared to PIPT, etc. >> > >> > In any case, I feel like the approach taken by the documentation is >> > fundamentally confusing. Architectures don't all speak the same >> > language How about something like: >> > >> > The SYNC_CORE operation causes all threads in the caller's address >> > space (including the caller) to execute an architecture-defined >> > barrier operation. membarrier() will ensure that this barrier is >> > executed at a time such that all data writes done by the calling >> > thread before membarrier() are made visible by the barrier. >> > Additional architecture-dependent cache management operations may be >> > required to use this for JIT code. >> >> As said this isn't what SYNC_CORE does, and it's not what powerpc >> context synchronizing instructions do either, it will very much >> re-order visibility of stores around such an instruction. > > Perhaps the docs should be entirely arch-specific. It may well be > impossible to state what it does in an arch-neutral way. I think what I wrote above -- after the call returns, none of the threads in the process will be executing instructions overwritten previously by the caller (provided their i-caches are made coherent with the caller's modifications). >> A thread completes store instructions into a store queue, which is >> as far as a context synchronizing event goes. Visibility comes at >> some indeterminite time later. > > As currently implemented, it has the same visibility semantics as > regular membarrier, too. Ah I actually missed that SYNC_CORE is in _addition_ to the memory barriers, and that's documented API too, not just implementation sorry. > So if I do: > > a = 1; > membarrier(SYNC_CORE); > b = 1; > > and another thread does: > > while (READ_ONCE(b) != 1) > ; > barrier(); > assert(a == 1); Right that's true, due to the MEMBARRIER_CMD_PRIVATE_EXPEDITED. Neither that barrier or the added SYNC_CORE behaviour imply visibility though. > > then the assertion will pass. Similarly, one can do this, I hope: > > memcpy(codeptr, [some new instructions], len); > arch_dependent_cache_flush(codeptr, len); > membarrier(SYNC_CORE); > ready = 1; > > and another thread does: > > while (READ_ONCE(ready) != 1) > ; > barrier(); > (*codeptr)(); > > arch_dependent_cache_flush is a nop on x86. On arm and arm64, it > appears to be a syscall, although maybe arm64 can do it from > userspace. I still don't know what it is on powerpc. > > Even using the term "cache" here is misleading. x86 chips have all > kinds of barely-documented instruction caches, and they have varying > degrees of coherency. The architecture actually promises that, if you > do a certain incantation, then you get the desired result. > membarrier() allows a user to do this incantation. But trying to > replicate the incantation verbatim on an architecture like ARM is > insufficient, and trying to flush the things that are documented as > being caches on x86 is expensive and a complete waste of time on x86. > When you write some JIT code, you do *not* want to flush it all the > way to main memory, especially on CPUs don't have the ability to write > back invalidating. (That's most CPUs.) > > Even on x86, I suspect that the various decoded insn caches are rather > more coherent than documented, and I have questions in to Intel about > this. No answers yet. > > So perhaps the right approach is to say that membarrier() helps you > perform the architecture-specific sequence of steps needed to safely > modify code. On x86, you use it like this. On arm64, you do this > other thing. On powerpc, you do something else. I think it should certainly be documented in terms of what guarantees it provides to application, _not_ the kinds of instructions it may or may not induce the core to execute. And if existing API can't be re-documented sanely, then deprecatd and new ones added that DTRT. Possibly under a new system call, if arch's like ARM want a range flush and we don't want to expand the multiplexing behaviour of membarrier even more (sigh). > >> >> I would be surprised if x86's serializing instructions were different >> than powerpc. rdtsc ordering or flushing stores to cache would be >> surprising. >> > > At the very least, x86 has several levels of what ARM might call > "context synchronization" AFAICT. STAC, CLAC, and POPF do a form of > context synchronization in that the changes they cause to the MMU take > effect immediately, but they are not documented as synchronizing the > instruction stream. "Serializing" instructions do all kinds of > things, not all of which may be architecturally visible at all. > MFENCE and LFENCE do various complicated things, and LFENCE has magic > retroactive capabilities on old CPUs that were not documented when > those CPUs were released. SFENCE does a different form of > synchronization entirely. LOCK does something else. (The > relationship between LOCK and MFENCE is confusing at best.) RDTSC > doesn't serialize anything at all, but RDTSCP does provide a form of > serialization that's kind of ilke LFENCE. It's a mess. Even the > manuals are inconsistent about what "serialize" means. JMP has its > own magic on x86, but only on very very old CPUs. > > The specific instruction that flushes everything into the coherency > domain is SFENCE, and SFENCE is not, for normal purposes, needed for > self- or cross-modifying code. > Good reason to avoid such language in the system call interface! Thanks, Nick