On 17-Jun-2021 02:51:33 PM, Mark Rutland wrote:
> On Thu, Jun 17, 2021 at 06:41:41AM -0700, Andy Lutomirski wrote:
> >
> >
> > On Thu, Jun 17, 2021, at 4:33 AM, Mark Rutland wrote:
> > > On Thu, Jun 17, 2021 at 12:23:05PM +0100, Russell King (Oracle) wrote:
> > > > On Thu, Jun 17, 2021 at 11:40:46AM +0100, Mark Rutland wrote:
> > > > > On Tue, Jun 15, 2021 at 08:21:12PM -0700, Andy Lutomirski wrote:
> > > > > > On arm32, the only way to safely flush icache from usermode is to call
> > > > > > cacheflush(2). This also handles any required pipeline flushes, so
> > > > > > membarrier's SYNC_CORE feature is useless on arm. Remove it.
> > > > >
> > > > > Unfortunately, it's a bit more complicated than that, and these days
> > > > > SYNC_CORE is equally necessary on arm as on arm64. This is something
> > > > > that changed in the architecture over time, but since ARMv7 we generally
> > > > > need both the cache maintenance *and* a context synchronization event
> > > > > (the latter must occur on the CPU which will execute the instructions).
> > > > >
> > > > > If you look at the latest ARMv7-AR manual (ARM DDI 406C.d), section
> > > > > A3.5.4 "Concurrent modification and execution of instructions" covers
> > > > > this. That manual can be found at:
> > > > >
> > > > >         https://developer.arm.com/documentation/ddi0406/latest/
> > > >
> > > > Looking at that, sys_cacheflush() meets this. The manual details a
> > > > series of cache maintenance calls in "step 1" that the modifying thread
> > > > must issue - this is exactly what sys_cacheflush() does. The same is
> > > > true for ARMv6, except the "ISB" terminology is replaced by a
> > > > "PrefetchFlush" terminology. (I checked DDI0100I).
> > > >
> > > > "step 2" requires an ISB on the "other CPU" prior to executing that
> > > > code. As I understand it, in ARMv7, userspace can issue an ISB itself.
> > > >
> > > > For ARMv6K, it doesn't have ISB, but instead has a CP15 instruction
> > > > for this that isn't available to userspace. This is where we come to
> > > > the situation about ARM 11MPCore, and whether we continue to support
> > > > it or not.
> > > >
> > > > So, I think we're completely fine with ARMv7 under 32-bit ARM kernels
> > > > as userspace has everything that's required. ARMv6K is a different
> > > > matter as we've already identified for several reasons.
> > >
> > > Sure, and I agree we should not change cacheflush().
> > >
> > > The point of membarrier(SYNC_CORE) is that you can move the cost of that
> > > ISB out of the fast-path in the executing thread(s) and into the
> > > slow-path on the thread which generated the code.
> > >
> > > So e.g. rather than an executing thread always having to do:
> > >
> > >         LDR <reg>, [<funcptr>]
> > >         ISB     // in case funcptr was just updated
> > >         BLR <reg>
> > >
> > > ... you have the thread generating the code use membarrier(SYNC_CORE)
> > > prior to publishing the funcptr, and the fast-path on all the executing
> > > threads can be:
> > >
> > >         LDR <reg>, [<funcptr>]
> > >         BLR <reg>
> > >
> > > ... and thus I think we still want membarrier(SYNC_CORE) so that people
> > > can do this, even if there are other means to achieve the same
> > > functionality.
> >
> > I had the impression that sys_cacheflush() did that. Am I wrong?
>
> Currently sys_cacheflush() doesn't do this, and IIUC it has never done
> remote context synchronization even for architectures that need that
> (e.g. x86 requiring a serializing instruction).
>
> > In any event, I’m even more convinced that no new SYNC_CORE arches
> > should be added. We need a new API that just does the right thing.
>
> My intuition is the other way around, and that this is a generally
> useful thing for architectures that require context synchronization.
>
> It's not clear to me what "the right thing" would mean specifically, and
> on architectures with userspace cache maintenance JITs can usually do
> the most optimal maintenance, and only need help for the context
> synchronization.

If I can attempt to summarize the current situation for ARMv7:

- In addition to the cache flushing on the core doing the code update,
  the architecture requires every core to perform a context synchronizing
  instruction before executing the updated code.

- sys_cacheflush() doesn't do this core sync on every core. It also takes
  a single address range as parameter.

- ARM, ARM64, powerpc, powerpc64, x86, x86-64 all currently handle the
  context synchronization requirement for updating user-space code on
  SMP with sys_membarrier SYNC_CORE. It's not, however, meant to replace
  explicit cache flushing operations if those are needed.

So removing membarrier SYNC_CORE from ARM would be a step backward here.
On ARMv7, the SYNC_CORE is needed _in addition_ to sys_cacheflush.

Adding a sync-core operation at the end of sys_cacheflush would be
inefficient for common GC use-cases where a rather large set of address
ranges is invalidated in one go. For this, we want either:

- the GC to invoke sys_cacheflush for each targeted range, and then issue
  a single sys_membarrier SYNC_CORE, or

- a new "sys_cacheflush_iov" which takes an iovec input. There I see
  that it could indeed invalidate all relevant cache lines *and* issue
  the SYNC_CORE at the end.

But shoehorning the SYNC_CORE into the pre-existing sys_cacheflush after
the fact seems like a bad idea.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
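
For illustration only, the "flush each range, then issue a single
SYNC_CORE" pattern discussed above might look roughly like the userspace
sketch below. It is a sketch under assumptions, not code from this
thread: struct code_range and the three function names are hypothetical,
and it assumes that the compiler builtin __builtin___clear_cache() is
how the process reaches the cache maintenance (on 32-bit ARM Linux it is
typically implemented on top of the cacheflush system call). The
membarrier commands are the ones defined in <linux/membarrier.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical descriptor for one region of freshly written code. */
struct code_range {
    char *start;
    char *end;   /* one past the last byte */
};

/* Thin wrapper around the raw system call; returns 0 or -1 with errno set. */
static int membarrier_call(int cmd)
{
    return (int)syscall(SYS_membarrier, cmd, 0U, 0);
}

/* Must succeed once per process before the SYNC_CORE command can be used. */
int register_sync_core(void)
{
    return membarrier_call(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE);
}

/* Cache maintenance only, one call per range; returns the ranges handled. */
size_t flush_ranges(const struct code_range *ranges, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        assert(ranges[i].start <= ranges[i].end);
        __builtin___clear_cache(ranges[i].start, ranges[i].end);
    }
    return i;
}

/*
 * Flush every range, then request one context synchronization on all
 * cores running threads of this process. Returns 0 on success, -1 if
 * the kernel rejects the command (e.g. not registered or unsupported).
 */
int flush_ranges_then_sync(const struct code_range *ranges, size_t n)
{
    flush_ranges(ranges, n);
    return membarrier_call(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE);
}
```

In this sketch the code generator would call register_sync_core() once
at startup and flush_ranges_then_sync() before publishing any pointer to
the new code, so that executing threads need no ISB on their fast path.
A caller should check the return values, since the SYNC_CORE commands
are not available on every kernel or architecture.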