----- On Dec 13, 2021, at 3:12 PM, Florian Weimer fweimer@xxxxxxxxxx wrote: > * Mathieu Desnoyers: > >> ----- On Dec 13, 2021, at 2:29 PM, Florian Weimer fweimer@xxxxxxxxxx wrote: >> >>> * Mathieu Desnoyers: >>> >>>>> Could it fall back to >>>>> MEMBARRIER_CMD_GLOBAL instead? >>>> >>>> No. CMD_GLOBAL does not issue the required rseq fence used by the >>>> algorithm discussed. Also, CMD_GLOBAL has quite a few other shortcomings: >>>> it takes a while to execute, and is incompatible with nohz_full kernels. >>> >>> What about using sched_setcpu to move the current thread to the same CPU >>> (and move it back afterwards)? Surely that implies the required sort of >>> rseq barrier that MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ with >>> MEMBARRIER_CMD_FLAG_CPU performs? >> >> I guess you refer to using sched_setaffinity(2) there ? There are various >> reasons why this may fail. For one, the affinity mask is a shared global >> resource which can be changed by external applications. > > So is process memory … Fair point. > >> Also, setting the affinity is really just a hint. In the presence of >> cpu hotplug and or cgroup cpuset, it is known to lead to situations >> where the kernel just gives up and provides an affinity mask including >> all CPUs. > > How does CPU hotplug impact this negatively? It may be OK for the rseq fence use-case specifically, but in general relying on cpu affinity to "pin" to a specific CPU is problematic with a hotplug scenario like this: - Userspace thread sets affinity to CPU 3 (only) - echo 0 > /sys/devices/system/cpu/cpu3/online (as root) -> scheduler will hit: select_fallback_rq(): if (cpuset_cpus_allowed_fallback(p)) { -> false do_set_cpus_allowed(p, task_cpu_possible_mask(p)); thus setting the cpus allowed mask to "any of the possible cpus". This can be confirmed by doing "cat /proc/${pid}/status | grep Cpus_allowed_list:" before/after unplugging cpu 3. (side-note: in my test, the target application was "sleep 5000", which never gets picked by the scheduler unless we force some activity on it by delivering a signal. I used a SIGSTOP/SIGCONT.): before: Cpus_allowed_list: 3 after: Cpus_allowed_list: 0-3 > > The cgroup cpuset issue clearly is a bug. For cgroupv2, there are cpuset.cpus (invariant wrt hotplug), cpuset.cpus.effective (affected by hotplug) and cpuset.cpus.partition (takes away from parent effective cpuset, invariant after creation). cgroup controllers can be either threaded controllers or domain controllers. Unfortunately cpuset is a threaded controller, which means each thread can have its own cgroup cpuset. I do not have a full understanding of the interaction between sched_setaffinity and concurrent changes to the cgroup cpuset, but I am concerned that scenarios where affinity is first "pinned" to a specific cpu, and then an external process manager changes the cpuset.cpus mask to exclude that cpu may cause issues. I am also concerned for the rseq fence use-case (done with explicit sched_setaffinity) about what would happen if threads belong to different cgroup cpusets with threaded controllers. There we may have situations where a thread fails to run on a specific CPU just because it is not part of its cpuset, but another thread within the same process successfully runs there while executing an rseq critical section. > >> Therefore, using sched_setaffinity() and expecting to be pinned to >> a specific CPU for correctness purposes seems brittle. > > I'm pretty sure it used to work reliably for some forms of concurrency > control. That being said, it may be OK for the specific case of an rseq-fence, considering that if we affine to CPU A, and later discover that we run anywhere except on CPU A while we explicitly requested to be pinned to that CPU, this means the kernel had to take action and move us away from CPU A's runqueue because we are not allowed to run there. So we could consider that this CPU is "quiescent" in terms of rseq because no other thread belonging to our process runs there. This appears to work only for cpusets applying to the entire process though, not for threaded cpusets. > >> But _if_ we'd have something like a sched_setaffinity which we can >> trust, yes, temporarily migrating to the target CPU, and observing that >> we indeed run there, would AFAIU provide the same guarantee as the rseq >> fence provided by membarrier. It would have a higher overhead than >> membarrier as well. > > Presumably a signal could do it as well. Fair point, but then you would have to send a signal to every thread, and wait for each signal handler to have executed. membarrier improves on this kind of scheme by integrating with the scheduler, and leverage its knowledge of which thread is actively running or not. Therefore, if a thread is not running, there is no need to awaken it. This makes a huge performance difference for heavily multi-threaded applications. > >>> That is possible even without membarrier, so I wonder why registration >>> of intent is needed for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ. >> >> I would answer that it is not possible to do this _reliably_ today >> without membarrier (see above discussion of cpu hotplug, cgroups, and >> modification of cpu affinity by external processes). >> >> AFAIR, registration of intent for MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ >> is mainly there to provide a programming model similar to private expedited >> plain and core-sync cmds. >> >> The registration of intent allows the kernel to further tweak what is >> done internally and make tradeoffs which only impact applications >> performing the registration. > > But if there is no strong performance argument to do so, this introduces > additional complexity into userspace. Surely we could say we just do > MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ at process start and document > failure (in case of seccomp etc.), but then why do this at all? There are many performance gains we can get by having membarrier-expedited-rseq registered. Some of those use-cases may be doable either by sending signals to all threads, or by doing cpu affinity tricks, but using membarrier is much more lightweight thanks to its integration with the Linux scheduler. When a thread is not running, there is really no need to awaken it. In terms of use-cases, the rseq-fence is a compelling use-case enabling various algorithms with rseq. Other use-cases involve the "plain" memory barrier capabilities of membarrier. This generally allow turning algorithms that require pairing memory barrier instructions on fast and slow paths into even faster fast-path, by pairing compiler barriers (asm memory clobber) on the fast paths with membarrier system calls on the slow paths. Finally, other use-cases involves the SYNC_CORE membarrier. This is mainly for JITs, allowing them to issue a process-wide "fence" allowing them to re-use memory after reclaim of JITted code. In terms of overhead added into the process when membarrier-expedited is registered, only specific cases are affected: - SYNC_CORE: processes which have registered membarrier expedited sync-core will issue sync_core_before_usermode() after each scheduling between threads belonging to different processes (see membarrier_mm_sync_core_before_usermode). It is a no-op for all architectures except x86, which implements its return to user-space with sysexit, sysrel and sysretq, which are not core serializing. Because of the runtime overhead of the sync-core registration on x86, I would recommend that only JITs register with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE. - Plain memory barrier and RSEQ: Registering those adds no overhead except on powerpc (see membarrier_arch_switch_mm). There, when context switching between two user-space processes, an additional memory barrier is needed because it is not implicitly issued by the architecture switch_mm. I expect that the impact of this runtime overhead will be much more limited than for the SYNC_CORE. Therefore having glibc auto-register MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ would make sense considering the fast-path improvements this enables. All of the expedited membarrier commands issue inter-processor interrupts (IPIs) to CPUs running threads belonging to the same process. This may be unexpected for hard-real-time applications, so this may be something they will want to opt-out from with a tunable. There are also the "global-expedited" membarrier commands, which are done to deal with shared memory across processes. There, the processes wishing to receive the IPIs need to be registered explicitly. This ensures we don't disrupt other hard-real-time processes with unexpected IPIs. The processes registered for global-expedited membarrier also have the same overhead discussed above for plain/rseq membarrier registration on powerpc. I do not expect the global-expedited registration to be done automatically, it should really be opt-in by the applications/libraries requiring membarrier to interact with other processes across shared memory. > >>>> In order to make sure the programming model is the same for expedited >>>> private/global plain/sync-core/rseq membarrier commands, we require that >>>> each process perform a registration beforehand. >>> >>> Hmm. At least it's not possible to unregister again. >>> >>> But I think it would be really useful to have some of these barriers >>> available without registration, possibly in a more expensive form. >> >> What would be wrong with doing a membarrier private-expedited-rseq >> registration on libc startup, and exposing a glibc tunable to allow >> disabling this ? > > The configurations that need to be supported go from “no rseq“/“rseq” > to “no rseq“/“rseq”/“rseq with membarrier”. Everyone now needs to > think about implementing support for all three instead just the obvious > two. One thing to keep in mind is that within the Linux kernel, CONFIG_RSEQ always selects CONFIG_MEMBARRIER. I've done this on purpose to simplify the user-space programming model. Therefore, if the rseq system call is implemented, membarrier is available, unless it's forbidden by seccomp. However, MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ only appears in kernel v5.10. This means that starting from kernel v5.10, glibc can rely on having both rseq and membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ available, or don't bother to do any of the registration. This would simplify the programming model from a user perspective. If glibc registers rseq, this guarantees that MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ is available. You can check for rseq availability with e.g.: int rseq_available(void) { int rc; rc = sys_rseq(NULL, 0, 0, 0); if (rc != -1) abort(); switch (errno) { case ENOSYS: return 0; case EINVAL: return 1; default: abort(); } } and check for membarrier MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ availability by inspecting the mask returned by MEMBARRIER_CMD_QUERY, e.g.: int status; status = sys_membarrier(MEMBARRIER_CMD_QUERY, 0); if (status < 0 || !(status & MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ)) return false; else return true; I guess it all depends on how much you care about registering rseq on kernels between 4.18 and 5.9 inclusively. Thanks, Mathieu > > Thanks, > Florian -- Mathieu Desnoyers EfficiOS Inc. http://www.efficios.com