----- On Jan 30, 2020, at 6:10 AM, Florian Weimer fweimer@xxxxxxxxxx wrote:

> * Mathieu Desnoyers:
>
>> It brings an interesting idea to the table though. Let's assume for now that
>> the only intended use of pin_on_cpu(2) would be to allow rseq(2) critical
>> sections to update per-cpu data on specific cpu number targets. In fact,
>> considering that userspace can be preempted at any point, we still need a
>> mechanism to guarantee atomicity with respect to other threads running on
>> the same runqueue, which rseq(2) provides. Therefore, that assumption does
>> not appear too far-fetched.
>>
>> There are 2 scenarios we need to consider here:
>>
>> A) pin_on_cpu(2) targets a CPU which is not part of the affinity mask.
>>
>> This case is easy: pin_on_cpu can return an error, and the caller needs to act
>> accordingly (e.g. figure out that this is a design error and report it, or
>> decide that it really did not want to touch that per-cpu data that badly and
>> make the entire process fall-back to a mechanism which does not use per-cpu
>> data at all from that point onwards)
>
> Affinity masks currently are not like process memory: there is an
> expectation that they can be altered from outside the process.

Yes, that's my main issue.

> Given that the caller may not have any ways to recover from the
> suggested pin_on_cpu behavior, that seems problematic.

Indeed.

>
> What I would expect is that if pin_on_cpu cannot achieve implied
> exclusion by running on the associated CPU, it acquires a lock that
> prevents others pin_on_cpu calls from entering the critical section, and
> tasks in the same task group from running on that CPU (if the CPU
> becomes available to the task group). The second part should maintain
> exclusion of rseq sequences even if their fast path is not changed.

I try to avoid mutual exclusion over shared memory as rseq fallback whenever
I can, so we can use rseq from lock-free algorithms without losing
lock-freedom.
> (On the other hand, I'm worried that per-CPU data structures are a dead
> end for user space unless we get containerized affinity masks, so that
> containers only see resources that are actually available to them.)

I'm currently implementing a prototype of the following ideas, and I'm
curious to read your thoughts on those:

I'm adding an "affinity_pinned" flag to the task struct of each thread. It can
be set and cleared only by the owner thread through pin_on_cpu syscall
commands. When the affinity is pinned by a thread, trying to change its
affinity (from an external thread, or possibly from itself) will fail.

Whenever a thread would (temporarily) pin itself on a specific CPU, it would
also pin its affinity mask as a side-effect. When a thread unpins from a CPU,
the affinity mask stays pinned. The purpose of keeping this affinity pinned
state per-thread is to ensure we don't end up with tiny race windows where
changing the thread's affinity mask "typically" works, but fails once in a
while because it's done concurrently with a 1ms long cpu pinning. This would
lead to flaky code, and I try hard to avoid that.

How changing this affinity should fail (from sched_setaffinity and cpusets) is
a big unanswered question. I see two major alternatives so far:

1) We deliver a signal to the target thread (SIGKILL? SIGSEGV?), considering
   that failure to be able to change its affinity mask means we need to send a
   signal. How exactly the killed application would recover (or whether it
   should) is still unclear.

2) Return an error to the sched_setaffinity or cpusets caller, and let it deal
   with the error as it sees fit: ignore it, log it, or send a signal.

I think option (2) provides the most flexibility, and moves policy outside of
the kernel, which is a good thing. However, looking at how cpusets seems to
simply ignore errors when setting a task's cpumask, I wonder if asking cpusets
to handle any kind of error is asking too much.
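For illustration only, the flag semantics described above combined with
option (2) can be modelled in plain user-space C. This is a sketch under
stated assumptions, not the prototype: struct task_model, every model_*()
function, the 64-CPU mask, the -EINVAL/-EBUSY error codes, and the rule that
the affinity pin cannot be cleared while still pinned on a CPU are all
hypothetical and do not come from the proposal.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of the per-thread state; not kernel code. */
struct task_model {
	uint64_t cpus_allowed;	/* bit n set: CPU n is in the affinity mask */
	int pinned_cpu;		/* CPU the thread is pinned on, or -1 */
	bool affinity_pinned;	/* stays set until the owner clears it */
};

/* Owner thread pins itself on @cpu. */
int model_pin_on_cpu(struct task_model *t, int cpu)
{
	if (cpu < 0 || cpu >= 64 || !(t->cpus_allowed & (UINT64_C(1) << cpu)))
		return -EINVAL;	/* scenario A; error code is an assumption */
	t->pinned_cpu = cpu;
	t->affinity_pinned = true;	/* side-effect: the mask is pinned too */
	return 0;
}

/* Owner thread unpins from its CPU: the affinity mask stays pinned. */
void model_unpin_cpu(struct task_model *t)
{
	t->pinned_cpu = -1;
}

/* Owner thread explicitly clears the affinity pin. */
int model_unpin_affinity(struct task_model *t)
{
	if (t->pinned_cpu != -1)
		return -EBUSY;	/* assumption: must unpin from the CPU first */
	t->affinity_pinned = false;
	return 0;
}

/* Option (2): the sched_setaffinity/cpusets caller gets an error back. */
int model_set_affinity(struct task_model *t, uint64_t new_mask)
{
	if (t->affinity_pinned)
		return -EBUSY;	/* error code is an assumption */
	if (!new_mask)
		return -EINVAL;
	t->cpus_allowed = new_mask;
	return 0;
}

/* Walks through the sequence described in the text. */
void model_demo(void)
{
	struct task_model t = { .cpus_allowed = 0x3, .pinned_cpu = -1 };

	assert(model_pin_on_cpu(&t, 1) == 0);
	assert(model_set_affinity(&t, 0x1) == -EBUSY);	/* pinned on CPU 1 */
	model_unpin_cpu(&t);
	assert(model_set_affinity(&t, 0x1) == -EBUSY);	/* mask still pinned */
	assert(model_unpin_affinity(&t) == 0);
	assert(model_set_affinity(&t, 0x1) == 0);	/* now allowed */
}
```

The point of the model is the second -EBUSY: because the affinity pin
outlives the CPU pin, an external affinity change fails consistently rather
than only when it happens to race with a short CPU pinning.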
:-/

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com