----- On Nov 17, 2017, at 5:09 AM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:

> On Thu, 16 Nov 2017, Andi Kleen wrote:
>> My preference would be just to drop this new super ugly system call.
>>
>> It's also not just the ugliness, but the very large attack surface
>> that worries me here.
>>
>> As far as I know it is only needed to support single stepping, correct?
>
> I can't figure that out because the changelog describes only WHAT the patch
> does and not WHY. Useful, isn't it?
>
>> Then this whole mess would disappear.
>
> Agreed. That would be much appreciated.

Let's have a look at why cpu_opv is needed. I'll make sure to enhance
the changelog and documentation to include that information.

1) Handling single-stepping from tools

Tools like debuggers, and simulators like record-replay ("rr"), use
single-stepping to run through existing programs. If core libraries
start to use restartable sequences for e.g. memory allocation, this
means pre-existing programs cannot be single-stepped, simply because
the underlying glibc or jemalloc has changed.

The rseq user-space does expose a __rseq_table section for the sake of
debuggers, so they can skip over the rseq critical sections if they
want. However, this requires upgrading tools, and still breaks
single-stepping in cases where glibc or jemalloc is updated, but not
the tooling.

Having a performance-related library improvement break tooling is
likely to cause a big push-back against wide adoption of rseq. *I*
would not even be using rseq in liburcu and lttng-ust until gdb gets
updated in every distribution that my users depend on. This will
likely happen... never.

2) Forward-progress guarantee

Having a piece of user-space code that stops progressing due to
external conditions is pretty bad. We are used to thinking of
fast-path and slow-path (e.g. for locking), where the contended vs
uncontended cases have different performance characteristics, but each
needs to provide some level of progress guarantees.

I'm very concerned about proposing just "rseq" without the associated
slow-path (cpu_opv) that guarantees progress. It's just asking for
trouble when real life happens: page faults, uprobes, and other
unforeseen conditions that could, on rare occasions, keep a rseq
fast-path from ever progressing.

3) Handling page faults

If we get creative enough, it's pretty easy to come up with corner-case
scenarios where rseq does not progress without help from cpu_opv. For
instance, a system with swap enabled which is under high memory
pressure could trigger page faults at pretty much every rseq attempt.
I recognize that this scenario is extremely unlikely, but I'm not
comfortable making rseq the weak link of the chain here.

4) Comparison with LL/SC

The layman versed in the load-link/store-conditional instructions in
RISC architectures will notice the similarity between rseq and LL/SC
critical sections. The comparison can even be pushed further: since
debuggers can handle those LL/SC critical sections, they should be
able to handle rseq c.s. in the same way.

First, the way gdb recognises LL/SC c.s. patterns is very fragile:
it's limited to specific common patterns, and will miss the pattern in
all other cases. But fear not, having the rseq c.s. expose a
__rseq_table to debuggers removes that guessing part.

The main difference between LL/SC and rseq is that debuggers had to
support single-stepping through LL/SC critical sections from the
get-go in order to support a given architecture. For rseq, we're
adding critical sections into pre-existing applications/libraries, so
the user expectation is that tools don't break due to a library
optimization.

5) Perform maintenance operations on per-cpu data

rseq c.s. are quite limited feature-wise: they need to end with a
*single* commit instruction that updates a memory location. On the
other hand, the cpu_opv system call can combine a sequence of
operations that need to be executed with preemption disabled.

While slower than rseq, this allows for more complex maintenance
operations to be performed on per-cpu data concurrently with rseq
fast-paths, in cases where it's not possible to map those sequences of
ops to a rseq.

6) Use cpu_opv as generic implementation for architectures not
   implementing rseq assembly code

rseq critical sections require architecture-specific user-space code
to be crafted in order to port an algorithm to a given architecture.
In addition, it requires that the kernel architecture implementation
adds hooks into signal delivery and resume to user-space.

In order to facilitate integration of rseq into user-space, cpu_opv
can provide a (relatively slower) architecture-agnostic implementation
of rseq. This means that user-space code can be ported to all
architectures through use of cpu_opv initially, and have the fast-path
use rseq whenever the asm code is implemented.

7) Allow libraries with multi-part algorithms to work on same per-cpu
   data without affecting the allowed cpu mask

I stumbled on an interesting use-case within the lttng-ust tracer
per-cpu buffers: the algorithm needs to update a "reserve" counter,
serialize data into the buffer, and then update a "commit" counter
_on the same per-cpu buffer_. My goal is to use rseq for both reserve
and commit.

Clearly, if rseq reserve fails, the algorithm can retry on a different
per-cpu buffer. However, it's not that easy for the commit. It needs to
be performed on the same per-cpu buffer as the reserve.

The cpu_opv system call solves that problem by receiving the cpu number
on which the operation needs to be performed as argument. It can push
the task to the right CPU if needed, and perform the operations there
with preemption disabled.

Changing the allowed cpu mask for the current thread is not an
acceptable alternative for a tracing library, because the application
being traced does not expect that mask to be changed by libraries.

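To make the reserve/commit constraint in 7) more concrete, here is a
rough single-threaded simulation in C. It does not call rseq or
cpu_opv at all: fast_add() and slow_add() are hypothetical stand-ins
for a rseq critical section and for a cpu_opv-style slow path, and
current_cpu, pending_aborts, NR_CPUS, BUF_SIZE and MAX_FAST_ATTEMPTS
are names invented for the illustration. The only point is the control
flow: reserve may retry on whichever CPU the thread is on, while commit
must target the CPU returned by reserve and falls back to the slow path
when the fast path keeps aborting.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS           4   /* hypothetical CPU count for the simulation */
#define BUF_SIZE          64  /* hypothetical per-cpu buffer size */
#define MAX_FAST_ATTEMPTS 2   /* fast-path tries before using the slow path */

struct percpu_buf {
	size_t reserve;           /* bytes reserved so far */
	size_t commit;            /* bytes committed so far */
	char data[BUF_SIZE];
};

static struct percpu_buf bufs[NR_CPUS];
static int current_cpu;       /* simulated: CPU the thread runs on */
static int pending_aborts;    /* simulated: fast-path attempts that will abort */
static int slow_path_calls;   /* counts how often the slow path was needed */

/* Stand-in for a rseq c.s. ending in a single commit store. It aborts
 * if we are not on the target CPU or if an abort has been injected. */
static int fast_add(int cpu, size_t *counter, size_t len)
{
	if (cpu != current_cpu || pending_aborts > 0) {
		if (pending_aborts > 0)
			pending_aborts--;
		return -1;            /* caller must retry or fall back */
	}
	*counter += len;
	return 0;
}

/* Stand-in for the slow path: the target CPU is an explicit argument,
 * and the operation always completes. */
static void slow_add(int cpu, size_t *counter, size_t len)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	slow_path_calls++;
	*counter += len;
}

/* Reserve len bytes. Returns the CPU whose buffer was used, or -1 if
 * the buffer is full. Each attempt may pick a different CPU. */
static int buf_reserve(size_t len, size_t *offset)
{
	for (int i = 0; i < MAX_FAST_ATTEMPTS; i++) {
		int cpu = current_cpu;
		size_t old = bufs[cpu].reserve;

		if (old + len > BUF_SIZE)
			return -1;
		if (fast_add(cpu, &bufs[cpu].reserve, len) == 0) {
			*offset = old;
			return cpu;
		}
	}
	int cpu = current_cpu;
	if (bufs[cpu].reserve + len > BUF_SIZE)
		return -1;
	*offset = bufs[cpu].reserve;
	slow_add(cpu, &bufs[cpu].reserve, len);
	return cpu;
}

/* Serialize data into the reserved slot. */
static void buf_write(int cpu, size_t offset, const void *src, size_t len)
{
	memcpy(bufs[cpu].data + offset, src, len);
}

/* Commit on the SAME buffer as the reserve, wherever we run now. */
static void buf_commit(int cpu, size_t len)
{
	for (int i = 0; i < MAX_FAST_ATTEMPTS; i++)
		if (fast_add(cpu, &bufs[cpu].commit, len) == 0)
			return;
	slow_add(cpu, &bufs[cpu].commit, len);
}
```

In this sketch, a thread that reserves on CPU 1 and is then "migrated"
(current_cpu set to 2) before calling buf_commit() still updates the
commit counter of buffer 1, through slow_add(), without touching any
affinity mask.
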
So, TLDR: cpu_opv is needed for many use-cases other than
single-stepping, and facilitates adoption of rseq into pre-existing
applications.

Thanks,

Mathieu

>
> Thanks,
>
> tglx

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com