From: Rob Gardner <rob.gardner@xxxxxxxxxx> Date: Tue, 13 Jun 2017 23:05:57 -0600 > One particular critical section occurs in switch_mm: > > spin_lock_irqsave(&mm->context.lock, flags); > ... > load_secondary_context(mm); > tsb_context_switch(mm); > ... > spin_unlock_irqrestore(&mm->context.lock, flags); > > If a perf interrupt arrives in between load_secondary_context() and > tsb_context_switch(), then perf_callchain_user() could execute with > the context ID of one process, but with an active tsb for a different > process. When the user stack is accessed, it is very likely to > incur a TLB miss, since the h/w context ID has been changed. The TLB > will then be reloaded with a translation from the TSB for one process, > but using a context ID for another process. This exposes memory from > one process to another, and since it is a mapping for stack memory, > this usually causes the new process to crash quickly. Ahhh, good catch. > Some potential solutions are: > > 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX. > This would certainly eliminate the problem, but it would also prevent > perf from having any visibility into code running in these critical > sections, and it seems clear that PIL_NORMAL_MAX is used for just > this reason. > > 2) Protect this particular critical section by masking all interrupts, > either by setting %pil to PIL_NMI or by clearing pstate.ie around the > calls to load_secondary_context() and tsb_context_switch(). This approach > has a few drawbacks: > > - It would only address this particular critical section, and would > have to be repeated in several other known places. There might be > other such critical sections that are not known. > > - It has a performance cost which would be incurred at every context > switch, since it would require additional accesses to %pil or > %pstate. > > - Turning off pstate.ie would require changing __tsb_context_switch(), > which expects to be called with pstate.ie on. > > 3) Avoid user space MMU activity entirely in perf_callchain_user() by > implementing a new copy_from_user() function that accesses the user > stack via physical addresses. This works, but requires quite a bit of > new code to get it to perform reasonably, ie, caching of translations, > etc. > > 4) Allow the perf interrupt to happen in existing critical sections as > it does now, but have perf code detect that this is happening, and > skip any user callchain processing. This approach was deemed best, as > the change is extremely localized and covers both known and unknown > instances of perf interrupting critical sections. Perf has interrupted > a critical section when %pil == PIL_NORMAL_MAX at the time of the perf > interrupt. > > Ordinarily, a function that has a pt_regs passed in can simply examine > (reg->tstate & TSTATE_PIL) to extract the interrupt level that was in > effect at the time of the perf interrupt. However, when the perf > interrupt occurs while executing in the kernel, the pt_regs pointer is > replaced with task_pt_regs(current) in perf_callchain(), and so > perf_callchain_user() does not have access to the state of the machine > at the time of the perf interrupt. To work around this, we check > (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling > in to the arch independent part of perf, and temporarily change the > event attributes so that user callchain processing is avoided. Though > a user stack sample is not collected, this loss is not statistically > significant. Kernel call graph collection is not affected. I tried to figure out how x86 handles this, and gave up after some time :-) Can you figure it out? PowerPC handles this by hard disabling IRQs from switch_mm until switch_to completes. I think excluding all local_irq_disabled() paths from having stack backtraces is not good. I'd rather see PIL_MAX get set from switch_mm until switch_to completes. Ok? -- To unsubscribe from this list: send the line "unsubscribe sparclinux" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html