Re: [PATCH] sparc64: Exclude perf user callchain during critical sections

David Miller <davem@xxxxxxxxxxxxx> · Wed, 14 Jun 2017 12:25:04 -0400 (EDT)

From: Rob Gardner <rob.gardner@xxxxxxxxxx>
Date: Tue, 13 Jun 2017 23:05:57 -0600

> One particular critical section occurs in switch_mm:
> 
>         spin_lock_irqsave(&mm->context.lock, flags);
>         ...
>         load_secondary_context(mm);
>         tsb_context_switch(mm);
>         ...
>         spin_unlock_irqrestore(&mm->context.lock, flags);
> 
> If a perf interrupt arrives in between load_secondary_context() and
> tsb_context_switch(), then perf_callchain_user() could execute with
> the context ID of one process, but with an active tsb for a different
> process. When the user stack is accessed, it is very likely to
> incur a TLB miss, since the h/w context ID has been changed. The TLB
> will then be reloaded with a translation from the TSB for one process,
> but using a context ID for another process. This exposes memory from
> one process to another, and since it is a mapping for stack memory,
> this usually causes the new process to crash quickly.

Ahhh, good catch.

> Some potential solutions are:
> 
> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
> This would certainly eliminate the problem, but it would also prevent
> perf from having any visibility into code running in these critical
> sections, and it seems clear that PIL_NORMAL_MAX is used for just
> this reason.
> 
> 2) Protect this particular critical section by masking all interrupts,
> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
> calls to load_secondary_context() and tsb_context_switch(). This approach
> has a few drawbacks:
> 
>  - It would only address this particular critical section, and would
>    have to be repeated in several other known places. There might be
>    other such critical sections that are not known.
> 
>  - It has a performance cost which would be incurred at every context
>    switch, since it would require additional accesses to %pil or
>    %pstate.
> 
>  - Turning off pstate.ie would require changing __tsb_context_switch(),
>    which expects to be called with pstate.ie on.
> 
> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
> implementing a new copy_from_user() function that accesses the user
> stack via physical addresses. This works, but requires quite a bit of
> new code to get it to perform reasonably, ie, caching of translations,
> etc.
> 
> 4) Allow the perf interrupt to happen in existing critical sections as
> it does now, but have perf code detect that this is happening, and
> skip any user callchain processing. This approach was deemed best, as
> the change is extremely localized and covers both known and unknown
> instances of perf interrupting critical sections. Perf has interrupted
> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
> interrupt.
> 
> Ordinarily, a function that has a pt_regs passed in can simply examine
> (reg->tstate & TSTATE_PIL) to extract the interrupt level that was in
> effect at the time of the perf interrupt.  However, when the perf
> interrupt occurs while executing in the kernel, the pt_regs pointer is
> replaced with task_pt_regs(current) in perf_callchain(), and so
> perf_callchain_user() does not have access to the state of the machine
> at the time of the perf interrupt. To work around this, we check
> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
> in to the arch independent part of perf, and temporarily change the
> event attributes so that user callchain processing is avoided. Though
> a user stack sample is not collected, this loss is not statistically
> significant. Kernel call graph collection is not affected.

I tried to figure out how x86 handles this, and gave up after some
time :-)  Can you figure it out?

PowerPC handles this by hard disabling IRQs from switch_mm until
switch_to completes.

I think excluding all local_irq_disabled() paths from having stack
backtraces is not good.  I'd rather see PIL_MAX get set from switch_mm
until switch_to completes.

Ok?

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html