From: Rob Gardner <rob.gardner@xxxxxxxxxx>
Date: Wed, 14 Jun 2017 10:47:08 -0600

> On 06/14/2017 10:25 AM, David Miller wrote:
>> From: Rob Gardner <rob.gardner@xxxxxxxxxx>
>> Date: Tue, 13 Jun 2017 23:05:57 -0600
>>
>>> One particular critical section occurs in switch_mm:
>>>
>>>    spin_lock_irqsave(&mm->context.lock, flags);
>>>    ...
>>>    load_secondary_context(mm);
>>>    tsb_context_switch(mm);
>>>    ...
>>>    spin_unlock_irqrestore(&mm->context.lock, flags);
>>>
>>> If a perf interrupt arrives in between load_secondary_context() and
>>> tsb_context_switch(), then perf_callchain_user() could execute with
>>> the context ID of one process, but with an active TSB for a different
>>> process. When the user stack is accessed, it is very likely to incur
>>> a TLB miss, since the h/w context ID has been changed. The TLB will
>>> then be reloaded with a translation from the TSB for one process, but
>>> using a context ID for another process. This exposes memory from one
>>> process to another, and since it is a mapping for stack memory, this
>>> usually causes the new process to crash quickly.
>>
>> Ahhh, good catch.
>>
>>> Some potential solutions are:
>>>
>>> 1) Make critical sections run at PIL_NMI instead of PIL_NORMAL_MAX.
>>> This would certainly eliminate the problem, but it would also prevent
>>> perf from having any visibility into code running in these critical
>>> sections, and it seems clear that PIL_NORMAL_MAX is used for just
>>> this reason.
>>>
>>> 2) Protect this particular critical section by masking all interrupts,
>>> either by setting %pil to PIL_NMI or by clearing pstate.ie around the
>>> calls to load_secondary_context() and tsb_context_switch(). This
>>> approach has a few drawbacks:
>>>
>>> - It would only address this particular critical section, and would
>>>   have to be repeated in several other known places. There might be
>>>   other such critical sections that are not known.
>>>
>>> - It has a performance cost which would be incurred at every context
>>>   switch, since it would require additional accesses to %pil or
>>>   %pstate.
>>>
>>> - Turning off pstate.ie would require changing __tsb_context_switch(),
>>>   which expects to be called with pstate.ie on.
>>>
>>> 3) Avoid user space MMU activity entirely in perf_callchain_user() by
>>> implementing a new copy_from_user() function that accesses the user
>>> stack via physical addresses. This works, but requires quite a bit of
>>> new code to get it to perform reasonably, i.e., caching of
>>> translations, etc.
>>>
>>> 4) Allow the perf interrupt to happen in existing critical sections as
>>> it does now, but have perf code detect that this is happening, and
>>> skip any user callchain processing. This approach was deemed best, as
>>> the change is extremely localized and covers both known and unknown
>>> instances of perf interrupting critical sections. Perf has interrupted
>>> a critical section when %pil == PIL_NORMAL_MAX at the time of the perf
>>> interrupt.
>>>
>>> Ordinarily, a function that has a pt_regs passed in can simply examine
>>> (regs->tstate & TSTATE_PIL) to extract the interrupt level that was in
>>> effect at the time of the perf interrupt. However, when the perf
>>> interrupt occurs while executing in the kernel, the pt_regs pointer is
>>> replaced with task_pt_regs(current) in perf_callchain(), and so
>>> perf_callchain_user() does not have access to the state of the machine
>>> at the time of the perf interrupt.
>>> To work around this, we check
>>> (regs->tstate & TSTATE_PIL) in perf_event_nmi_handler() before calling
>>> into the arch-independent part of perf, and temporarily change the
>>> event attributes so that user callchain processing is avoided. Though
>>> a user stack sample is not collected, this loss is not statistically
>>> significant. Kernel call graph collection is not affected.
>>
>> I tried to figure out how x86 handles this, and gave up after some
>> time :-)  Can you figure it out?
>
> I could look.
>
>>
>> PowerPC handles this by hard disabling IRQs from switch_mm until
>> switch_to completes.
>>
>> I think excluding all local_irq_disabled() paths from having stack
>> backtraces is not good.
>
> Just to clarify, this does not exclude local_irq_disabled() paths from
> having stack backtraces. It only disables collection of *user* stack
> backtraces during these (very rare) times. Collection of *kernel*
> stack traces is not affected, so we still get the same visibility from
> perf for kernel call graphs. I do not believe we sacrifice anything
> here in terms of user stack collection since we'll only skip doing
> this when:
>
> 1) we're already executing in the kernel when the perf interrupt
>    arrives, and this is already (what should be) a rare event
>
> 2) and the kernel is executing in a local_irq_disabled() path, which
>    should also be very rare

I understood this from the beginning.

>> I'd rather see PIL_MAX get set from switch_mm
>> until switch_to completes.
>>
>> Ok?
>
> In light of the above, what is your preference?

I still have the same position.  The lack of a user backtrace is a loss
of information.  And in order to make it work we must have the
PGD/TSB/mmu-context/stack all match up precisely.
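For concreteness, the detection described in approach 4 could look roughly
like the sketch below. This is only an illustration of the idea in the quoted
changelog, not the actual patch: the wrapper name is invented, and using the
existing perf_event_attr.exclude_callchain_user bit is just one plausible way
to "temporarily change the event attributes" before perf_event_nmi_handler()
calls into the generic perf code.

#include <linux/perf_event.h>
#include <asm/ptrace.h>

/*
 * Sketch only (not the actual patch): skip the user half of the callchain
 * when the perf NMI interrupted code running with a raised %pil, i.e. a
 * local_irq_disable()-style critical section such as the switch_mm() window
 * quoted above.  'perf_overflow_skip_user_stack' is an invented name; the
 * real change would sit in the existing perf_event_nmi_handler() loop in
 * arch/sparc/kernel/perf_event.c, which also checks the return value of
 * perf_event_overflow() to stop the event.
 */
static void perf_overflow_skip_user_stack(struct perf_event *event,
					  struct perf_sample_data *data,
					  struct pt_regs *regs)
{
	bool saved = event->attr.exclude_callchain_user;

	/* TSTATE_PIL is the saved %pil field in pt_regs->tstate; it is
	 * non-zero only if the interrupted code had raised its PIL, in
	 * which case the secondary context and the TSB may not match.
	 */
	if (regs->tstate & TSTATE_PIL)
		event->attr.exclude_callchain_user = 1;

	/* Hand the sample to the arch-independent perf code; it honors
	 * exclude_callchain_user and will not call perf_callchain_user().
	 */
	perf_event_overflow(event, data, regs);

	event->attr.exclude_callchain_user = saved;
}

Only the user half of the callchain is dropped for that one sample; the
kernel callchain is still collected, which is the distinction Rob draws
above.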
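The alternative David prefers (option 2 in the list, and essentially what
PowerPC does) is to make the window itself non-interruptible by raising
%pil. A rough sketch of the narrow form, raising %pil just around the two
calls, is below; the helper name is invented, PIL_NMI stands in for what
David calls PIL_MAX, and David's actual suggestion keeps the level raised
all the way until switch_to completes rather than restoring it here.

#include <asm/mmu_context.h>	/* load_secondary_context(), tsb_context_switch() */
#include <asm/pil.h>		/* PIL_NMI */

/*
 * Hypothetical illustration of option 2 / David's suggestion: run the
 * context-ID and TSB switch with %pil raised to PIL_NMI so that not even
 * the PIL_NORMAL_MAX perf interrupt can land between the two steps.
 * 'switch_mm_critical' is an invented name for the body of the existing
 * switch_mm() critical section.
 */
static void switch_mm_critical(struct mm_struct *mm)
{
	unsigned long orig_pil;

	/* Save the current %pil and raise it to PIL_NMI (sparc64 keeps
	 * its interrupt masking level in %pil).
	 */
	__asm__ __volatile__("rdpr %%pil, %0" : "=r" (orig_pil));
	__asm__ __volatile__("wrpr %0, 0x0, %%pil"
			     : /* no outputs */
			     : "r" ((unsigned long) PIL_NMI)
			     : "memory");

	load_secondary_context(mm);	/* install mm's hardware context ID */
	tsb_context_switch(mm);		/* point the MMU at mm's TSB        */

	/* Drop back to the previous PIL; perf NMIs may fire again. */
	__asm__ __volatile__("wrpr %0, 0x0, %%pil"
			     : : "r" (orig_pil) : "memory");
}

The drawback already noted in the changelog applies here: two extra %pil
accesses on every context switch, traded against the occasional lost user
stack sample of approach 4.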