Re: [RFC PATCH v2 0/4] arm64: Implement stack trace reliability checks

Josh Poimboeuf <jpoimboe@xxxxxxxxxx> · Fri, 9 Apr 2021 17:32:27 -0500

On Fri, Apr 09, 2021 at 05:05:58PM -0500, Madhavan T. Venkataraman wrote:
> > FWIW, over the years we've had zero issues with encoding the frame
> > pointer on x86.  After you save pt_regs, you encode the frame pointer to
> > point to it.  Ideally in the same macro so it's hard to overlook.
> > 
> 
> I had the same opinion. In fact, in my encoding scheme, I have additional
> checks to make absolutely sure that it is a true encoding and not stack
> corruption. The chances of all of those values accidentally matching are,
> well, null.

Right, stack corruption -- which is already exceedingly rare -- would
have to be combined with a miracle or two in order to come out of the
whole thing marked as 'reliable' :-)

And really, we already take a similar risk today by "trusting" the frame
pointer value on the stack to a certain extent.

> >> I think there's a lot more code that we cannot unwind, e.g. KVM
> >> exception code, or almost anything marked with SYM_CODE_END().
> > 
> > Just a reminder that livepatch only unwinds blocked tasks (plus the
> > 'current' task which calls into livepatch).  So practically speaking, it
> > doesn't matter whether the 'unreliable' detection has full coverage.
> > The only exceptions which really matter are those which end up calling
> > schedule(), e.g. preemption or page faults.
> > 
> > Being able to consistently detect *all* possible unreliable paths would
> > be nice in theory, but it's unnecessary and may not be worth the extra
> > complexity.
> > 
> 
> You do have a point. I tried to think of arch_stack_walk_reliable() as
> something that should be implemented independent of livepatching. But
> I could not really come up with a single example of where else it would
> really be useful.
> 
> So, if we assume that the reliable stack trace is solely for the purpose
> of livepatching, I agree with your earlier comments as well.

One thought: if folks really view this as a problem, it might help to
just rename things to reduce confusion.

For example, instead of calling it 'reliable', we could call it
something more precise, like 'klp_reliable', to indicate that its
reliable enough for live patching.

Then have a comment above 'klp_reliable' and/or
stack_trace_save_tsk_klp_reliable() which describes what that means.

Hm, for that matter, even without renaming things, a comment above
stack_trace_save_tsk_reliable() describing the meaning of "reliable"
would be a good idea.

-- 
Josh