Re: [PATCH v4 11/13] KVM: arm64: Improve CONFIG_CFI_CLANG error message

Pierre-Clément Tosi <ptosi@xxxxxxxxxx> · Tue, 4 Jun 2024 17:05:59 +0100

On Mon, Jun 03, 2024 at 03:48:08PM +0100, Will Deacon wrote:
> On Wed, May 29, 2024 at 01:12:17PM +0100, Pierre-Clément Tosi wrote:
> > For kCFI, the compiler encodes in the immediate of the BRK (which the
> > CPU places in ESR_ELx) the indices of the two registers it used to hold
> > (resp.) the function pointer and expected type. Therefore, the kCFI
> > handler must be able to parse the contents of the register file at the
> > point where the exception was triggered.
> > 
> > To achieve this, introduce a new hypervisor panic path that first stores
> > the CPU context in the per-CPU kvm_hyp_ctxt before calling (directly or
> > indirectly) hyp_panic() and execute it from all EL2 synchronous
> > exception handlers i.e.
> > 
> > - call it directly in host_el2_sync_vect (__kvm_hyp_host_vector, EL2t&h)
> > - call it directly in el2t_sync_invalid (__kvm_hyp_vector, EL2t)
> > - set ELR_EL2 to it in el2_sync (__kvm_hyp_vector, EL2h), which ERETs
> > 
> > Teach hyp_panic() to decode the kCFI ESR and extract the target and type
> > from the saved CPU context. In VHE, use that information to panic() with
> > a specialized error message. In nVHE, only report it if the host (EL1)
> > has access to the saved CPU context i.e. iff CONFIG_NVHE_EL2_DEBUG=y,
> > which aligns with the behavior of CONFIG_PROTECTED_NVHE_STACKTRACE.
> > 
> > Signed-off-by: Pierre-Clément Tosi <ptosi@xxxxxxxxxx>
> > ---
> >  arch/arm64/kvm/handle_exit.c            | 30 +++++++++++++++++++++++--
> >  arch/arm64/kvm/hyp/entry.S              | 24 +++++++++++++++++++-
> >  arch/arm64/kvm/hyp/hyp-entry.S          |  2 +-
> >  arch/arm64/kvm/hyp/include/hyp/switch.h |  4 ++--
> >  arch/arm64/kvm/hyp/nvhe/host.S          |  2 +-
> >  arch/arm64/kvm/hyp/vhe/switch.c         | 26 +++++++++++++++++++--
> >  6 files changed, 79 insertions(+), 9 deletions(-)
> 
> This quite a lot of work just to print out some opaque type numbers
> when CONFIG_NVHE_EL2_DEBUG=y. Is it really worth it? How would I use
> this information to debug an otherwise undebuggable kcfi failure at EL2?

The type ID alone might not be worth it but what about the target?

In my experience, non-malicious kCFI failures are often caused by an issue with
the callee, not the caller. Without this patch, only the source of the exception
is reported but, with it, the panic handler also prints the kCFI target (i.e.
value of the function pointer) as a symbol.

With the infrastructure for the target in place, it isn't much more work to also
report the type ID. Although it is rarely informative (as you noted), there are
some situations where it can still be useful e.g. if reported as zero and/or has
been corrupted and does not match its value from the ELF.