Re: [PATCH v1 15/40] i386/tdx: Add property sept-ve-disable for tdx-guest object

Gerd Hoffmann <kraxel@xxxxxxxxxx> · Fri, 2 Sep 2022 07:46:21 +0200

On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote:
> On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> > >    Hi,
> > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > > [1].
> > > 
> > > Why is that a problem for a TD guest kernel?  Installing exception
> > > handlers is done quite early in the boot process, certainly before any
> > > userspace code runs.  So I think we should never see a syscall without
> > > a #VE handler being installed.  /me is confused.
> > > 
> > > Or do you want to tell me linux has no #VE handler?
> > 
> > The problem is not "no #VE handler"; Linux does have a #VE handler. The
> > problem is that Linux doesn't want any (or at least certain) exceptions to
> > occur in the syscall gap, and that is not specific to #VE. Frankly, I don't
> > understand the reason clearly; it's related to IST use in the x86 Linux kernel.
> 
> The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
> at the SYSCALL entry point runs with a userspace-controlled RSP.  With TDX, a
> malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
> to run the #VE handler with a userspace stack.
> 
> The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
> but ISTs are terrible because they don't play nice with re-entrancy (among other
> reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
> handler triggers another #VE at any point before IRET, the second #VE will clobber
> the stack and hose the kernel.
>
> It's possible to work around this, e.g. change the IST entry at the very beginning
> of the handler, but it's a maintenance burden.  Since the only reason to use an IST
> is to guard against a malicious hypervisor, Linux decided it would be just as easy
> and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.
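
To make the re-entrancy problem quoted above concrete, here is a toy Python
model. It is only a sketch, not kernel code: the "stack" is a dict, the
addresses and the names IST_TOP, deliver() and nested_ve() are made up for
illustration, and frames are plain strings. It only shows that a fixed IST
stack top makes a nested #VE overwrite the outer frame, while delivery on the
current stack does not.

```python
# Toy model only: a "stack" is a dict mapping addresses to saved frames.
IST_TOP = 0x1000  # hypothetical fixed stack top used for every IST delivery


def deliver(stack, rsp, frame, use_ist):
    """Push an exception frame and return the new RSP.

    With use_ist=True the current RSP is ignored and delivery always
    starts from IST_TOP; otherwise the frame goes below the current RSP.
    """
    base = IST_TOP if use_ist else rsp
    new_rsp = base - 8
    stack[new_rsp] = frame
    return new_rsp


def nested_ve(use_ist):
    """Return what is left in the outer frame's slot after a nested #VE."""
    stack = {}
    # First #VE arrives while running on some (hypothetical) kernel stack.
    rsp = deliver(stack, 0x2000, "outer #VE frame", use_ist)
    outer_slot = rsp
    # The handler triggers another #VE before IRET.
    deliver(stack, rsp, "inner #VE frame", use_ist)
    return stack[outer_slot]


if __name__ == "__main__":
    print("IST:    ", nested_ve(True))
    print("non-IST:", nested_ve(False))
```

In the non-IST case the outer frame survives, but the frame lands wherever
RSP happens to point, which is exactly what is unsafe in the SYSCALL gap.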

Hmm, ok, but shouldn't the SEPT_VE bit *really* be controlled by the guest then?

Having a hypervisor-controlled config bit to protect against a malicious
hypervisor looks pointless to me ...

take care,
  Gerd