On Fri, Sep 02, 2022, Gerd Hoffmann wrote:
> On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote:
> > On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> > > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> > > >   Hi,
> > > >
> > > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> > > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > > > [1].
> > > >
> > > > Why is that a problem for a TD guest kernel?  Installing exception
> > > > handlers is done quite early in the boot process, certainly before any
> > > > userspace code runs.  So I think we should never see a syscall without
> > > > a #VE handler being installed.  /me is confused.
> > > >
> > > > Or do you want tell me linux has no #VE handler?
> > >
> > > The problem is not "no #VE handler" and Linux does have #VE handler. The
> > > problem is Linux doesn't want any (or certain) exception occurrence in
> > > syscall gap, it's not specific to #VE. Frankly, I don't understand the
> > > reason clearly, it's something related to IST used in x86 Linux kernel.
> >
> > The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
> > at the SYSCALL entry point runs with a userspaced-controlled RSP.  With TDX, a
> > malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
> > to run the #VE handler with a userspace stack.
> >
> > The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
> > but ISTs are terrible because they don't play nice with re-entrancy (among other
> > reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
> > handler triggers another #VE at any point before IRET, the second #VE will clobber
> > the stack and hose the kernel.
> >
> > It's possible to workaround this, e.g. change the IST entry at the very beginning
> > of the handler, but it's a maintenance burden.  Since the only reason to use an IST
> > is to guard against a malicious hypervisor, Linux decided it would be just as easy
> > and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.
>
> Hmm, ok, but shouldn't the SEPT_VE bit *really* controlled by the guest then?
>
> Having a hypervisor-controlled config bit to protect against a malicious
> hypervisor looks pointless to me ...

IIRC, all (most?) of the attributes are included in the attestation report, so a
guest/customer can refuse to provision secrets to the guest if the hypervisor is
misbehaving.  I'm guessing Intel made it an attribute and not a dynamic control
knob to simplify the TDX module implementation.
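
To illustrate the attestation point, here is a rough sketch of what a
relying party might check before provisioning secrets.  The bit positions
mirror the Linux guest's ATTR_DEBUG (bit 0) and ATTR_SEPT_VE_DISABLE (bit 28)
definitions; the function names, the 8-byte little-endian encoding of the
attributes field, and the policy itself are assumptions made for the example.
Verifying the signature on the report is a separate step and is not shown.

```python
# Sketch only: a hypothetical relying-party policy check on TD attributes.
# Bit positions follow the Linux guest's definitions; everything else here
# (names, encoding, policy) is an assumption for illustration.

ATTR_DEBUG = 1 << 0             # TD is debuggable by the host
ATTR_SEPT_VE_DISABLE = 1 << 28  # no #VE on unaccepted private pages


def parse_td_attributes(raw: bytes) -> int:
    """Decode the 64-bit attributes field (assumed little-endian)."""
    if len(raw) != 8:
        raise ValueError("TD attributes field must be exactly 8 bytes")
    return int.from_bytes(raw, "little")


def should_provision_secrets(attributes: int) -> bool:
    """Refuse secrets unless the TD was launched the way the guest expects."""
    if attributes & ATTR_DEBUG:
        # A debuggable TD offers no confidentiality against the host.
        return False
    # The guest relies on SEPT_VE being disabled to keep #VE out of the
    # SYSCALL gap, so treat a cleared bit as a misbehaving hypervisor.
    return bool(attributes & ATTR_SEPT_VE_DISABLE)
```

The hypervisor still picks the bit at TD creation, but it cannot hide its
choice: a TD launched with the wrong attributes simply never receives secrets.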