On Fri, Sep 02, 2022 at 02:52:25AM +0000, Sean Christopherson wrote:
> On Fri, Sep 02, 2022, Xiaoyao Li wrote:
> > On 8/26/2022 1:57 PM, Gerd Hoffmann wrote:
> > >   Hi,
> > >
> > > > For TD guest kernel, it has its own reason to turn SEPT_VE on or off. E.g.,
> > > > linux TD guest requires SEPT_VE to be disabled to avoid #VE on syscall gap
> > > > [1].
> > >
> > > Why is that a problem for a TD guest kernel?  Installing exception
> > > handlers is done quite early in the boot process, certainly before any
> > > userspace code runs.  So I think we should never see a syscall without
> > > a #VE handler being installed.  /me is confused.
> > >
> > > Or do you want to tell me linux has no #VE handler?
> >
> > The problem is not "no #VE handler" and Linux does have #VE handler. The
> > problem is Linux doesn't want any (or certain) exception occurrence in
> > syscall gap, it's not specific to #VE. Frankly, I don't understand the
> > reason clearly, it's something related to IST used in x86 Linux kernel.
>
> The SYSCALL gap issue is that because SYSCALL doesn't load RSP, the first instruction
> at the SYSCALL entry point runs with a userspace-controlled RSP.  With TDX, a
> malicious hypervisor can induce a #VE on the SYSCALL page and thus get the kernel
> to run the #VE handler with a userspace stack.
>
> The "fix" is to use an IST for #VE so that a kernel-controlled RSP is loaded on #VE,
> but ISTs are terrible because they don't play nice with re-entrancy (among other
> reasons).  The RSP used for IST-based handlers is hardcoded, and so if a #VE
> handler triggers another #VE at any point before IRET, the second #VE will clobber
> the stack and hose the kernel.
>
> It's possible to work around this, e.g. change the IST entry at the very beginning
> of the handler, but it's a maintenance burden.  Since the only reason to use an IST
> is to guard against a malicious hypervisor, Linux decided it would be just as easy
> and more beneficial to avoid unexpected #VEs due to unaccepted private pages entirely.

Hmm, ok, but shouldn't the SEPT_VE bit *really* be controlled by the guest
then?  Having a hypervisor-controlled config bit to protect against a
malicious hypervisor looks pointless to me ...

take care,
  Gerd
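
For anyone following the IST re-entrancy point quoted above, here is a toy
model of it. This is only a sketch under loud assumptions: the "stack" is a
Python list, `deliver` and `outer_frame_survives` are hypothetical names made
up for illustration, and a single saved return address stands in for the
whole exception frame. It is not how the kernel or the CPU implements any of
this; it only shows why reloading a fixed RSP on every #VE clobbers the frame
of a handler that has not yet reached IRET.

```python
STACK_WORDS = 8  # assumed toy stack size, in words


def deliver(stack, current_rsp, return_rip, use_ist):
    """Model delivering an exception and pushing one word of frame.

    With use_ist=True the stack pointer is reset to the same fixed top on
    every delivery (the IST behaviour described in the thread). Otherwise
    the frame is pushed below wherever the stack pointer currently is.
    """
    rsp = STACK_WORDS if use_ist else current_rsp
    rsp -= 1
    stack[rsp] = return_rip  # saved return address of the interrupted code
    return rsp


def outer_frame_survives(use_ist):
    """Return True if the first #VE's saved RIP is intact after a nested #VE."""
    stack = [0] * STACK_WORDS
    outer_rip, inner_rip = 0x1111, 0x2222

    # First #VE: handler starts running on this stack.
    rsp = deliver(stack, STACK_WORDS, outer_rip, use_ist)
    outer_slot = rsp

    # Nested #VE arrives before the first handler has executed IRET.
    deliver(stack, rsp, inner_rip, use_ist)

    return stack[outer_slot] == outer_rip


print("IST, nested #VE, outer frame intact:", outer_frame_survives(True))
print("no IST, nested #VE, outer frame intact:", outer_frame_survives(False))
```

In this model the IST case reports the outer frame as lost, because both
deliveries write to the same slot, while the non-IST case keeps it. The
non-IST case is of course the one exposed to the SYSCALL gap, which is the
trade-off discussed in the quoted text.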