Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization

Zhiquan Li <zhiquan1.li@xxxxxxxxx> · Wed, 12 Oct 2022 13:09:25 +0800

On 2022/10/11 22:04, Dave Hansen wrote:
> On 9/19/22 23:39, Zhiquan Li wrote:
>> Today, if a guest accesses an SGX EPC page with memory failure,
>> the kernel behavior will kill the entire guest.  This blast
>> radius is too large.  It would be idea to kill only the SGX
> 
> 				ideal ^
> 
>> application inside the guest.
>>
>> To fix this, send a SIGBUS to host userspace (like QEMU) which can
>> follow up by injecting a #MC to the guest.
> 
> This doesn't make any sense to me.  It's *ALREADY* sending a SIGBUS.
> So, whatever is making this better, it's not "send a SIGBUS" that's
> doing it.
> 
> What does this patch actually do to reduce the blast radius?
> 

Thanks for your attention, Dave.

This part comes from your earlier comments:

https://lore.kernel.org/linux-sgx/Yrf27fugD7lkyaek@xxxxxxxxxx/T/#m6d62670eb530fab178eefaaaf31a22c4475e818d

The key is that the SIGBUS should carry the code BUS_MCEERR_AR and the
virtual address of the virtual EPC page. The hypervisor can identify it
by that specific code and inject a #MC into the guest.

Can we change the statement like this?

    To fix this, send a SIGBUS with code BUS_MCEERR_AR and the virtual
    address of the virtual EPC page to host userspace (like QEMU), which
    can follow up by injecting a #MC into the guest.
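
To illustrate what "identify it with the specific code" could mean on
the userspace side, here is a rough sketch, not QEMU's actual code. The
names vepc_region, guest_vepc, is_vepc_mce() and on_sigbus() are
hypothetical, and the sketch assumes the hypervisor knows the host
virtual range where the virtual EPC is mapped. A real hypervisor would
normally handle this on the vCPU thread and translate the address to a
guest physical address before injecting the #MC.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* Linux value; fallback for older headers */
#endif

/* Hypothetical: host virtual range where the virtual EPC is mapped. */
struct vepc_region {
	uintptr_t start;
	size_t len;
};

struct vepc_region guest_vepc;		/* filled in when vEPC is mmap()ed */
volatile sig_atomic_t mce_pending;	/* set when a #MC should be injected */
void *volatile mce_addr;		/* faulting host virtual address */

/* Return 1 if this SIGBUS is an action-required MCE inside the region. */
int is_vepc_mce(const siginfo_t *si, const struct vepc_region *r)
{
	uintptr_t addr = (uintptr_t)si->si_addr;

	return si->si_signo == SIGBUS &&
	       si->si_code == BUS_MCEERR_AR &&
	       addr >= r->start && addr - r->start < r->len;
}

/* Only record the event here; the #MC injection happens elsewhere. */
void on_sigbus(int sig, siginfo_t *si, void *ctx)
{
	(void)sig;
	(void)ctx;
	if (is_vepc_mce(si, &guest_vepc)) {
		mce_addr = si->si_addr;
		mce_pending = 1;
	}
}

/* Install the handler with SA_SIGINFO so si_code/si_addr are available. */
int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

The point is only that si_code and si_addr together are what let
userspace tell this case apart from any other SIGBUS.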

>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>> being shared by multiple VMs via fork().  However KVM doesn't support
>> running a VM across multiple mm structures, and the de facto userspace
>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>> this should not happen.
> 
> This is out of the blue.  Why is this here?
> 
> What happens if a hypervisor *DOES* fork()?  What's the fallout?

This part originates from the discussion below:

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@xxxxxxxxx/#t

It is intended to answer the question:

    Do you think the processes sharing the same enclave need to be
    killed, even they had not touched the EPC page with hardware error?

Dave, do you mean it is not appropriate to put this in the changelog?

Best Regards,
Zhiquan