Re: [PATCH v9 3/3] x86/sgx: Fine grained SGX MCA behavior for virtualization

Zhiquan Li <zhiquan1.li@xxxxxxxxx> · Thu, 13 Oct 2022 22:40:56 +0800

On 2022/10/12 22:36, Dave Hansen wrote:
> This is really just mechanically restating what the patch does.  It
> doesn't help me understand how it achieves the goal.  I guess I'll just
> go and write the changelog for you.  Here's what I was missing:
> 
> 	There is already a signal-based ABI to tell userspace about
> 	machine checks.  But, SGX does not use that ABI.  Today, the
> 	kernel delivers a generic SIGBUS if a machine check occurs when
> 	accessing SGX memory.  Userspace can not differentiate that
> 	SIGBUS from <add example here>, so it is very unlikely to be
> 	able to recover from the signal and the app will die.
> 
> 	To fix this, have the SGX machine check code generate a SIGBUS
> 	which leverages the existing BUS_MCEERR_AR ABI.  This enlightens
> 	userspace about why the SIGBUS was generated and gives it a
> 	chance of being able to handle the signal.
> 
> 	QEMU, for instance, has code to handle these BUS_MCEERR_AR
> 	signals today.  Without this patch, QEMU will just die in the
> 	face of a generic SIGBUS, and take the whole VM with it.  With
> 	this patch <explain what QEMU actually does here>.
> 
> 	In short, BUS_MCEERR_AR enables QEMU to reduce the blast radius
> 	down from the whole QEMU process to a single page.
> 
> This patch doesn't *actually* reduce the blast radius.  It enables QEMU
> to do that.
> 

Perfect, Dave! I'm ashamed to have to get your hands dirty in the end.
I'd like to replace the angle brackets content as below, could you have
a look?

	There is already a signal-based ABI to tell userspace about
	machine checks.  But, SGX does not use that ABI.  Today, the
	kernel delivers a generic SIGBUS if a machine check occurs when
	accessing SGX memory.  Userspace can not differentiate that
	SIGBUS from <a recoverable machine check error>, so it is very
	unlikely to be able to recover from the signal and the app will
	die.
	...
	QEMU, for instance, has code to handle these BUS_MCEERR_AR
	signals today.  Without this patch, QEMU will just die in the
	face of a generic SIGBUS, and take the whole VM with it.  With
	this patch <QEMU will be aware of the machine check error and
	retrieve the virtual address (HVA) from siginfo, then convert it
	to GPA and inject a #MC into VM. Guest kernel will handle this
	#MC, find and kill the task who had accessed the error memory.>
	...

>>>> SGX virtual EPC driver doesn't explicitly prevent virtual EPC instance
>>>> being shared by multiple VMs via fork().  However KVM doesn't support
>>>> running a VM across multiple mm structures, and the de facto userspace
>>>> hypervisor (Qemu) doesn't use fork() to create a new VM, so in practice
>>>> this should not happen.
>>> This is out of the blue.  Why is this here?
>>>
>>> What happens if a hypervisor *DOES* fork()?  What's the fallout?
>> This part originates from below discussion:
>>
>> https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@xxxxxxxxx/#t
>>
>> It intents to answer the question:
>>
>>     Do you think the processes sharing the same enclave need to be
>>     killed, even they had not touched the EPC page with hardware error?
>>
>> Dave, do you mean it's not appropriate to be put here?
> It's actually a pretty important point, but it's still out of the blue.
> 
> You also didn't answer my question.

Oh, sorry, I focused on answering "Why is this here?" but forgot to
answer "What's the fallout?"

It's a very good question.

Looks like Kai had answered it before:

	For instance, an application can open /dev/sgx_vepc, mmap(), and
	fork().  Then if the child does mmap() against the fd opened by
	the parent, the child will share the same vepc with the parent.

	...

	Sharing virtual EPC instance will very likely unexpectedly break
	enclaves in all VMs.
	...

https://lore.kernel.org/linux-sgx/52dc7f50b68c99cecb9e1c3383d9c6d88734cd67.camel@xxxxxxxxx/#t

Moreover, I checked the code in below scenario while child sharing the
virtual EPC instance with parent:
- child closes /dev/sgx_vepc earlier than parent.
- child re-mmap() /dev/sgx_vepc to the same memory region as parent.
- child mmap() /dev/sgx_vepc to different memory region.
- child accesses the memory region of mmap() inherited from parent.

It's just that the app messes itself up, the vepc instance is protected
very well.
Maybe there are other corner cases I've not considered?

Best Regards,
Zhiquan