Re: [PATCH 0/4] x86/sgx: fine grained SGX MCA behavior

Zhiquan Li <zhiquan1.li@xxxxxxxxx> · Thu, 12 May 2022 20:03:30 +0800

On 2022/5/11 18:29, Jarkko Sakkinen wrote:
> On Tue, May 10, 2022 at 11:16:46AM +0800, Zhiquan Li wrote:
>> Hi everyone,
>>
>> This series contains a few patches to fine grained SGX MCA behavior.
>>
>> When VM guest access a SGX EPC page with memory failure, current
>> behavior will kill the guest, expected only kill the SGX application
>> inside it.
>>
>> To fix it we send SIGBUS with code BUS_MCEERR_AR and some extra
>> information for hypervisor to inject #MC information to guest, which
>> is helpful in SGX virtualization case.
>>
>> However, current SGX data structures are insufficient to track the
>> EPC pages for vepc, so we introduce a new struct sgx_vepc_page which
>> can be the owner of EPC pages for vepc and saves the useful info of
>> EPC pages for vepc, like struct sgx_encl_page.
>>
>> Moreover, canonical memory failure collects victim tasks by iterating
>> all the tasks one by one and use reverse mapping to get victim tasks’
>> virtual address. This is not necessary for SGX - as one EPC page can
>> be mapped to ONE enclave only. So, this 1:1 mapping enforcement
>> allows us to find task virtual address with physical address
>> directly.
> 
> Hmm... An enclave can be shared by multiple processes. The virtual
> address is the same but there can be variable number of processes
> having it mapped.

Thanks for your review, Jarkko.
You’re right, enclave can be shared.

Actually, we had discussed this issue internally. Assuming below
scenario:
An enclave provides multiple ecalls and services for several tasks. If
one task invokes an ecall and meets MCE, but the other tasks would not
use that ecall, shall we kill all the sharing tasks immediately? It looks
a little abrupt. Maybe it’s better to kill them when they really meet the
HW poison page.
Furthermore, once an EPC page has been poisoned, it will not be allocated
anymore, so it would not be propagated.
Therefore, we minimized the changes, just fine grained the behavior of
SIGBUG and kept the other behavior as before.

Do you think the processes sharing the same enclave need to be killed,
even they had not touched the EPC page with hardware error?
Any ideas are welcome.

Best Regards,
Zhiquan

> 
>>
>> Then we extend the solution for the normal SGX case, so that the task
>> has opportunity to make further decision while EPC page has memory
>> failure.
>>
>> Tests:
>> 1. MCE injection test for SGX in VM.
>>    As we expected, the application was killed and VM was alive.
>> 2. MCE injection test for SGX on host.
>>    As we expected, the application received SIGBUS with extra info.
>> 3. Kernel selftest/sgx: PASS
>> 4. Internal SGX stress test: PASS
>> 5. kmemleak test: No memory leakage detected.
>>
>> Zhiquan Li (4):
>>   x86/sgx: Move struct sgx_vepc definition to sgx.h
>>   x86/sgx: add struct sgx_vepc_page to manage EPC pages for vepc
>>   x86/sgx: Fine grained SGX MCA behavior for virtualization
>>   x86/sgx: Fine grained SGX MCA behavior for normal case
>>
>>  arch/x86/kernel/cpu/sgx/main.c | 24 ++++++++++++++++++++++--
>>  arch/x86/kernel/cpu/sgx/sgx.h  | 12 ++++++++++++
>>  arch/x86/kernel/cpu/sgx/virt.c | 29 +++++++++++++++++++----------
>>  3 files changed, 53 insertions(+), 12 deletions(-)
>>
>> -- 
>> 2.25.1
>>
> 
> BR, Jarkko