Re: [PATCH 0/4] x86/sgx: fine grained SGX MCA behavior

Jarkko Sakkinen <jarkko@xxxxxxxxxx> · Fri, 13 May 2022 17:38:09 +0300

On Thu, May 12, 2022 at 08:03:30PM +0800, Zhiquan Li wrote:
> 
> On 2022/5/11 18:29, Jarkko Sakkinen wrote:
> > On Tue, May 10, 2022 at 11:16:46AM +0800, Zhiquan Li wrote:
> >> Hi everyone,
> >>
> >> This series contains a few patches to fine grained SGX MCA behavior.
> >>
> >> When VM guest access a SGX EPC page with memory failure, current
> >> behavior will kill the guest, expected only kill the SGX application
> >> inside it.
> >>
> >> To fix it we send SIGBUS with code BUS_MCEERR_AR and some extra
> >> information for hypervisor to inject #MC information to guest, which
> >> is helpful in SGX virtualization case.
> >>
> >> However, current SGX data structures are insufficient to track the
> >> EPC pages for vepc, so we introduce a new struct sgx_vepc_page which
> >> can be the owner of EPC pages for vepc and saves the useful info of
> >> EPC pages for vepc, like struct sgx_encl_page.
> >>
> >> Moreover, canonical memory failure collects victim tasks by iterating
> >> all the tasks one by one and use reverse mapping to get victim tasks’
> >> virtual address. This is not necessary for SGX - as one EPC page can
> >> be mapped to ONE enclave only. So, this 1:1 mapping enforcement
> >> allows us to find task virtual address with physical address
> >> directly.
> > 
> > Hmm... An enclave can be shared by multiple processes. The virtual
> > address is the same but there can be variable number of processes
> > having it mapped.
> 
> Thanks for your review, Jarkko.
> You’re right, enclave can be shared.
> 
> Actually, we had discussed this issue internally. Assuming below
> scenario:
> An enclave provides multiple ecalls and services for several tasks. If
> one task invokes an ecall and meets MCE, but the other tasks would not
> use that ecall, shall we kill all the sharing tasks immediately? It looks
> a little abrupt. Maybe it’s better to kill them when they really meet the
> HW poison page.
> Furthermore, once an EPC page has been poisoned, it will not be allocated
> anymore, so it would not be propagated.
> Therefore, we minimized the changes, just fine grained the behavior of
> SIGBUG and kept the other behavior as before.
> 
> Do you think the processes sharing the same enclave need to be killed,
> even they had not touched the EPC page with hardware error?
> Any ideas are welcome.

I do not think the patch set is going to wrong direction. This discussion
was just missing from the cover letter.

BR, Jarkko