On Thu, May 12, 2022 at 08:03:30PM +0800, Zhiquan Li wrote: > > On 2022/5/11 18:29, Jarkko Sakkinen wrote: > > On Tue, May 10, 2022 at 11:16:46AM +0800, Zhiquan Li wrote: > >> Hi everyone, > >> > >> This series contains a few patches to fine grained SGX MCA behavior. > >> > >> When VM guest access a SGX EPC page with memory failure, current > >> behavior will kill the guest, expected only kill the SGX application > >> inside it. > >> > >> To fix it we send SIGBUS with code BUS_MCEERR_AR and some extra > >> information for hypervisor to inject #MC information to guest, which > >> is helpful in SGX virtualization case. > >> > >> However, current SGX data structures are insufficient to track the > >> EPC pages for vepc, so we introduce a new struct sgx_vepc_page which > >> can be the owner of EPC pages for vepc and saves the useful info of > >> EPC pages for vepc, like struct sgx_encl_page. > >> > >> Moreover, canonical memory failure collects victim tasks by iterating > >> all the tasks one by one and use reverse mapping to get victim tasks’ > >> virtual address. This is not necessary for SGX - as one EPC page can > >> be mapped to ONE enclave only. So, this 1:1 mapping enforcement > >> allows us to find task virtual address with physical address > >> directly. > > > > Hmm... An enclave can be shared by multiple processes. The virtual > > address is the same but there can be variable number of processes > > having it mapped. > > Thanks for your review, Jarkko. > You’re right, enclave can be shared. > > Actually, we had discussed this issue internally. Assuming below > scenario: > An enclave provides multiple ecalls and services for several tasks. If > one task invokes an ecall and meets MCE, but the other tasks would not > use that ecall, shall we kill all the sharing tasks immediately? It looks > a little abrupt. Maybe it’s better to kill them when they really meet the > HW poison page. > Furthermore, once an EPC page has been poisoned, it will not be allocated > anymore, so it would not be propagated. > Therefore, we minimized the changes, just fine grained the behavior of > SIGBUG and kept the other behavior as before. > > Do you think the processes sharing the same enclave need to be killed, > even they had not touched the EPC page with hardware error? > Any ideas are welcome. I do not think the patch set is going to wrong direction. This discussion was just missing from the cover letter. BR, Jarkko