Hi Dave,

On 4/28/2022 3:53 PM, Dave Hansen wrote:
> On 4/28/22 15:20, Reinette Chatre wrote:
>> Hi Dave,
>>
>> On 4/28/2022 2:30 PM, Dave Hansen wrote:
>>> On 4/28/22 13:11, Reinette Chatre wrote:
>>
>>> Are there any transient, recoverable errors that can come back from
>>> ELDU? If so, this makes a lot of sense. If not, then it doesn't make a
>>> lot of sense to preserve the swapped-out content because the enclave is
>>> going to die anyway.
>>
>> Good point.
>>
>> Theoretically ELDU could encounter a page fault while accessing the
>> regions it needs to read from and write to. These faults are passed
>> through and the instruction would return with a #PF that is
>> propagated with the page fault handler returning SIGBUS.
>
> We don't have to worry about those, though, do we? We're operating
> entirely on kernel mappings that won't cause #PF.

Indeed, yes, I do not see how an ELDU error or fault is recoverable.

>
>> Even so, this flow also impacts the SGX2 flows that need to load pages from
>> the backing store. In this case the kernel would pass it as an error
>> (-EFAULT) to the runtime but it would not result in the
>> enclave being killed. If it was a #PF that caused the issue then
>> perhaps theoretically the SGX2 instruction has a chance of succeeding
>> if the runtime attempts it again?
>
> How are the SGX2 flows different than what we have now?

SGX2 uses the same flow as the page fault handler to load the page
into the enclave. The only difference is that the SGX2 flow removed
the VMA permission checks. See:
https://lore.kernel.org/lkml/db3a14f2d2df7678dec23375d48c96b603f8cfb5.1649878359.git.reinette.chatre@xxxxxxxxx/

As per the trace printed in the WARN, the issue being investigated is
triggered by the ELDU run as part of the page fault handler, not by
the SGX2 flows.

>
> I also looked a little deeper at this transient failure problem. The
> ELDU documentation also mentions a possible error code of:
>
> SGX_EPC_PAGE_CONFLICT
>
> It *looks* like there can be conflicts on the SECS page as well as the
> EPC page being explicitly accessed. Is that a possible problem here?

I went down this path myself. SGX_EPC_PAGE_CONFLICT is an error code
supported by the newer ELDUC. The ELDU used in current code would
indeed #GP in this case.

The SDM text describing ELDUC as "This leaf function behaves like ELDU
but with improved conflict handling for oversubscription" really does
seem relevant to the test that triggers this issue.

I stopped pursuing this because, from what I understand, if
SGX_EPC_PAGE_CONFLICT is encountered with commit 08999b2489b4
("x86/sgx: Free backing memory after faulting the enclave page") then
it should also be encountered without it. The issue is not present
with that commit removed.

I am thus currently investigating based on the assumption that the #GP
is encountered because of a MAC verification problem. I may be wrong
here as well and need more information, since the SDM documents two
seemingly related errors:

#GP(0) -> If the instruction fails to verify MAC.
SGX_MAC_COMPARE_FAIL -> If the MAC check fails.

Reinette
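
To make the fault versus error-code distinction above concrete, here is
a minimal standalone sketch, not kernel code. It assumes a return value
where a flag bit marks "the leaf faulted" and the low bits carry the
trap number, and otherwise the value is an SGX error code. All EX_*
names are hypothetical, the error-code and flag values are placeholders
and are not taken from the SDM, and treating a page conflict as
retryable is my assumption based on the ELDUC description:

```c
#include <stdbool.h>

/* Hypothetical encoding: this bit set means the leaf faulted. */
#define EX_FAULT_FLAG            0x40000000
/* x86 exception vectors for #GP and #PF. */
#define EX_TRAP_GP               13
#define EX_TRAP_PF               14
/* Placeholder error codes for illustration only (not SDM values). */
#define EX_SGX_MAC_COMPARE_FAIL  1
#define EX_SGX_EPC_PAGE_CONFLICT 2

enum ex_eldu_outcome { EX_OK, EX_RETRY, EX_FATAL };

/* True when the return value encodes a fault rather than an error code. */
static inline bool ex_faulted(int ret)
{
	return (ret & EX_FAULT_FLAG) != 0;
}

/* Extract the trap number from a faulted return value. */
static inline int ex_trapnr(int ret)
{
	return ret & ~EX_FAULT_FLAG;
}

/* Classify the result of a (modelled) ELDU/ELDUC invocation. */
static inline enum ex_eldu_outcome ex_classify(int ret)
{
	if (ret == 0)
		return EX_OK;

	/*
	 * A fault (#GP or #PF) is treated as unrecoverable here. With
	 * plain ELDU a MAC verification failure and a page conflict
	 * both show up as #GP, so they cannot be told apart.
	 */
	if (ex_faulted(ret))
		return EX_FATAL;

	/* Only an ELDUC-style leaf would report a conflict as a code. */
	if (ret == EX_SGX_EPC_PAGE_CONFLICT)
		return EX_RETRY;

	/* Any other error code, including a MAC mismatch, is fatal. */
	return EX_FATAL;
}
```

With this model, ex_classify(EX_FAULT_FLAG | EX_TRAP_GP) is EX_FATAL no
matter which condition raised the #GP, which is why the #GP alone does
not say whether the MAC check or a conflict was the cause.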