Hi Dave, On 1/19/2022 11:51 AM, Dave Hansen wrote: > On 1/18/22 3:05 PM, Reinette Chatre wrote: >> The machine check recovery handling in SGX added the changes >> listed below to the freeing of pages in sgx_free_epc_page(). >> The SGX reclaimer contains an open coded version of >> sgx_free_epc_page() and thus did not obtain the changes in >> support of poison handling. > > I was trying to decide if this is an urgent fix or not. A more crisp > problem statement might have helped in the changelog. > > But, from what I can tell, the most probable troublesome scenario here > would be something like: > > 1. Machine check (#MC) occurs (asynchronous, !MF_ACTION_REQUIRED) > 2. arch_memory_failure() called is eventually > 3. (SGX) page->poison set to 1 > 4. Page is reclaimed > 5. Page added to normal free lists by sgx_reclaim_pages() > ^ This is the bug > 6. Page is reallocated by some innocent enclave, a second (synchronous) > in-kernel #MC is induced, probably during EADD instruction. > ^ This is the fallout from the bug > > #6 is unfortunate and can be avoided if this patch is applied. > > Basically, this patch ensures that a bad enclave page is isolated > quickly and causes a minimal amount of collateral damage. Is this a > valid summary? > > The SGX reclaimer code lacks page poison handling in its free > path. This can lead to completely avoidable machine checks if a > poisoned page is freed and reallocated instead of being > isolated. As I understand this code it does look like a valid summary to me. One detail is that the poison page handling is currently done for SECS pages when they are freed by the reclaimer (via sgx_reclaimer_write()). Thank you very much for the detailed analysis. Should I resend with an improved commit message that contains your scenario description and summary? Reinette