On Tue, Mar 23, 2021, Kai Huang wrote:
> On Mon, 22 Mar 2021 23:37:26 +0100 Borislav Petkov wrote:
> > "The instruction fails if the operand is not properly aligned or does
> > not refer to an EPC page or the page is in use by another thread, or
> > other threads are running in the enclave to which the page belongs. In
> > addition the instruction fails if the operand refers to an SECS with
> > associations."
> >
> > And I guess those conditions will become more in the future.

Yep, IME these types of bugs rarely, if ever, lead to isolated failures.

> > Now, let's play. I'm the cloud admin and you're cloud OS customer
> > support. I say:
> >
> > "I got this scary error message while running enclaves on my server
> >
> > "EREMOVE returned ... . EPC page leaked. Reboot required to retrieve leaked pages."
> >
> > but I cannot reboot that machine because there are guests running on it
> > and I'm getting paid for those guests and I might get sued if I do?"
> >
> > Your turn, go wild.
>
> I suppose admin can migrate those VMs, and then engineers can analyse the root
> cause of such failure, and then fix it.

That's more than likely what will happen, though there are a lot of "ifs" and
"buts" in any answer, e.g. things will go downhill fast if the majority of
systems in the fleet are running the buggy kernel and are triggering the error.

Practically speaking, "basic" deployments of SGX VMs will be insulated from
this bug.  KVM doesn't support EPC oversubscription, so even if all EPC is
exhausted, new VMs will fail to launch, but existing VMs will continue to chug
along with no ill effects.  There are again caveats, e.g. if EPC is being
lazily allocated for VMs, then running VMs will be affected if a VM starts
using SGX after the leak in the host occurs.  But, IMO doing lazy allocation
_and_ running enclaves in the host falls firmly into the "advanced" bucket;
anyone going that route had better do their homework to understand the various
EPC interactions.
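
To make the eager vs. lazy distinction concrete, here is a toy model of the
accounting only.  It is a sketch under loud assumptions: EpcPool, Vm, leak()
and touch() are hypothetical names invented for illustration, none of this is
KVM or SGX driver code, and a "leak" is modeled simply as pages vanishing from
the host's free pool.

```python
class EpcPool:
    """Hypothetical host EPC pool; tracks page counts only."""

    def __init__(self, total_pages):
        self.free = total_pages
        self.leaked = 0

    def alloc(self, n):
        """Take n pages; return False and change nothing if too few are free."""
        if n > self.free:
            return False
        self.free -= n
        return True

    def leak(self, n):
        """Model failed EREMOVEs: n pages are lost to the pool until reboot."""
        n = min(n, self.free)
        self.free -= n
        self.leaked += n


class Vm:
    """Hypothetical VM with a fixed EPC size and no oversubscription."""

    def __init__(self, pool, epc_pages, lazy):
        self.pool = pool
        self.epc_pages = epc_pages
        self.backed = 0
        # Eager allocation: back all EPC at launch, so shortage fails here.
        if not lazy and not self.touch(epc_pages):
            raise MemoryError("launch failed: not enough EPC")

    def touch(self, n):
        """Guest starts using n more EPC pages; back them from the host pool."""
        n = min(n, self.epc_pages - self.backed)
        if not self.pool.alloc(n):
            return False
        self.backed += n
        return True


pool = EpcPool(100)
eager_vm = Vm(pool, 40, lazy=False)   # fully backed at launch
lazy_vm = Vm(pool, 40, lazy=True)     # launches with nothing backed
pool.leak(60)                         # host leaks every remaining free page

eager_ok = eager_vm.backed == 40      # True: already-backed VM is unaffected
lazy_ok = lazy_vm.touch(10)           # False: running VM hits the leak late
```

In this model a new eager VM launched after the leak raises MemoryError at
construction, which corresponds to the "new VMs will fail to launch" case,
while the lazily backed VM is the one that gets hurt while running.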