On Wed, 2018-09-26 at 14:15 -0700, Andy Lutomirski wrote:
> On Wed, Sep 26, 2018 at 1:55 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
> >
> > On 09/26/2018 01:44 PM, Sean Christopherson wrote:
> > > On Wed, Sep 26, 2018 at 01:16:59PM -0700, Dave Hansen wrote:
> > > > We also need to clarify how this can happen.  Is it through something
> > > > that an app does, or is it solely when the hardware does something
> > > > under the covers, like suspend/resume.
> > > Are you looking for something in the changelog, the comment, or just
> > > a response?  If it's the latter...
> > Comments, please.
> >
> > > On bare metal with a bug-free kernel, the only scenario I'm aware of
> > > where we'll encounter these faults is when hardware pulls the rug out
> > > from under us.  In a virtualized environment all bets are off because
> > > the architecture allows VMMs to silently "destroy" the EPC at will,
> > > e.g. KVM, and I believe Hyper-V, will take advantage of this behavior
> > > to support live migration.  Post migration, the destination system
> > > will generate PF_SGX because the EPC{M} can't be migrated between
> > > systems, i.e. the destination EPCM sees all EPC pages as invalid.
> > OK, cool.
> >
> > That's good background fodder for the changelog.
> >
> > But, for the comment, I'm happy with something like this:
> >
> >     /*
> >      * The fault resulted from violation of SGX-specific access-
> >      * controls.  This is expected to be the result of some lower
> >      * layer action (CPU suspend/resume, VM migration) and is
> >      * not related to anything the OS did.  Treat it as an access
> >      * error to ensure it is passed up to the app via a signal where
> >      * it can be handled.
> >      */
> >
> > I really don't think we need to delve too deeply into the relationship
> > between EPCM and PTEs or anything.  Let's just say, "it's not the
> > kernel's fault, it's not the app's fault, so throw up our hands".
>
> There is a non-nitpicky consideration here.  Logically, user code is
> going to do this (totally made-up pseudocode):
>
> enclave_t enclave = load_and_init_enclave(...);
> int ret = sgx_run(enclave, some pointers to non-enclave-memory buffers, ...);
>
> and, with the code in this patch, a correct implementation of
> sgx_run() requires installing a signal handler.  This is nasty, since
> signal handlers, especially for something like SIGSEGV or SIGBUS, are
> not fantastic to say the least in libraries.
>
> Could we perhaps have a little vDSO entry (or syscall, I suppose) that
> runs an enclave and returns an error code, and rig up the #PF handler
> to check if the error happened in the vDSO entry and fix it up rather
> than sending a signal?

If we want to avoid having to install a signal handler then I'm pretty
sure we'd need to fixup all #GPs and "bad access" #PFs that occur on
EENTER or in the enclave, not just PF_SGX faults.  SGX1 hardware takes
a #GP instead of a #PF on EPCM faults, and SGX2 hardware allows
enclaves to allocate/free/adjust EPC pages at runtime, e.g. an enclave
runtime might want to intercept #PFs from within the enclave so that
the enclave can dynamically grow its stack.

> On Windows, this is much less of a concern, because Windows has real
> scoped fault handling.  But Linux doesn't, at least not yet.
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC
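
To make the fixup idea concrete, the userspace-facing contract might
look something like the sketch below.  This is purely illustrative
pseudocode for the interface being discussed, not an existing API; the
names (sgx_enter_enclave, struct sgx_exception, handle_enclave_fault)
and the exact set of reported fields are invented here:

    /*
     * Hypothetical vDSO entry: EENTERs the enclave and, if a #GP or a
     * "bad access" #PF occurs on EENTER or inside the enclave, reports
     * the fault to the caller instead of the kernel delivering a signal.
     */
    struct sgx_exception {
            int trapnr;               /* e.g. 13 (#GP) or 14 (#PF) */
            unsigned long addr;       /* faulting address, if applicable */
            unsigned long error_code; /* hardware error code, e.g. PF_SGX */
    };

    /* Returns 0 on a clean EEXIT, -EFAULT if a fault was fixed up. */
    int sgx_enter_enclave(void *tcs, void *arg, struct sgx_exception *exc);

A library's sgx_run() could then be written without touching
SIGSEGV/SIGBUS at all:

    int sgx_run(enclave_t enclave, void *buf)
    {
            struct sgx_exception exc;

            if (sgx_enter_enclave(enclave.tcs, buf, &exc))
                    return handle_enclave_fault(&exc); /* library policy */
            return 0;
    }

On the kernel side, this implies a fixup path in the #GP and #PF
handlers that recognizes faults taken with RIP in the vDSO entry (or in
an enclave entered through it) and redirects execution to a landing pad
in the vDSO instead of raising a signal.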