Re: [PATCH 5/7] x86/sgx: Add flag to zero added region instead of copying from source

Sean Christopherson <sean.j.christopherson@xxxxxxxxx> · Thu, 6 Jun 2019 10:32:43 -0700

On Thu, Jun 06, 2019 at 10:20:38AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 5, 2019 at 12:49 PM Sean Christopherson
> <sean.j.christopherson@xxxxxxxxx> wrote:
> >
> > For some enclaves, e.g. an enclave with a small code footprint and a
> > large working set, the vast majority of pages added to the enclave are
> > zero pages.  Introduce a flag to denote such zero pages.  The major
> > benefit of the flag will be realized in a future patch to use Linux's
> > actual zero page as the source, as opposed to explicitly zeroing the
> > enclave's backing memory.
> >
> 
> I feel like I probably asked this at some point, but why is there a
> workqueue here at all?

Performance.  A while back I wrote a patch set to remove the worker queue
and discovered that it tanks enclave build time when the enclave is being
hosted by a Golang application.  Here's a snippet from a mail discussing
the code.

    The bad news is that I don't think we can remove the add page worker
    as applications with userspace schedulers, e.g. Go's M:N scheduler,
    can see a 10x or more throughput improvement when using the worker
    queue.  I did a bit of digging for the Golang case to make sure I
    wasn't doing something horribly stupid/naive and found that it's a
    generic issue in Golang with blocking (or just long-running) system
    calls.  Because Golang multiplexes Goroutines on top of OS threads,
    blocking syscalls introduce latency and context switching overhead,
    e.g. Go's scheduler will spin up a new OS thread to service other
    Goroutines after it realizes the syscall has blocked, and will later
    destroy one of the OS threads so that it doesn't build up too many
    unused.

IIRC, the scenario is spinning up several goroutines, each building an
enclave.  I played around with adding a flag to do a synchronous EADD
but didn't see a meaningful change in performance for the simple case.
Supporting both the worker queue and direct paths was complex enough
that I decided it wasn't worth the trouble for initial upstreaming.