Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

"Huang, Kai" <kai.huang@xxxxxxxxx> · Tue, 10 Oct 2023 00:50:33 +0000

On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru:	LRU that is low
> > > + *
> > > + * Return:	%true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +	struct sgx_epc_page *victim;
> > > +
> > > +	spin_lock(&lru->lock);
> > > +	victim = sgx_oom_get_victim(lru);
> > > +	spin_unlock(&lru->lock);
> > > +
> > > +	if (!victim)
> > > +		return false;
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +		return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +		return sgx_oom_encl(victim->encl);
> > 
> > I hate to bring this up, at least at this stage, but I am wondering why we need
> > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > "enclave_list" instead?
> 
> The motivation for tracking EPC pages instead of enclaves was so that the EPC
> OOM-killer could "kill" VMs as well as host-owned enclaves.  
> 

Ah this seems a fair argument. :-)

> The virtual EPC code
> didn't actually kill the VM process, it instead just freed all of the EPC pages
> and abused the SGX architecture to effectively make the guest recreate all its
> enclaves (IIRC, QEMU does the same thing to "support" live migration).

It returns SIGBUS.  SGX VM live migration also requires enough EPC being able to
be allocated on the destination machine to work AFAICT.

> 
> Looks like y'all punted on that with:
> 
>   The EPC pages allocated for KVM guests by the virtual EPC driver are not
>   reclaimable by the host kernel [5]. Therefore they are not tracked by any
>   LRU lists for reclaiming purposes in this implementation, but they are
>   charged toward the cgroup of the user processs (e.g., QEMU) launching the
>   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
>   driver will stop allocating more EPC for the VM, and return SIGBUS to the
>   user process which would abort the VM launch.
> 
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.  
> 

"enforced" do you mean?

Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
which eventually results in KVM returning -EFAULT to userspace in vcpu_run(). 
And Qemu then kills the VM with some nonsense message:

        error: kvm run failed Bad address
        <dump guest registers nonsense>

> Relying
> on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> its limit in perpetuity.

Agreed.  But this looks cannot resolved until we can reclaim EPC page from VM.

Or in the EPC cgroup code we can refuse to set the maximum which cannot be
supported, e.g., less the total virtual EPC size.

I guess the second is acceptable for now?

> 
> I can see userspace wanting to explicitly terminate the VM instead of "silently"
> the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> code.

See above for the second option.