On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:	LRU that is low
> > > > + *
> > > > + * Return:	%true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +	struct sgx_epc_page *victim;
> > > > +
> > > > +	spin_lock(&lru->lock);
> > > > +	victim = sgx_oom_get_victim(lru);
> > > > +	spin_unlock(&lru->lock);
> > > > +
> > > > +	if (!victim)
> > > > +		return false;
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +		return sgx_oom_encl(victim->encl);
> > >
> > > I hate to bring this up, at least at this stage, but I am wondering why we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> >
> > The motivation for tracking EPC pages instead of enclaves was so that the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.
>
> Ah this seems a fair argument. :-)
>
> > The virtual EPC code
> > didn't actually kill the VM process, it instead just freed all of the EPC pages
> > and abused the SGX architecture to effectively make the guest recreate all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
>
> It returns SIGBUS.  SGX VM live migration also requires enough EPC being able to
> be allocated on the destination machine to work AFAICT.
>
> > Looks like y'all punted on that with:
> >
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to the
> >   user process which would abort the VM launch.
> >
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.
>
> "enforced" do you mean?
>
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run().
> And Qemu then kills the VM with some nonsense message:
>
> 	error: kvm run failed Bad address
> 	<dump guest registers nonsense>
>
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> > its limit in perpetuity.
>
> > I can see userspace wanting to explicitly terminate the VM instead of "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> > code.

I guess I slightly misunderstood your words.

You mean we want to kill the VM when the limit is set lower than the virtual
EPC size.  This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for
virtual EPC too?

In sgx_vepc_fault(), we can check this flag early and return SIGBUS if it is
set.
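Roughly like below, purely as a sketch -- the "oom_killed" flag is made up here
(a per-instance analogue of SGX_ENCL_NO_MEMORY, not in this patch), and the
existing -EBUSY/VM_FAULT_RETRY handling is elided:

static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct sgx_vepc *vepc = vma->vm_private_data;
	int ret;

	/*
	 * Early check: if the EPC cgroup OOM handler already picked this
	 * instance as a victim, refuse to allocate any more EPC for it.
	 */
	if (READ_ONCE(vepc->oom_killed))
		return VM_FAULT_SIGBUS;

	mutex_lock(&vepc->lock);
	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
	mutex_unlock(&vepc->lock);

	if (!ret)
		return VM_FAULT_NOPAGE;

	return VM_FAULT_SIGBUS;
}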
But this also requires keeping virtual EPC pages in some list, and handling
them in sgx_epc_oom() too (rough sketch at the end of this mail).

And for virtual EPC pages, I guess the "young" logic can be applied too, so it
is probably better to keep the actual virtual EPC pages on a (separate?) list
instead of keeping the virtual EPC instance:

struct sgx_epc_lru {
	struct list_head reclaimable;
	struct sgx_encl *enclaves;
	struct list_head vepc_pages;
};

Or still track VA/SECS and virtual EPC pages in a single unreclaimable list?
I don't know :-)
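Either way the OOM code would need a way to "kill" a virtual EPC instance.
Something like below, again only as a sketch -- SGX_EPC_OWNER_VEPC,
victim->vepc, sgx_oom_vepc() and the "oom_killed" flag are all made-up names,
not in this patch:

	/* In sgx_epc_oom(): a third owner type for virtual EPC pages. */
	if (victim->flags & SGX_EPC_OWNER_VEPC)
		return sgx_oom_vepc(victim->vepc);

/* "Kill" the virtual EPC instance owning the victim page. */
static bool sgx_oom_vepc(struct sgx_vepc *vepc)
{
	/* Make further sgx_vepc_fault() calls return SIGBUS (early check above). */
	WRITE_ONCE(vepc->oom_killed, true);

	/*
	 * Then zap the EPC pages already handed out to this instance, e.g. by
	 * reusing the SGX_IOC_VEPC_REMOVE_ALL zapping logic.
	 */
	return true;
}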