Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC

"Huang, Kai" <kai.huang@xxxxxxxxx> · Wed, 11 Oct 2023 00:51:17 +0000

On Tue, 2023-10-10 at 11:49 -0500, Haitao Huang wrote:
> On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@xxxxxxxxx> wrote:
> 
> > On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> > > On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > > > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > > > +/**
> > > > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target  
> > > LRU
> > > > > > + * @lru:	LRU that is low
> > > > > > + *
> > > > > > + * Return:	%true if a victim was found and kicked.
> > > > > > + */
> > > > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > > > +{
> > > > > > +	struct sgx_epc_page *victim;
> > > > > > +
> > > > > > +	spin_lock(&lru->lock);
> > > > > > +	victim = sgx_oom_get_victim(lru);
> > > > > > +	spin_unlock(&lru->lock);
> > > > > > +
> > > > > > +	if (!victim)
> > > > > > +		return false;
> > > > > > +
> > > > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > > > +
> > > > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > > > +		return sgx_oom_encl(victim->encl);
> > > > > 
> > > > > I hate to bring this up, at least at this stage, but I am wondering  
> > > why we need
> > > > > to put VA and SECS pages to the unreclaimable list, but cannot keep  
> > > an
> > > > > "enclave_list" instead?
> > > > 
> > > > The motivation for tracking EPC pages instead of enclaves was so that  
> > > the EPC
> > > > OOM-killer could "kill" VMs as well as host-owned enclaves. >
> > > 
> > > Ah this seems a fair argument. :-)
> > > 
> > > > The virtual EPC code
> > > > didn't actually kill the VM process, it instead just freed all of the  
> > > EPC pages
> > > > and abused the SGX architecture to effectively make the guest  
> > > recreate all its
> > > > enclaves (IIRC, QEMU does the same thing to "support" live migration).
> > > 
> > > It returns SIGBUS.  SGX VM live migration also requires enough EPC  
> > > being able to
> > > be allocated on the destination machine to work AFAICT.
> > > 
> > > > 
> > > > Looks like y'all punted on that with:
> > > > 
> > > >   The EPC pages allocated for KVM guests by the virtual EPC driver  
> > > are not
> > > >   reclaimable by the host kernel [5]. Therefore they are not tracked  
> > > by any
> > > >   LRU lists for reclaiming purposes in this implementation, but they  
> > > are
> > > >   charged toward the cgroup of the user processs (e.g., QEMU)  
> > > launching the
> > > >   guest.  And when the cgroup EPC usage reaches its limit, the  
> > > virtual EPC
> > > >   driver will stop allocating more EPC for the VM, and return SIGBUS  
> > > to the
> > > >   user process which would abort the VM launch.
> > > > 
> > > > which IMO is a hack, unless returning SIGBUS is actually enforced  
> > > somehow. >
> > > 
> > > "enforced" do you mean?
> > > 
> > > Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot  
> > > allocate
> > > EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in  
> > > hva_to_pfn(),
> > > which eventually results in KVM returning -EFAULT to userspace in  
> > > vcpu_run().
> > > And Qemu then kills the VM with some nonsense message:
> > > 
> > >         error: kvm run failed Bad address
> > >         <dump guest registers nonsense>
> > > 
> > > > Relying
> > > > on userspace to be kind enough to kill its VMs kinda defeats the  
> > > purpose of cgroup
> > > > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered,  
> > > userspace running
> > > > encalves in a VM could continue on and refuse to give up its EPC, and  
> > > thus run above
> > > > its limit in perpetuity.
> > > 
> > > > 
> > > > I can see userspace wanting to explicitly terminate the VM instead of  
> > > "silently"
> > > > the VM's enclaves, but that seems like it should be a knob in the  
> > > virtual EPC
> > > > code.
> > 
> > I guess I slightly misunderstood your words.
> > 
> > You mean we want to kill VM when the limit is set to be lower than  
> > virtual EPC
> > size.
> > 
> > This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual  
> > EPC too?
> > 
> 
> That flag is set for enclaves, do you mean we set similar flag in vepc  
> struct?
> 
> > In the sgx_vepc_fault(), we check this flag at early time and return  
> > SIGBUS if
> > it is set.
> > 
> > But this also requires keeping virtual EPC pages in some list, and  
> > handles them
> > in sgx_epc_oom() too.
> > 
> > And for virtual EPC pages, I guess the "young" logic can be applied thus
> > probably it's better to keep the actual virtual EPC pages to a  
> > (separate?) list
> > instead of keeping the virtual EPC instance.
> > 
> > 	struct sgx_epc_lru {
> > 		struct list_head reclaimable;
> > 		struct sgx_encl *enclaves;
> > 		struct list_head vepc_pages;
> > 	}
> > 
> > Or still tracking VA/SECS and virtual EPC pages in a single  
> > unrecliamable list?
> > 
> 
> One LRU should be OK as we only need relative order in which they are  
> loaded?

It's one way to do, but not the only way.

The disadvantage of using one unreclaimable list is, for VA/SECS pages you don't
have the concept of "young/age".  The only purpose of getting the page is to get
the owner enclave and kill it.

On the other hand, for virtual EPC pages we do have concept of "young/age",
although I think it's acceptable we don't implement this in the first version of
EPC cgroup support.

>From this point, perhaps it's better to  maintain VA/SECS (or enclaves)
separately from virtual EPC pages.  But it is not mandatory.

Another pro of maintaining them separately is you don't need these flags to
distinguish them from the single list: SGX_EPC_OWNER_{ENCL|PAGE|VEPC}, which I
dislike.

(btw, even you track VA/SECS pages in unreclaimable list, given they both have
'enclave' as the owner,  do you still need SGX_EPC_OWNER_ENCL and
SGX_EPC_OWNER_PAGE ?)

That being said, personally I'd slightly prefer to keep them in separate lists
but I can see it also depends on how you want to implement the algorithm of
selecting a victim.  So I am not going to insist for now but will leave to you
to decide.

Just remember to give justification of choosing so.

[...]

> It also means user  
> space needs to inject/pass the SIGBUS to guest (I'm not really familiar  
> with this part, or maybe it's already there?). @sean is that what you  
> mean? Sorry I've been slow on understanding this.

I don't think we need to do this for now.  Killing the VM is an acceptable start
to me.  Just make sure we can kill some VM.