On Mon, 09 Oct 2023 20:34:29 -0500, Huang, Kai <kai.huang@xxxxxxxxx> wrote:
On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> On Mon, Oct 09, 2023, Kai Huang wrote:
> > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > +/**
> > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > + * @lru: LRU that is low
> > > + *
> > > + * Return: %true if a victim was found and kicked.
> > > + */
> > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > +{
> > > +	struct sgx_epc_page *victim;
> > > +
> > > +	spin_lock(&lru->lock);
> > > +	victim = sgx_oom_get_victim(lru);
> > > +	spin_unlock(&lru->lock);
> > > +
> > > +	if (!victim)
> > > +		return false;
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > +		return sgx_oom_encl_page(victim->encl_page);
> > > +
> > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > +		return sgx_oom_encl(victim->encl);
> >
> > I hate to bring this up, at least at this stage, but I am wondering why
> > we need to put VA and SECS pages on the unreclaimable list, but cannot
> > keep an "enclave_list" instead?
>
> The motivation for tracking EPC pages instead of enclaves was so that the
> EPC OOM-killer could "kill" VMs as well as host-owned enclaves.
Ah this seems a fair argument. :-)
> The virtual EPC code didn't actually kill the VM process, it instead just
> freed all of the EPC pages and abused the SGX architecture to effectively
> make the guest recreate all its enclaves (IIRC, QEMU does the same thing
> to "support" live migration).
It returns SIGBUS.  SGX VM live migration also requires that enough EPC can
be allocated on the destination machine to work, AFAICT.
>
> Looks like y'all punted on that with:
>
>   The EPC pages allocated for KVM guests by the virtual EPC driver are
>   not reclaimable by the host kernel [5].  Therefore they are not tracked
>   by any LRU lists for reclaiming purposes in this implementation, but
>   they are charged toward the cgroup of the user process (e.g., QEMU)
>   launching the guest.  And when the cgroup EPC usage reaches its limit,
>   the virtual EPC driver will stop allocating more EPC for the VM, and
>   return SIGBUS to the user process, which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced somehow.
>
"enforced" do you mean?
Currently sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
an EPC page.  When this happens, KVM returns KVM_PFN_ERR_FAULT in
hva_to_pfn(), which eventually results in KVM returning -EFAULT to
userspace in vcpu_run().  And Qemu then kills the VM with a nonsense
message:

	error: kvm run failed Bad address
	<dump of guest registers>
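
A minimal sketch of that failure path, for reference (not the literal
kernel source; sgx_vepc_alloc_page() is a stand-in for the real allocation
helper):

	static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
	{
		struct sgx_vepc *vepc = vmf->vma->vm_private_data;
		struct sgx_epc_page *epc_page;

		mutex_lock(&vepc->lock);
		epc_page = sgx_vepc_alloc_page(vepc, vmf->address);
		mutex_unlock(&vepc->lock);

		/* No EPC left: SIGBUS -> KVM_PFN_ERR_FAULT -> -EFAULT. */
		if (IS_ERR(epc_page))
			return VM_FAULT_SIGBUS;

		return VM_FAULT_NOPAGE;
	}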
> Relying on userspace to be kind enough to kill its VMs kinda defeats the
> purpose of cgroup enforcement.  E.g. if the hard limit for an EPC cgroup
> is lowered, userspace running enclaves in a VM could continue on and
> refuse to give up its EPC, and thus run above its limit in perpetuity.
>
> I can see userspace wanting to explicitly terminate the VM instead of
> "silently" killing the VM's enclaves, but that seems like it should be a
> knob in the virtual EPC code.
I guess I slightly misunderstood your words.
You mean we want to kill the VM when the limit is set lower than the
virtual EPC size.
This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for virtual EPC
too?

That flag is set for enclaves; do you mean we set a similar flag in the
vepc struct?  In sgx_vepc_fault(), we would then check this flag early and
return SIGBUS if it is set.
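
A minimal sketch of that early check, assuming a hypothetical
SGX_VEPC_NO_MEMORY flag which the OOM path would set on the vepc instance:

	/* Hypothetical flag, mirroring SGX_ENCL_NO_MEMORY for enclaves. */
	#define SGX_VEPC_NO_MEMORY	BIT(0)

	static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
	{
		struct sgx_vepc *vepc = vmf->vma->vm_private_data;

		/* The OOM path already gave up on this instance: fail fast. */
		if (vepc->flags & SGX_VEPC_NO_MEMORY)
			return VM_FAULT_SIGBUS;

		/* Existing allocation path, hypothetical helper name. */
		return sgx_vepc_fault_page(vepc, vmf);
	}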
But this also requires keeping virtual EPC pages on some list, and handling
them in sgx_epc_oom() too.
And for virtual EPC pages, I guess the "young" logic can be applied too, so
it's probably better to keep the actual virtual EPC pages on a (separate?)
list instead of keeping the virtual EPC instance:
	struct sgx_epc_lru {
		struct list_head reclaimable;
		struct sgx_encl *enclaves;
		struct list_head vepc_pages;
	};
Or do we still track VA/SECS and virtual EPC pages in a single
unreclaimable list?  One LRU should be OK, as we only need the relative
order in which they were loaded: if a VA page is in front of a vEPC page,
we just kill the host-side enclave first before disrupting VMs in the same
group.  The group is not in a good situation anyway, so the kernel just
picks something reasonable to force-kill.
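
With a single list, the dispatch in sgx_epc_oom() quoted above could then
grow a third case; a sketch, assuming a hypothetical SGX_EPC_OWNER_VEPC
owner type and an sgx_oom_vepc() handler:

	if (victim->flags & SGX_EPC_OWNER_VEPC)
		return sgx_oom_vepc(victim->vepc);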
Also, after rereading the sentences "The virtual EPC code didn't actually
kill the VM process, it instead just freed all of the EPC pages and abused
the SGX architecture to effectively make the guest recreate all its
enclaves...":

Maybe by "kill" the VM, Sean means EREMOVE the vEPC pages on the
unreclaimable LRU, which effectively makes enclaves in the guest receive
an "EPC lost" error and forces those enclaves to be reloaded (all
reasonable user space implementations should already handle that).  I'm
not sure about freeing *all* of the EPC pages though; we should just
EREMOVE enough to bring the usage down, and disable new allocation in
sgx_vepc_fault() as Kai outlined above.  It also means user space needs to
inject/pass the SIGBUS to the guest (I'm not really familiar with this
part, or maybe it's already there?).  @sean, is that what you mean?  Sorry
I've been slow on understanding this.
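
A rough sketch of that "EREMOVE just enough" idea, reusing the hypothetical
sgx_oom_vepc() and SGX_VEPC_NO_MEMORY from above; the cgroup and victim
helpers are hypothetical as well:

	static bool sgx_oom_vepc(struct sgx_vepc *vepc)
	{
		struct sgx_epc_page *page;

		mutex_lock(&vepc->lock);

		/* Block new allocation; further faults return SIGBUS. */
		vepc->flags |= SGX_VEPC_NO_MEMORY;

		/* EREMOVE pages only until usage is back under the limit. */
		while (sgx_epc_cgroup_over_limit(vepc->cg)) {
			page = sgx_vepc_pick_victim_page(vepc);
			if (!page)
				break;
			sgx_vepc_free_page(page);	/* EREMOVE + free */
		}

		mutex_unlock(&vepc->lock);
		return true;
	}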
If this is the case, some may still think it's too disruptive to the guest,
because the guest does not get a chance to page out its less active enclave
pages first.  But it's the user's limit to set, so it is justifiable as
long as we document this behavior.
Thanks to both of you for great insights.
Haitao