On Tue, 2023-10-10 at 00:50 +0000, Huang, Kai wrote:
> On Mon, 2023-10-09 at 17:23 -0700, Sean Christopherson wrote:
> > On Mon, Oct 09, 2023, Kai Huang wrote:
> > > On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
> > > > +/**
> > > > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
> > > > + * @lru:	LRU that is low
> > > > + *
> > > > + * Return:	%true if a victim was found and kicked.
> > > > + */
> > > > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
> > > > +{
> > > > +	struct sgx_epc_page *victim;
> > > > +
> > > > +	spin_lock(&lru->lock);
> > > > +	victim = sgx_oom_get_victim(lru);
> > > > +	spin_unlock(&lru->lock);
> > > > +
> > > > +	if (!victim)
> > > > +		return false;
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
> > > > +		return sgx_oom_encl_page(victim->encl_page);
> > > > +
> > > > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
> > > > +		return sgx_oom_encl(victim->encl);
> > >
> > > I hate to bring this up, at least at this stage, but I am wondering why we need
> > > to put VA and SECS pages to the unreclaimable list, but cannot keep an
> > > "enclave_list" instead?
> >
> > The motivation for tracking EPC pages instead of enclaves was so that the EPC
> > OOM-killer could "kill" VMs as well as host-owned enclaves.
>
> Ah this seems a fair argument. :-)
>
> > The virtual EPC code
> > didn't actually kill the VM process, it instead just freed all of the EPC pages
> > and abused the SGX architecture to effectively make the guest recreate all its
> > enclaves (IIRC, QEMU does the same thing to "support" live migration).
>
> It returns SIGBUS.  SGX VM live migration also requires enough EPC being able to
> be allocated on the destination machine to work AFAICT.
>
> > Looks like y'all punted on that with:
> >
> >   The EPC pages allocated for KVM guests by the virtual EPC driver are not
> >   reclaimable by the host kernel [5]. Therefore they are not tracked by any
> >   LRU lists for reclaiming purposes in this implementation, but they are
> >   charged toward the cgroup of the user processs (e.g., QEMU) launching the
> >   guest.  And when the cgroup EPC usage reaches its limit, the virtual EPC
> >   driver will stop allocating more EPC for the VM, and return SIGBUS to the
> >   user process which would abort the VM launch.
> >
> > which IMO is a hack, unless returning SIGBUS is actually enforced somehow.
>
> "enforced" do you mean?
>
> Currently the sgx_vepc_fault() returns VM_FAULT_SIGBUS when it cannot allocate
> EPC page.  And when this happens, KVM returns KVM_PFN_ERR_FAULT in hva_to_pfn(),
> which eventually results in KVM returning -EFAULT to userspace in vcpu_run().
> And Qemu then kills the VM with some nonsense message:
>
> 	error: kvm run failed Bad address
> 	<dump guest registers nonsense>
>
> > Relying
> > on userspace to be kind enough to kill its VMs kinda defeats the purpose of cgroup
> > enforcement.  E.g. if the hard limit for a EPC cgroup is lowered, userspace running
> > encalves in a VM could continue on and refuse to give up its EPC, and thus run above
> > its limit in perpetuity.
>
> > I can see userspace wanting to explicitly terminate the VM instead of "silently"
> > the VM's enclaves, but that seems like it should be a knob in the virtual EPC
> > code.

I guess I slightly misunderstood your words.

You mean we want to kill the VM when the limit is set lower than the virtual
EPC size.  This patch adds SGX_ENCL_NO_MEMORY.  I guess we can use it for
virtual EPC too?

In sgx_vepc_fault(), we can check this flag early and return SIGBUS if it is
set.
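Roughly like below, purely as a sketch -- the "oom_killed" flag is made up here
(a per-instance analogue of SGX_ENCL_NO_MEMORY, not in this patch), and the
existing -EBUSY/VM_FAULT_RETRY handling is elided:

static vm_fault_t sgx_vepc_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct sgx_vepc *vepc = vma->vm_private_data;
	int ret;

	/*
	 * Early check: if the EPC cgroup OOM handler already picked this
	 * instance as a victim, refuse to allocate any more EPC for it.
	 */
	if (READ_ONCE(vepc->oom_killed))
		return VM_FAULT_SIGBUS;

	mutex_lock(&vepc->lock);
	ret = __sgx_vepc_fault(vepc, vma, vmf->address);
	mutex_unlock(&vepc->lock);

	if (!ret)
		return VM_FAULT_NOPAGE;

	return VM_FAULT_SIGBUS;
}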
But this also requires keeping virtual EPC pages in some list, and handling
them in sgx_epc_oom() too (rough sketch at the end of this mail).

And for virtual EPC pages, I guess the "young" logic can be applied too, so it
is probably better to keep the actual virtual EPC pages on a (separate?) list
instead of keeping the virtual EPC instance:

struct sgx_epc_lru {
	struct list_head reclaimable;
	struct sgx_encl *enclaves;
	struct list_head vepc_pages;
};

Or still track VA/SECS and virtual EPC pages in a single unreclaimable list?
I don't know :-)
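Either way the OOM code would need a way to "kill" a virtual EPC instance.
Something like below, again only as a sketch -- SGX_EPC_OWNER_VEPC,
victim->vepc, sgx_oom_vepc() and the "oom_killed" flag are all made-up names,
not in this patch:

	/* In sgx_epc_oom(): a third owner type for virtual EPC pages. */
	if (victim->flags & SGX_EPC_OWNER_VEPC)
		return sgx_oom_vepc(victim->vepc);

/* "Kill" the virtual EPC instance owning the victim page. */
static bool sgx_oom_vepc(struct sgx_vepc *vepc)
{
	/* Make further sgx_vepc_fault() calls return SIGBUS (early check above). */
	WRITE_ONCE(vepc->oom_killed, true);

	/*
	 * Then zap the EPC pages already handed out to this instance, e.g. by
	 * reusing the SGX_IOC_VEPC_REMOVE_ALL zapping logic.
	 */
	return true;
}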