> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors. Used in
> + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup is
> + * over its limit or its ancestors' hence reclamation is needed.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup *epc_cg)
> +{
> +	struct misc_cg *i = epc_cg->cg;
> +	u64 m = U64_MAX;
> +
> +	while (i) {
> +		m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> +		i = misc_cg_parent(i);
> +	}
> +
> +	return m / PAGE_SIZE;
> +}

I am not sure, but is it possible or legal for an ancestor to have a lower
limit than its children?

> +
>  /**
> - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its LRUs
> + * @root:	Root of the tree to check
>   *
> + * Return: %true if all cgroups under the specified root have empty LRU lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> +{
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	bool ret = true;
> +
> +	/*
> +	 * Caller ensure css_root ref acquired
> +	 */
> +	css_root = &root->css;
> +
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> +		spin_lock(&epc_cg->lru.lock);
> +		ret = list_empty(&epc_cg->lru.reclaimable);
> +		spin_unlock(&epc_cg->lru.lock);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!ret)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +
> +	return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to reclaim pages
> + * @root:	Root of the tree to start walking from.
> + * Return:	Number of pages reclaimed.

Just wondering, do you need to return @cnt given this function is called w/o
checking the return value?

> + */
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> +	/*
> +	 * Attempting to reclaim only a few pages will often fail and is
> +	 * inefficient, while reclaiming a huge number of pages can result in
> +	 * soft lockups due to holding various locks for an extended duration.
> +	 */

Not sure we need this comment, given it's already implied in
sgx_reclaim_pages(). You cannot pass a value > SGX_NR_TO_SCAN anyway.

> +	unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> +	struct cgroup_subsys_state *css_root;
> +	struct cgroup_subsys_state *pos;
> +	struct sgx_epc_cgroup *epc_cg;
> +	unsigned int cnt;
> +
> +	/* Caller ensure css_root ref acquired */
> +	css_root = &root->css;
> +
> +	cnt = 0;
> +	rcu_read_lock();
> +	css_for_each_descendant_pre(pos, css_root) {
> +		if (!css_tryget(pos))
> +			break;
> +		rcu_read_unlock();
> +
> +		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +		cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
> +
> +		rcu_read_lock();
> +		css_put(pos);
> +		if (!nr_to_scan)
> +			break;
> +	}
> +
> +	rcu_read_unlock();
> +	return cnt;
> +}

Here @nr_to_scan is reduced by the number of pages that are isolated, not by
the number actually reclaimed (which is what @cnt reflects).

IIUC, it looks like you want this function to do one "cycle" in the sense you
described in v8 [1]:

	I tested with that approach and found we can only target number of
	pages attempted to reclaim not pages actually reclaimed due to the
	uncertainty of how long it takes to reclaim pages. Besides targeting
	number of scanned pages for each cycle is also what the ksgxd does.

	If we target actual number of pages, sometimes it just takes too long.
	I saw more timeouts with the default time limit when running parallel
	selftests.

I am not sure what "sometimes it just takes too long" means, but my impression
is that you are aiming for perfect yet complicated code here.

For instance, I don't think the selftests reflect a real workload, and I
believe adjusting the limit of a given EPC cgroup shouldn't be a frequent
operation, so it is acceptable to use code that is easier to maintain even if
it is less perfect.

Here I still think having @nr_to_scan as a pointer is over-complicated.
For example, we can still let sgx_reclaim_pages() always scan SGX_NR_TO_SCAN
pages, but give up when enough pages have been reclaimed or when the EPC
cgroup and all its descendants have been walked:

	unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
	{
		unsigned int cnt = 0;
		...

		css_for_each_descendant_pre(pos, css_root) {
			...
			epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
			cnt += sgx_reclaim_pages(&epc_cg->lru);

			if (cnt >= SGX_NR_TO_SCAN)
				break;
		}

		...
		return cnt;
	}

Yeah, it may reclaim more than SGX_NR_TO_SCAN when the loop actually reaches
any descendants, but that should be rare and we don't care that much, do we?

But I'll leave it to the maintainers to judge.

[1] https://lore.kernel.org/linux-kernel/CZ3CM9ZE39Q0.222HRSEUF8RFP@xxxxxxxxxx/T/#md7b062b43d249218369f921682dfa7f975735dd1

> +
> +/*
> + * Scheduled by sgx_epc_cgroup_try_charge() to reclaim pages from the cgroup
> + * when the cgroup is at/near its maximum capacity
> + */

I don't see this being "scheduled by sgx_epc_cgroup_try_charge()" here. Would
it make more sense to move that code change into this patch for easier review?

> +static void sgx_epc_cgroup_reclaim_work_func(struct work_struct *work)
> +{
> +	struct sgx_epc_cgroup *epc_cg;
> +	u64 cur, max;
> +
> +	epc_cg = container_of(work, struct sgx_epc_cgroup, reclaim_work);
> +
> +	for (;;) {
> +		max = sgx_epc_cgroup_max_pages_to_root(epc_cg);
> +
> +		/*
> +		 * Adjust the limit down by one page, the goal is to free up
> +		 * pages for fault allocations, not to simply obey the limit.
> +		 * Conditionally decrementing max also means the cur vs. max
> +		 * check will correctly handle the case where both are zero.
> +		 */
> +		if (max)
> +			max--;

With the "max -= SGX_NR_TO_SCAN / 2" stuff below, do you still need this one?

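To make the question concrete, here is a small userspace sketch of the two
adjustments applied back to back (the second one is quoted just below). It is
only an illustration under assumptions: reclaim_threshold() is a made-up name,
and SGX_NR_TO_SCAN is assumed to be 16.

```c
#include <stdint.h>

/* Assumption: same value as SGX_NR_TO_SCAN in the SGX driver. */
#define SGX_NR_TO_SCAN 16

/*
 * Hypothetical helper: the usage (in pages) at or below which the work
 * function in the quoted patch would stop reclaiming, given the minimum
 * limit (in pages) on the path to the root.
 */
static uint64_t reclaim_threshold(uint64_t max)
{
	/* First adjustment: one page of headroom, guarding max == 0. */
	if (max)
		max--;

	/* Second adjustment: skipped when the limit is "extremely low". */
	if (max > SGX_NR_TO_SCAN * 2)
		max -= SGX_NR_TO_SCAN / 2;

	return max;
}
```

Under these assumptions a limit of 100 pages gives a threshold of 91, while
any limit of 33 pages or fewer only ever gets the one-page decrement. So if I
read it right, the max-- only matters for small limits.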
> +
> +		/*
> +		 * Unless the limit is extremely low, in which case forcing
> +		 * reclaim will likely cause thrashing, force the cgroup to
> +		 * reclaim at least once if it's operating *near* its maximum
> +		 * limit by adjusting @max down by half the min reclaim size.

OK. But why choose "SGX_NR_TO_SCAN * 2" as "extremely low"? E.g., could we
choose SGX_NR_TO_SCAN instead?

IMHO we should at least add a comment to explain this.

And maybe you could have a dedicated macro for it, which I believe would make
the code easier to understand?

> +		 * This work func is scheduled by sgx_epc_cgroup_try_charge

This has been mentioned in the function comment already.

> +		 * when it cannot directly reclaim due to being in an atomic
> +		 * context, e.g. EPC allocation in a fault handler.

Why is a fault handler an "atomic context"? Just say "when it cannot directly
reclaim".

> Waiting
> +		 * to reclaim until the cgroup is actually at its limit is less
> +		 * performant as it means the faulting task is effectively
> +		 * blocked until a worker makes its way through the global work
> +		 * queue.
> +		 */
> +		if (max > SGX_NR_TO_SCAN * 2)
> +			max -= (SGX_NR_TO_SCAN / 2);
> +
> +		cur = sgx_epc_cgroup_page_counter_read(epc_cg);
> +
> +		if (cur <= max || sgx_epc_cgroup_lru_empty(epc_cg->cg))
> +			break;
> +
> +		/* Keep reclaiming until above condition is met. */
> +		sgx_epc_cgroup_reclaim_pages(epc_cg->cg);

Also, each iteration here calls sgx_epc_cgroup_max_pages_to_root() and
sgx_epc_cgroup_lru_empty(), each of which walks the cgroup hierarchy (the
ancestors for the former, the descendants for the latter).

If we still make sgx_reclaim_pages() always scan SGX_NR_TO_SCAN pages, it
seems we could reduce the number of iterations here?

> +	}
> +}
> +
> +/**
> + * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
>   * @epc_cg:	The EPC cgroup to be charged for the page.
>   * Return:
>   * * %0 - If successfully charged.
> @@ -38,6 +209,7 @@ static void sgx_epc_cgroup_free(struct misc_cg *cg)
>  	if (!epc_cg)
>  		return;
>
> +	cancel_work_sync(&epc_cg->reclaim_work);
>  	kfree(epc_cg);
>  }
>
> @@ -50,6 +222,8 @@ const struct misc_res_ops sgx_epc_cgroup_ops = {
>
>  static void sgx_epc_misc_init(struct misc_cg *cg, struct sgx_epc_cgroup *epc_cg)
>  {
> +	sgx_lru_init(&epc_cg->lru);
> +	INIT_WORK(&epc_cg->reclaim_work, sgx_epc_cgroup_reclaim_work_func);
>  	cg->res[MISC_CG_RES_SGX_EPC].priv = epc_cg;
>  	epc_cg->cg = cg;
>  }
> @@ -69,6 +243,11 @@ static int sgx_epc_cgroup_alloc(struct misc_cg *cg)
>
>  void sgx_epc_cgroup_init(void)
>  {
> +	sgx_epc_cg_wq = alloc_workqueue("sgx_epc_cg_wq",
> +					WQ_UNBOUND | WQ_FREEZABLE,
> +					WQ_UNBOUND_MAX_ACTIVE);
> +	BUG_ON(!sgx_epc_cg_wq);

You cannot BUG_ON() simply because a workqueue could not be allocated. You can
find some way to mark the EPC cgroup as disabled and keep going. A static key
is one way, although we cannot re-enable it at runtime.

> +
>  	misc_cg_set_ops(MISC_CG_RES_SGX_EPC, &sgx_epc_cgroup_ops);
>  	sgx_epc_misc_init(misc_cg_root(), &epc_cg_root);
>  }
> diff --git a/arch/x86/kernel/cpu/sgx/epc_cgroup.h b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> index 6b664b4c321f..e3c6a08f0ee8 100644
> --- a/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> +++ b/arch/x86/kernel/cpu/sgx/epc_cgroup.h
> @@ -34,6 +34,8 @@ static inline void sgx_epc_cgroup_init(void) { }
>  #else
>  struct sgx_epc_cgroup {
>  	struct misc_cg *cg;
> +	struct sgx_epc_lru_list lru;
> +	struct work_struct reclaim_work;
>  };

So you introduced the work/workqueue here, but there is no place that actually
queues the work. IMHO you can either:

1) move the relevant code change here; or
2) focus on introducing the core functions to reclaim certain pages from a
   given EPC cgroup w/o the workqueue, and introduce the work/workqueue in a
   later patch.

Makes sense?

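Going back to the BUG_ON() above, here is a rough userspace sketch of what I
mean by "mark as disabled but keep going". Every name in it is hypothetical,
a plain bool stands in for a static key, and a void pointer stands in for the
workqueue, so please treat it as an illustration rather than a proposal for
the actual code:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a static key: set at init, never re-enabled. */
static bool epc_cg_enabled;

/* Hypothetical stand-in for the workqueue; NULL means allocation failed. */
static void *epc_cg_wq;

/* Returns true if per-cgroup EPC accounting is usable, false if disabled. */
static bool epc_cgroup_init(void *wq)
{
	epc_cg_wq = wq;
	if (!epc_cg_wq) {
		/* Allocation failed: disable the feature instead of BUG_ON(). */
		epc_cg_enabled = false;
		return false;
	}

	epc_cg_enabled = true;
	return true;
}

/* Charge path: with the cgroup disabled, charging always succeeds. */
static int epc_cgroup_try_charge(unsigned long *usage, unsigned long max)
{
	if (!epc_cg_enabled)
		return 0;	/* no limit enforcement */
	if (*usage >= max)
		return -1;	/* over the limit; stand-in for a negative errno */

	(*usage)++;
	return 0;
}
```

The point is only that a failed allocation degrades to "no per-cgroup limit"
rather than taking the whole machine down.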
>
>  static inline struct sgx_epc_cgroup *sgx_epc_cgroup_from_misc_cg(struct misc_cg *cg)
> @@ -66,6 +68,7 @@ static inline void sgx_put_epc_cg(struct sgx_epc_cgroup *epc_cg)
>
>  int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg);
>  void sgx_epc_cgroup_uncharge(struct sgx_epc_cgroup *epc_cg);
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root);

Not sure why this needs to be exposed. Perhaps you should make this change
when needed.

>  void sgx_epc_cgroup_init(void);
>
>  #endif
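
For reference, here is a userspace model of the simpler reclaim loop I
suggested earlier for sgx_epc_cgroup_reclaim_pages(). It is a sketch under
assumptions: the fake_* names are made up, the tree is flattened into a
pre-order array, SGX_NR_TO_SCAN is assumed to be 16, and every scanned page
is assumed to be reclaimed, which the real sgx_reclaim_pages() does not
guarantee.

```c
#include <stddef.h>

/* Assumption: same value as SGX_NR_TO_SCAN in the SGX driver. */
#define SGX_NR_TO_SCAN 16

/* Hypothetical model of one cgroup's LRU: a count of reclaimable pages. */
struct fake_lru {
	unsigned int reclaimable;
};

/* Stand-in for sgx_reclaim_pages(): handles at most SGX_NR_TO_SCAN pages. */
static unsigned int fake_reclaim_pages(struct fake_lru *lru)
{
	unsigned int n = lru->reclaimable < SGX_NR_TO_SCAN ?
			 lru->reclaimable : SGX_NR_TO_SCAN;

	lru->reclaimable -= n;
	return n;
}

/* Walk the flattened tree; stop once enough pages have been reclaimed. */
static unsigned int fake_cgroup_reclaim_pages(struct fake_lru *tree, size_t nr)
{
	unsigned int cnt = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		cnt += fake_reclaim_pages(&tree[i]);
		if (cnt >= SGX_NR_TO_SCAN)
			break;
	}

	return cnt;
}
```

With LRUs holding 10, 10 and 50 pages this returns 20 and never touches the
third cgroup, so the overshoot past SGX_NR_TO_SCAN is bounded by one extra
call.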