On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> SGX Enclave Page Cache (EPC) memory allocations are separate from normal RAM allocations, and
> are managed solely by the SGX subsystem. The existing cgroup memory controller cannot be used
> to limit or account for SGX EPC memory, which is a desirable feature in some environments,
> e.g., support for pod level control in a Kubernetes cluster on a VM or baremetal host [1,2].
>
> This patchset implements the support for sgx_epc memory within the misc cgroup controller. The
> user can use the misc cgroup controller to set and enforce a max limit on total EPC usage per
> cgroup. The implementation reports current usage and events of reaching the limit per cgroup as
> well as the total system capacity.
>
> With the EPC misc controller enabled, every EPC page allocation is accounted for a cgroup's
> usage, reflected in the 'sgx_epc' entry in the 'misc.current' interface file of the cgroup.
> Much like normal system memory, EPC memory can be overcommitted via virtual memory techniques
> and pages can be swapped out of the EPC to their backing store (normal system memory allocated
> via shmem, accounted by the memory controller). When the EPC usage of a cgroup reaches its hard
> limit ('sgx_epc' entry in the 'misc.max' file), the cgroup starts a reclamation process to swap
> out some EPC pages within the same cgroup and its descendants to their backing store. Although
> the SGX architecture supports swapping for all pages, to avoid extra complexities, this
> implementation does not support swapping for certain page types, e.g. Version Array (VA) pages,
> and treats them as unreclaimable pages. When the limit is reached but nothing is left in the
> cgroup for reclamation, i.e., only unreclaimable pages are left, any new EPC allocation in the
> cgroup will result in an ENOMEM error.
>
> The EPC pages allocated for guest VMs by the virtual EPC driver are not reclaimable by the host
> kernel [5]. Therefore they are also treated as unreclaimable from the cgroup's point of view. And
> the virtual EPC driver translates an ENOMEM error resulting from an EPC allocation request into
> a SIGBUS to the user process.
>
> This work was originally authored by Sean Christopherson a few years ago, and previously
> modified by Kristen C. Accardi to utilize the misc cgroup controller rather than a custom
> controller. I have been updating the patches based on review comments since V2 [3, 4, 10],
> simplified the implementation/design and fixed some stability issues found from testing.
>
> The patches are organized as follows:
> - Patches 1-3 are prerequisite misc cgroup changes for adding new APIs, structs, resource
>   types.
> - Patch 4 implements basic misc controller for EPC without reclamation.
> - Patches 5-9 prepare for per-cgroup reclamation.
>     * Separate out the existing infrastructure of tracking reclaimable pages
>       from the global reclaimer (ksgxd) to a newly created LRU list struct.
>     * Separate out reusable top-level functions for reclamation.
> - Patch 10 adds support for per-cgroup reclamation.
> - Patch 11 adds documentation for the EPC cgroup.
> - Patch 12 adds test scripts.
>
> I appreciate your review and providing tags if appropriate.
>
> ---
> V6:
> - Dropped OOM killing path, only implement non-preemptive enforcement of max limit (Dave, Michal)
> - Simplified reclamation flow by taking out sgx_epc_reclaim_control, forced reclamation by
>   ignoring 'age'.
> - Restructured patches: split misc API + resource types patch and the big EPC cgroup patch
>   (Kai, Michal)
> - Dropped some Tested-by/Reviewed-by tags due to significant changes
> - Added more selftests
>
> v5:
> - Replace the manual test script with a selftest script.
> - Restore the "From" tag for some patches to Sean (Kai)
> - Style fixes (Jarkko)
>
> v4:
> - Collected "Tested-by" from Mikko. I kept it for now as no functional changes in v4.
> - Rebased on to v6.6_rc1 and reordered patches as described above.
> - Separated out the bug fixes [7,8,9]. This series depends on those patches. (Dave, Jarkko)
> - Added comments in commit message to give more preview what's to come next. (Jarkko)
> - Fixed some documentation errors, gaps, style (Mikko, Randy)
> - Fixed some comments, typos, style in code (Mikko, Kai)
> - Patch format and background for reclaimable vs unreclaimable (Kai, Jarkko)
> - Fixed typo (Pavel)
> - Exclude the previous fixes/enhancements for self-tests. Patch 18 now depends on series [6]
> - Use the same to list for cover and all patches. (Solo)
>
> v3:
>
> - Added EPC states to replace flags in sgx_epc_page struct. (Jarkko)
> - Unrolled wrappers for cond_resched, list (Dave)
> - Separate patches for adding reclaimable and unreclaimable lists. (Dave)
> - Other improvements on patch flow, commit messages, styles. (Dave, Jarkko)
> - Simplified the cgroup tree walking with plain
>   css_for_each_descendant_pre.
> - Fixed race conditions and crashes.
> - OOM killer to wait for the victim enclave pages being reclaimed.
> - Unblock the user by handling misc_max_write callback asynchronously.
> - Rebased onto 6.4 and no longer base this series on the MCA patchset.
> - Fix an overflow in misc_try_charge.
> - Fix a NULL pointer in SGX PF handler.
> - Updated and included the SGX selftest patches previously reviewed. Those
>   patches fix issues triggered in high EPC pressure required for cgroup
>   testing.
> - Added test scripts to help set up and test SGX EPC cgroups.
>
> [1]https://lore.kernel.org/all/DM6PR21MB11772A6ED915825854B419D6C4989@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> [2]https://lore.kernel.org/all/ZD7Iutppjj+muH4p@himmelriiki/
> [3]https://lore.kernel.org/all/20221202183655.3767674-1-kristen@xxxxxxxxxxxxxxx/
> [4]https://lore.kernel.org/linux-sgx/20230712230202.47929-1-haitao.huang@xxxxxxxxxxxxxxx/
> [5]Documentation/arch/x86/sgx.rst, Section "Virtual EPC"
> [6]https://lore.kernel.org/linux-sgx/20220905020411.17290-1-jarkko@xxxxxxxxxx/
> [7]https://lore.kernel.org/linux-sgx/ZLcXmvDKheCRYOjG@xxxxxxxxxxxxxxx/
> [8]https://lore.kernel.org/linux-sgx/20230721120231.13916-1-haitao.huang@xxxxxxxxxxxxxxx/
> [9]https://lore.kernel.org/linux-sgx/20230728051024.33063-1-haitao.huang@xxxxxxxxxxxxxxx/
> [10]https://lore.kernel.org/all/20230923030657.16148-1-haitao.huang@xxxxxxxxxxxxxxx/
>
> Haitao Huang (2):
>   x86/sgx: Introduce EPC page states
>   selftests/sgx: Add scripts for EPC cgroup testing
>
> Kristen Carlson Accardi (5):
>   cgroup/misc: Add per resource callbacks for CSS events
>   cgroup/misc: Export APIs for SGX driver
>   cgroup/misc: Add SGX EPC resource type
>   x86/sgx: Implement basic EPC misc cgroup functionality
>   x86/sgx: Implement EPC reclamation for cgroup
>
> Sean Christopherson (5):
>   x86/sgx: Add sgx_epc_lru_list to encapsulate LRU list
>   x86/sgx: Use sgx_epc_lru_list for existing active page list
>   x86/sgx: Use a list to track to-be-reclaimed pages
>   x86/sgx: Restructure top-level EPC reclaim function
>   Docs/x86/sgx: Add description for cgroup support
>
>  Documentation/arch/x86/sgx.rst            |  74 ++++
>  arch/x86/Kconfig                          |  13 +
>  arch/x86/kernel/cpu/sgx/Makefile          |   1 +
>  arch/x86/kernel/cpu/sgx/encl.c            |   2 +-
>  arch/x86/kernel/cpu/sgx/epc_cgroup.c      | 319 ++++++++++++++++++
>  arch/x86/kernel/cpu/sgx/epc_cgroup.h      |  49 +++
>  arch/x86/kernel/cpu/sgx/main.c            | 245 +++++++++-----
>  arch/x86/kernel/cpu/sgx/sgx.h             |  88 ++++-
>  include/linux/misc_cgroup.h               |  42 +++
>  kernel/cgroup/misc.c                      |  52 ++-
>  .../selftests/sgx/run_epc_cg_selftests.sh | 196 +++++++++++
>  .../selftests/sgx/watch_misc_for_tests.sh |  13 +
>  12 files changed, 996 insertions(+), 98 deletions(-)
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.c
>  create mode 100644 arch/x86/kernel/cpu/sgx/epc_cgroup.h
>  create mode 100755 tools/testing/selftests/sgx/run_epc_cg_selftests.sh
>  create mode 100755 tools/testing/selftests/sgx/watch_misc_for_tests.sh
>

Is this expected to work on NUC7?

Planning to test this next week (no time this week).

BR, Jarkko
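
For anyone testing along, the accounting the cover letter describes (an 'sgx_epc' entry in 'misc.current' checked against the same entry in 'misc.max') could be inspected with something like the sketch below. It is not taken from the series. The helper names parse_misc_file and epc_headroom are made up for illustration, and it assumes the usual misc controller file layout of one "name value" pair per line, with the literal "max" meaning no limit. It works on the file contents as strings, so no SGX hardware or cgroup mount is needed to try it.

```python
def parse_misc_file(text):
    """Parse 'name value' lines (assumed misc.* file layout) into a dict.

    Hypothetical helper for illustration; not part of the patch series.
    """
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, value = parts
        # misc.max uses the literal string "max" for "no limit"
        entries[name] = None if value == "max" else int(value)
    return entries


def epc_headroom(current_text, max_text, resource="sgx_epc"):
    """Return the unused part of the limit, or None when no limit applies.

    A missing entry in max_text is treated the same as "max" (no limit).
    """
    current = parse_misc_file(current_text).get(resource, 0)
    limit = parse_misc_file(max_text).get(resource)
    if limit is None:
        return None
    # Clamp at zero in case usage is read while it sits above a lowered limit.
    return max(limit - current, 0)
```

In practice the two strings would come from reading the 'misc.current' and 'misc.max' files of the cgroup under test. A headroom of zero corresponds to the point at which, per the cover letter, the cgroup starts reclaiming its own EPC pages.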