On Fri, Jan 14, 2022 at 11:25:47PM +0200, Jarkko Sakkinen wrote: > On Wed, Jan 12, 2022 at 10:08:02PM -0800, Reinette Chatre wrote: > > Hi Jarkko, > > > > On 1/8/2022 6:05 AM, Jarkko Sakkinen wrote: > > > There is a limited amount of SGX memory (EPC) on each system. When that > > > memory is used up, SGX has its own swapping mechanism which is similar > > > in concept but totally separate from the core mm/* code. Instead of > > > swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM > > > comes from a shared memory pseudo-file and can itself be swapped by the > > > core mm code. There is a hierarchy like this: > > > > > > EPC <-> shmem <-> disk > > > > > > After data is swapped back in from shmem to EPC, the shmem backing > > > storage needs to be freed. Currently, the backing shmem is not freed. > > > This effectively wastes the shmem while the enclave is running. The > > > memory is recovered when the enclave is destroyed and the backing > > > storage freed. > > > > > > Sort this out by freeing memory with shmem_truncate_range(), as soon as > > > a page is faulted back to the EPC. In addition, free the memory for > > > PCMD pages as soon as all PCMD's in a page have been marked as unused > > > by zeroing its contents. > > > > > > Reported-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> > > > Cc: stable@xxxxxxxxxxxxxxx > > > Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") > > > Signed-off-by: Jarkko Sakkinen <jarkko@xxxxxxxxxx> > > > --- > > > v3: > > > * Resend. > > > v2: > > > * Rewrite commit message as proposed by Dave. > > > * Truncate PCMD pages (Dave). > > > --- > > > arch/x86/kernel/cpu/sgx/encl.c | 48 +++++++++++++++++++++++++++++++--- > > > 1 file changed, 44 insertions(+), 4 deletions(-) > > > > > > diff --git a/arch/x86/kernel/cpu/sgx/encl.c b/arch/x86/kernel/cpu/sgx/encl.c > > > index 001808e3901c..ea43c10e5458 100644 > > > --- a/arch/x86/kernel/cpu/sgx/encl.c > > > +++ b/arch/x86/kernel/cpu/sgx/encl.c > > > @@ -12,6 +12,27 @@ > > > #include "encls.h" > > > #include "sgx.h" > > > > > > + > > > +/* > > > + * Get the page number of the page in the backing storage, which stores the PCMD > > > + * of the enclave page in the given page index. PCMD pages are located after > > > + * the backing storage for the visible enclave pages and SECS. > > > + */ > > > +static inline pgoff_t sgx_encl_get_backing_pcmd_nr(struct sgx_encl *encl, pgoff_t index) > > > +{ > > > + return PFN_DOWN(encl->size) + 1 + (index / sizeof(struct sgx_pcmd)); > > > +} > > > + > > > +/* > > > + * Free a page from the backing storage in the given page index. > > > + */ > > > +static inline void sgx_encl_truncate_backing_page(struct sgx_encl *encl, pgoff_t index) > > > +{ > > > + struct inode *inode = file_inode(encl->backing); > > > + > > > + shmem_truncate_range(inode, PFN_PHYS(index), PFN_PHYS(index) + PAGE_SIZE - 1); > > > +} > > > + > > > /* > > > * ELDU: Load an EPC page as unblocked. For more info, see "OS Management of EPC > > > * Pages" in the SDM. > > > @@ -24,7 +45,10 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, > > > struct sgx_encl *encl = encl_page->encl; > > > struct sgx_pageinfo pginfo; > > > struct sgx_backing b; > > > + bool pcmd_page_empty; > > > pgoff_t page_index; > > > + pgoff_t pcmd_index; > > > + u8 *pcmd_page; > > > int ret; > > > > > > if (secs_page) > > > @@ -38,8 +62,8 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, > > > > > > pginfo.addr = encl_page->desc & PAGE_MASK; > > > pginfo.contents = (unsigned long)kmap_atomic(b.contents); > > > - pginfo.metadata = (unsigned long)kmap_atomic(b.pcmd) + > > > - b.pcmd_offset; > > > + pcmd_page = kmap_atomic(b.pcmd); > > > + pginfo.metadata = (unsigned long)pcmd_page + b.pcmd_offset; > > > > > > if (secs_page) > > > pginfo.secs = (u64)sgx_get_epc_virt_addr(secs_page); > > > @@ -55,11 +79,27 @@ static int __sgx_encl_eldu(struct sgx_encl_page *encl_page, > > > ret = -EFAULT; > > > } > > > > > > - kunmap_atomic((void *)(unsigned long)(pginfo.metadata - b.pcmd_offset)); > > > + memset(pcmd_page + b.pcmd_offset, 0, sizeof(struct sgx_pcmd)); > > > + > > > + /* > > > + * The area for the PCMD in the page was zeroed above. Check if the > > > + * whole page is now empty meaning that all PCMD's have been zeroed: > > > + */ > > > + pcmd_page_empty = !memchr_inv(pcmd_page, 0, PAGE_SIZE); > > > + > > > + kunmap_atomic(pcmd_page); > > > kunmap_atomic((void *)(unsigned long)pginfo.contents); > > > > > > sgx_encl_put_backing(&b, false); > > > > > > + /* Free the backing memory. */ > > > + sgx_encl_truncate_backing_page(encl, page_index); > > > + > > > + if (pcmd_page_empty) { > > > + pcmd_index = sgx_encl_get_backing_pcmd_nr(encl, page_index); > > > + sgx_encl_truncate_backing_page(encl, pcmd_index); > > > + } > > > + > > > return ret; > > > } > > > > > > @@ -577,7 +617,7 @@ static struct page *sgx_encl_get_backing_page(struct sgx_encl *encl, > > > int sgx_encl_get_backing(struct sgx_encl *encl, unsigned long page_index, > > > struct sgx_backing *backing) > > > { > > > - pgoff_t pcmd_index = PFN_DOWN(encl->size) + 1 + (page_index >> 5); > > > + pgoff_t pcmd_index = sgx_encl_get_backing_pcmd_nr(encl, page_index); > > > struct page *contents; > > > struct page *pcmd; > > > > > > > I applied this patch on top of commit 2056e2989bf4 ("x86/sgx: Fix NULL pointer > > dereference on non-SGX systems") found on branch x86/sgx of the tip repo. > > > > When I run the SGX selftests the new oversubscription test case is failing with > > the error below: > > ./test_sgx > > TAP version 13 > > 1..6 > > # Starting 6 tests from 2 test cases. > > # RUN enclave.unclobbered_vdso ... > > # OK enclave.unclobbered_vdso > > ok 1 enclave.unclobbered_vdso > > # RUN enclave.unclobbered_vdso_oversubscribed ... > > # main.c:330:unclobbered_vdso_oversubscribed:Expected (&self->run)->function (2) == EEXIT (4) > > # main.c:330:unclobbered_vdso_oversubscribed:0x0e 0x06 0x00007f6000000fff > > # main.c:338:unclobbered_vdso_oversubscribed:Expected get_op.value (0) == MAGIC (1234605616436508552) > > # main.c:339:unclobbered_vdso_oversubscribed:Expected (&self->run)->function (2) == EEXIT (4) > > # main.c:339:unclobbered_vdso_oversubscribed:0x0e 0x06 0x00007f6000000fff > > # unclobbered_vdso_oversubscribed: Test failed at step #2 > > # FAIL enclave.unclobbered_vdso_oversubscribed > > not ok 2 enclave.unclobbered_vdso_oversubscribed > > # RUN enclave.clobbered_vdso ... > > # OK enclave.clobbered_vdso > > ok 3 enclave.clobbered_vdso > > # RUN enclave.clobbered_vdso_and_user_function ... > > # OK enclave.clobbered_vdso_and_user_function > > ok 4 enclave.clobbered_vdso_and_user_function > > # RUN enclave.tcs_entry ... > > # OK enclave.tcs_entry > > ok 5 enclave.tcs_entry > > # RUN enclave.pte_permissions ... > > # OK enclave.pte_permissions > > > > The kernel logs also contain a splat that I have not encountered before: > > > > ------------[ cut here ]------------ > > ELDU returned 9 (0x9) > > WARNING: CPU: 6 PID: 2470 at arch/x86/kernel/cpu/sgx/encl.c:77 sgx_encl_eldu+0x37c/0x3f0 > > Modules linked in: intel_rapl_msr intel_rapl_common i10nm_edac x86_pkg_temp_thermal ipmi_ssif coretemp kvm_intel kvm cmdlinepart intel_spi_pci intel_spi spi_nor ipmi_si mei_me ipmi_devintf input_leds irqbypass mtd mei ioatdma intel_pch_thermal wmi ipmi_msghandler acpi_power_meter iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 raid0 multipath linear ixgbe crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel crypto_simd xfrm_algo usbhid cryptd ast dca hid mdio drm_vram_helper drm_ttm_helper > > CPU: 6 PID: 2470 Comm: test_sgx Not tainted 5.16.0-rc1+ #24 > > Hardware name: Intel Corporation > > RIP: 0010:sgx_encl_eldu+0x37c/0x3f0 > > Code: 89 c2 48 c7 c6 e1 e9 3e 9b 48 c7 c7 e6 e9 3e 9b 44 89 95 54 ff ff ff 4c 89 85 58 ff ff ff c6 05 fc bd dd 01 01 e8 54 88 03 00 <0f> 0b 44 8b 95 54 ff ff ff 4c 8b 85 58 ff ff ff e9 46 fe ff ff 48 > > <snip> > > Call Trace: > > <TASK> > > sgx_encl_load_page+0x82/0xc0 > > ? sgx_encl_load_page+0x82/0xc0 > > sgx_vma_fault+0x40/0xe0 > > __do_fault+0x32/0x110 > > __handle_mm_fault+0xf84/0x1510 > > handle_mm_fault+0x13e/0x3f0 > > do_user_addr_fault+0x210/0x660 > > ? rcu_read_lock_sched_held+0x4f/0x80 > > exc_page_fault+0x7b/0x270 > > ? asm_exc_page_fault+0x8/0x30 > > asm_exc_page_fault+0x1e/0x30 > > RIP: 0033:0x7ffe7fdc3dba > > <snip> > > > > I ran the test on two systems and in both cases the test failed accompanied by > > the kernel splat. > > > > Reinette > > Thank you for testing this. > > I did not get any errors when I run kselftest at the time *but* it was > exactly two months ago (2021-11-11). I cannot recall whether this test > was already in at the time, or did I run the overcommit test out-of-tree, > or if some confliciting non-kselftest changes have been applied. > > I'll do the backtracking when I have the time by doing git bisect between > 2021-11-11 x86/sgx and the current one. Yep, I think I get the exact same result with tip/x86/sgx. $ ./test_sgx TAP version 13 1..6 # Starting 6 tests from 2 test cases. # RUN enclave.unclobbered_vdso ... # OK enclave.unclobbered_vdso ok 1 enclave.unclobbered_vdso # RUN enclave.unclobbered_vdso_oversubscribed ... # main.c:330:unclobbered_vdso_oversubscribed:Expected (&self->run)->function (2) == EEXIT (4) # main.c:330:unclobbered_vdso_oversubscribed:0x0e 0x06 0x00007f3160000fff # main.c:338:unclobbered_vdso_oversubscribed:Expected get_op.value (0) == MAGIC (1234605616436508552) # main.c:339:unclobbered_vdso_oversubscribed:Expected (&self->run)->function (2) == EEXIT (4) # main.c:339:unclobbered_vdso_oversubscribed:0x0e 0x06 0x00007f3160000fff # unclobbered_vdso_oversubscribed: Test failed at step #2 # FAIL enclave.unclobbered_vdso_oversubscribed not ok 2 enclave.unclobbered_vdso_oversubscribed # RUN enclave.clobbered_vdso ... # OK enclave.clobbered_vdso ok 3 enclave.clobbered_vdso # RUN enclave.clobbered_vdso_and_user_function ... # OK enclave.clobbered_vdso_and_user_function ok 4 enclave.clobbered_vdso_and_user_function # RUN enclave.tcs_entry ... # OK enclave.tcs_entry ok 5 enclave.tcs_entry # RUN enclave.pte_permissions ... # OK enclave.pte_permissions ok 6 enclave.pte_permissions # FAILED: 5 / 6 tests passed. # Totals: pass:5 fail:1 xfail:0 xpass:0 skip:0 error:0 And dmesg output: [ 4267.158920] ------------[ cut here ]------------ [ 4267.158923] ELDU returned 9 (0x9) [ 4267.158936] WARNING: CPU: 1 PID: 1343 at arch/x86/kernel/cpu/sgx/encl.c:77 sgx_encl_eldu+0x3f7/0x420 [ 4267.158945] Modules linked in: cfg80211 rfkill ccm algif_aead des_generic libdes ecb algif_skcipher cmac md4 algif_hash af_alg intel_rapl_msr intel_rapl_common kvm_intel kvm snd_hda_codec_generic ledtrig_audio irqbypass snd_hda_intel rapl snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec iTCO_wdt intel_pmc_bxt iTCO_vendor_support snd_hda_core psmouse vfat snd_hwdep fat snd_pcm intel_agp i2c_i801 intel_gtt pcspkr joydev i2c_smbus snd_timer agpgart mousedev ext4 snd lpc_ich soundcore mac_hid qemu_fw_cfg crc16 mbcache jbd2 pkcs8_key_parser fuse ip_tables x_tables btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq dm_crypt cbc encrypted_keys trusted asn1_encoder tee usbhid dm_mod virtio_gpu virtio_dma_buf drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops virtio_rng cec virtio_balloon virtio_blk virtio_console drm virtio_net net_failover failover crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd tpm_crb tpm_tis serio_raw [ 4267.159078] tpm_tis_core xhci_pci xhci_pci_renesas tpm virtio_pci virtio_pci_legacy_dev virtio_pci_modern_dev rng_core [ 4267.159089] CPU: 1 PID: 1343 Comm: test_sgx Not tainted 5.16.0-rc1-1-sgx-g142746670045 #1 [ 4267.159092] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 [ 4267.159095] RIP: 0010:sgx_encl_eldu+0x3f7/0x420 [ 4267.159098] Code: ff ff ff 89 c1 89 c2 48 c7 c6 03 06 d6 9c 4c 89 0c 24 48 c7 c7 08 06 d6 9c 44 89 54 24 08 c6 05 5f 45 b6 01 01 e8 d2 78 a6 00 <0f> 0b 4c 8b 0c 24 44 8b 54 24 08 e9 36 fe ff ff e8 5b 8b fa ff e9 [ 4267.159099] RSP: 0000:ffffb5314074bcc0 EFLAGS: 00010282 [ 4267.159100] RAX: 0000000000000000 RBX: ffffb5314074bce0 RCX: 0000000000000027 [ 4267.159101] RDX: ffff8bd9f7d20728 RSI: 0000000000000001 RDI: ffff8bd9f7d20720 [ 4267.159102] RBP: ffffb5314074bd70 R08: 0000000000000000 R09: ffffb5314074baf0 [ 4267.159103] R10: ffffb5314074bae8 R11: ffffffff9d4cbd28 R12: ffff8bd9a3ca18c0 [ 4267.159104] R13: 0000000000000000 R14: ffff8bd882139000 R15: ffffb531403a9420 [ 4267.159105] FS: 00007f3163e48c00(0000) GS:ffff8bd9f7d00000(0000) knlGS:0000000000000000 [ 4267.159106] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4267.159107] CR2: 00007f3160000fff CR3: 00000001a2046001 CR4: 0000000000370ee0 [ 4267.159111] Call Trace: [ 4267.159116] <TASK> [ 4267.159124] sgx_encl_load_page+0x73/0xb0 [ 4267.159126] sgx_vma_fault+0x3a/0xd0 [ 4267.159127] __do_fault+0x36/0xd0 [ 4267.159132] __handle_mm_fault+0xd4e/0x1540 [ 4267.159135] handle_mm_fault+0xb2/0x280 [ 4267.159137] do_user_addr_fault+0x1ba/0x690 [ 4267.159140] exc_page_fault+0x72/0x170 [ 4267.159144] ? asm_exc_page_fault+0x8/0x30 [ 4267.159147] asm_exc_page_fault+0x1e/0x30 [ 4267.159150] RIP: 0033:0x7ffe771c4c8a [ 4267.159153] Code: 43 48 8b 4d 10 48 c7 c3 28 00 00 00 48 83 3c 19 00 75 31 48 83 c3 08 48 81 fb 00 01 00 00 75 ec 48 8b 19 48 8d 0d 00 00 00 00 <0f> 01 d7 48 8b 5d 10 c7 43 08 04 00 00 00 48 83 7b 18 00 75 21 31 [ 4267.159154] RSP: 002b:00007ffe7719ce88 EFLAGS: 00010246 [ 4267.159155] RAX: 0000000000000002 RBX: 00007f3160000000 RCX: 00007ffe771c4c8a [ 4267.159156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00007ffe7719d020 [ 4267.159157] RBP: 00007ffe7719ce90 R08: 0000000000000000 R09: 0000000000000000 [ 4267.159158] R10: 00007f3163e86bb0 R11: 00007f3163fcd9f0 R12: 000055c4ac58a410 [ 4267.159158] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 [ 4267.159162] </TASK> [ 4267.159163] ---[ end trace 1b44544248db3939 ]--- [ 5395.692320] audit: type=1100 audit(1642407877.716:143): pid=1397 uid=1000 auid=1000 ses=3 msg='op=PAM:authentication grantors=? acct="jarkko" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=failed' [ 5400.216126] audit: type=1100 audit(1642407882.243:144): pid=1397 uid=1000 auid=1000 ses=3 msg='op=PAM:authentication grantors=pam_faillock,pam_permit,pam_faillock acct="jarkko" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success' [ 5400.216289] audit: type=1101 audit(1642407882.243:145): pid=1397 uid=1000 auid=1000 ses=3 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="jarkko" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success' [ 5400.217010] audit: type=1110 audit(1642407882.243:146): pid=1397 uid=1000 auid=1000 ses=3 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success' [ 5400.219633] audit: type=1105 audit(1642407882.246:147): pid=1397 uid=1000 auid=1000 ses=3 msg='op=PAM:session_open grantors=pam_systemd_home,pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/0 res=success' This was run in QEMU. /Jarkko