Re: [PATCH v9 1/8] mm: Introduce memfd_restricted system call to create restricted user memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Oct 25, 2022 at 11:13:37PM +0800, Chao Peng wrote:
> From: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
> 

<snip>

> +static struct file *restrictedmem_file_create(struct file *memfd)
> +{
> +	struct restrictedmem_data *data;
> +	struct address_space *mapping;
> +	struct inode *inode;
> +	struct file *file;
> +
> +	data = kzalloc(sizeof(*data), GFP_KERNEL);
> +	if (!data)
> +		return ERR_PTR(-ENOMEM);
> +
> +	data->memfd = memfd;
> +	mutex_init(&data->lock);
> +	INIT_LIST_HEAD(&data->notifiers);
> +
> +	inode = alloc_anon_inode(restrictedmem_mnt->mnt_sb);
> +	if (IS_ERR(inode)) {
> +		kfree(data);
> +		return ERR_CAST(inode);
> +	}
> +
> +	inode->i_mode |= S_IFREG;
> +	inode->i_op = &restrictedmem_iops;
> +	inode->i_mapping->private_data = data;
> +
> +	file = alloc_file_pseudo(inode, restrictedmem_mnt,
> +				 "restrictedmem", O_RDWR,
> +				 &restrictedmem_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		kfree(data);
> +		return ERR_CAST(file);
> +	}
> +
> +	file->f_flags |= O_LARGEFILE;
> +
> +	mapping = memfd->f_mapping;
> +	mapping_set_unevictable(mapping);
> +	mapping_set_gfp_mask(mapping,
> +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);

Is this supposed to prevent migration of pages being used for
restrictedmem/shmem backend?

In my case I've been testing SNP support based on UPM v9, and for
large guests (128GB+), if I force 2M THPs via:

  echo always >/sys/kernel/mm/transparent_hugepages/shmem_enabled

it will in some cases trigger the below trace, which suggests that
kcompactd is trying to call migrate_folio() on a PFN that was/is
still allocated for guest private memory (and so has been removed from
directmap as part of shared->private conversation via REG_REGION kvm
ioctl, leading to the crash). This trace seems to occur during early
OVMF boot while the guest is in the middle of pre-accepting on private
memory (no lazy accept in this case).

Is this expected behavior? What else needs to be done to ensure
migrations aren't attempted in this case?

Thanks!

-Mike


# Host logs with debug info for crash during SNP boot

...
[  904.373632] kvm_restricted_mem_get_pfn: GFN: 0x1caced1, PFN: 0x156b7f, page: ffffea0006b197b0, ref_count: 2
[  904.373634] kvm_restricted_mem_get_pfn: GFN: 0x1caced2, PFN: 0x156840, page: ffffea0006b09400, ref_count: 2
[  904.373637] kvm_restricted_mem_get_pfn: GFN: 0x1caced3, PFN: 0x156841, page: ffffea0006b09450, ref_count: 2
[  904.373639] kvm_restricted_mem_get_pfn: GFN: 0x1caced4, PFN: 0x156842, page: ffffea0006b094a0, ref_count: 2
[  904.373641] kvm_restricted_mem_get_pfn: GFN: 0x1caced5, PFN: 0x156843, page: ffffea0006b094f0, ref_count: 2
[  904.373645] kvm_restricted_mem_get_pfn: GFN: 0x1caced6, PFN: 0x156844, page: ffffea0006b09540, ref_count: 2
[  904.373647] kvm_restricted_mem_get_pfn: GFN: 0x1caced7, PFN: 0x156845, page: ffffea0006b09590, ref_count: 2
[  904.373649] kvm_restricted_mem_get_pfn: GFN: 0x1caced8, PFN: 0x156846, page: ffffea0006b095e0, ref_count: 2
[  904.373652] kvm_restricted_mem_get_pfn: GFN: 0x1caced9, PFN: 0x156847, page: ffffea0006b09630, ref_count: 2
[  904.373654] kvm_restricted_mem_get_pfn: GFN: 0x1caceda, PFN: 0x156848, page: ffffea0006b09680, ref_count: 2
[  904.373656] kvm_restricted_mem_get_pfn: GFN: 0x1cacedb, PFN: 0x156849, page: ffffea0006b096d0, ref_count: 2
[  904.373661] kvm_restricted_mem_get_pfn: GFN: 0x1cacedc, PFN: 0x15684a, page: ffffea0006b09720, ref_count: 2
[  904.373663] kvm_restricted_mem_get_pfn: GFN: 0x1cacedd, PFN: 0x15684b, page: ffffea0006b09770, ref_count: 2

# PFN 0x15684c is allocated for guest private memory, will have been removed from directmap as part of RMP requirements

[  904.373665] kvm_restricted_mem_get_pfn: GFN: 0x1cacede, PFN: 0x15684c, page: ffffea0006b097c0, ref_count: 2
...

# kcompactd crashes trying to copy PFN 0x15684c to a new folio, crashes trying to access PFN via directmap

[  904.470135] Migrating restricted page, SRC pfn: 0x15684c, folio_ref_count: 2, folio_order: 0
[  904.470154] BUG: unable to handle page fault for address: ffff88815684c000
[  904.470314] kvm_restricted_mem_get_pfn: GFN: 0x1cafe00, PFN: 0x19f6d0, page: ffffea00081d2100, ref_count: 2
[  904.477828] #PF: supervisor read access in kernel mode
[  904.477831] #PF: error_code(0x0000) - not-present page
[  904.477833] PGD 6601067 P4D 6601067 PUD 1569ad063 PMD 1569af063 PTE 800ffffea97b3060
[  904.508806] Oops: 0000 [#1] SMP NOPTI
[  904.512892] CPU: 52 PID: 1563 Comm: kcompactd0 Tainted: G            E      6.0.0-rc7-hsnp-v7pfdv9d+ #10
[  904.523473] Hardware name: AMD Corporation ETHANOL_X/ETHANOL_X, BIOS RXM1006B 08/20/2021
[  904.532499] RIP: 0010:copy_page+0x7/0x10
[  904.536877] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[  904.557831] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[  904.563661] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[  904.571622] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[  904.579581] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[  904.587541] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[  904.595502] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[  904.603462] FS:  0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[  904.612489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  904.618897] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[  904.626855] PKRU: 55555554
[  904.629870] Call Trace:
[  904.632594]  <TASK>
[  904.634928]  ? folio_copy+0x8c/0xe0
[  904.638818]  migrate_folio+0x5b/0x110
[  904.642901]  move_to_new_folio+0x5b/0x150
[  904.647371]  migrate_pages+0x11bb/0x1830
[  904.651743]  ? move_freelist_tail+0xc0/0xc0
[  904.656406]  ? isolate_freepages_block+0x470/0x470
[  904.661749]  compact_zone+0x681/0xda0
[  904.665832]  kcompactd_do_work+0x1b3/0x2c0
[  904.670400]  kcompactd+0x257/0x330
[  904.674190]  ? prepare_to_wait_event+0x120/0x120
[  904.679338]  ? kcompactd_do_work+0x2c0/0x2c0
[  904.684098]  kthread+0xcf/0xf0
[  904.687501]  ? kthread_complete_and_exit+0x20/0x20
[  904.692844]  ret_from_fork+0x22/0x30
[  904.696830]  </TASK>
[  904.699262] Modules linked in: nf_conntrack_netlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) xt_CHECKSUM(E) xt_MASQUERADE(E) xt_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) ip6table_mangle(E) ip6table_nat(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) nf_tables(E) nfnetlink(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) bpfilter(E) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) bridge(E) stp(E) llc(E) kvm_amd(E) overlay(E) nls_iso8859_1(E) kvm(E) crct10dif_pclmul(E) ghash_clmulni_intel(E) aesni_intel(E) crypto_simd(E) cryptd(E) rapl(E) ipmi_si(E) ipmi_devintf(E) wmi_bmof(E) ipmi_msghandler(E) efi_pstore(E) binfmt_misc(E) ast(E) drm_vram_helper(E) joydev(E) drm_ttm_helper(E) ttm(E) drm_kms_helper(E) input_leds(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ccp(E) k10temp(E) mac_hid(E) sch_fq_codel(E) parport_pc(E) ppdev(E) lp(E) parport(E) drm(E) ip_tables(E)
[  904.699316]  x_tables(E) autofs4(E) btrfs(E) blake2b_generic(E) zstd_compress(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) raid1(E) raid0(E) multipath(E) linear(E) crc32_pclmul(E) hid_generic(E) usbhid(E) hid(E) e1000e(E) i2c_piix4(E) wmi(E)
[  904.828498] CR2: ffff88815684c000
[  904.832193] ---[ end trace 0000000000000000 ]---
[  904.937159] RIP: 0010:copy_page+0x7/0x10
[  904.941524] Code: 00 66 90 48 89 f8 48 89 d1 f3 a4 31 c0 c3 cc cc cc cc 48 89 c8 c3 cc cc cc cc cc cc cc cc cc cc cc cc cc 66 90 b9 00 02 00 00 <f3> 48 a5 c3 cc cc cc cc 90 48 83 ec 10 48 89 1c 24 4c 89 64 24 08
[  904.962478] RSP: 0018:ffffc900106dfb78 EFLAGS: 00010286
[  904.968305] RAX: ffff888000000000 RBX: ffffea0006b09810 RCX: 0000000000000200
[  904.976265] RDX: ffffea0000000000 RSI: ffff88815684c000 RDI: ffff88816bc5d000
[  904.984227] RBP: ffffc900106dfba0 R08: 0000000000000001 R09: ffffea0006b097c0
[  904.992187] R10: 0000000000000002 R11: ffffc900106dfb38 R12: ffffea00071add60
[  905.000145] R13: cccccccccccccccd R14: ffffea0006b09810 R15: ffff888159c1e0f8
[  905.008105] FS:  0000000000000000(0000) GS:ffff88a04df00000(0000) knlGS:0000000000000000
[  905.017132] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  905.023540] CR2: ffff88815684c000 CR3: 00000020eae16002 CR4: 0000000000770ee0
[  905.031501] PKRU: 55555554
[  905.034558] kvm_restricted_mem_get_pfn: GFN: 0x1cafe01, PFN: 0x19f6d1, page: ffffea00081d2150, ref_count: 2
[  905.045455] kvm_restricted_mem_get_pfn: GFN: 0x1cafe02, PFN: 0x19f6d2, page: ffffea00081d21a0, ref_count: 2
...




[Index of Archives]     [Linux Kernel]     [Kernel Newbies]     [x86 Platform Driver]     [Netdev]     [Linux Wireless]     [Netfilter]     [Bugtraq]     [Linux Filesystems]     [Yosemite Discussion]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Device Mapper]

  Powered by Linux