Sean Christopherson <seanjc@xxxxxxxxxx> writes:

> On Thu, Feb 29, 2024, David Stevens wrote:
>> From: David Stevens <stevensd@xxxxxxxxxxxx>
>>
>> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
>> that is backed by struct pages that aren't currently being refcounted
>> (e.g. tail pages of non-compound higher order allocations) into the
>> guest.
>>
>> Our use case is virtio-gpu blob resources [1], which directly map host
>> graphics buffers into the guest as "vram" for the virtio-gpu device.
>> This feature currently does not work on systems using the amdgpu driver,
>> as that driver allocates non-compound higher order pages via
>> ttm_pool_alloc_page().
>>
>> First, this series replaces the gfn_to_pfn_memslot() API with a more
>> extensible kvm_follow_pfn() API. The updated API rearranges
>> gfn_to_pfn_memslot()'s args into a struct and where possible packs the
>> bool arguments into a FOLL_ flags argument. The refactoring changes do
>> not change any behavior.
>>
>> From there, this series extends the kvm_follow_pfn() API so that
>> non-refcounted pages can be safely handled. This involves adding an
>> input parameter to indicate whether the caller can safely use
>> non-refcounted pfns and an output parameter to tell the caller whether
>> or not the returned page is refcounted. This change includes a breaking
>> change, by disallowing non-refcounted pfn mappings by default, as such
>> mappings are unsafe. To allow such systems to continue to function, an
>> opt-in module parameter is added to allow the unsafe behavior.
>>
>> This series only adds support for non-refcounted pages to x86. Other
>> MMUs can likely be updated without too much difficulty, but it is not
>> needed at this point. Updating other parts of KVM (e.g. pfncache) is not
>> straightforward [2].
>
> FYI, on the off chance that someone else is eyeballing this, I am working on
> revamping this series. It's still a ways out, but I'm optimistic that we'll be
> able to address the concerns raised by Christoph and Christian, and maybe even
> get KVM out of the weeds straightaway (PPC looks thorny :-/).

I've applied this series to the latest 6.9.x while attempting to
diagnose some of the virtio-gpu problems it may or may not address.
However, launching KVM guests keeps triggering a bunch of BUGs that
eventually leave a hung guest:

12:16:54 [root@draig:~] # dmesg -c
[252080.141629] RAX: ffffffffffffffda RBX: 0000560a64915500 RCX: 00007faa23e81c5b
[252080.141629] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000017
[252080.141630] RBP: 000000000000ae80 R08: 0000000000000000 R09: 0000000000000000
[252080.141630] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[252080.141631] R13: 0000000000000001 R14: 00000000000000b2 R15: 0000000000000002
[252080.141632] </TASK>
[252080.141632] BUG: Bad page state in process CPU 0/KVM pfn:fb1665
[252080.141633] page: refcount:0 mapcount:1 mapping:0000000000000000 index:0x7fa8117c3 pfn:0xfb1665
[252080.141633] flags: 0x17ffffc00a000c(referenced|uptodate|mappedtodisk|swapbacked|node=0|zone=2|lastcpupid=0x1fffff)
[252080.141634] page_type: 0x0()
[252080.141635] raw: 0017ffffc00a000c dead000000000100 dead000000000122 0000000000000000
[252080.141635] raw: 00000007fa8117c3 0000000000000000 0000000000000000 0000000000000000
[252080.141635] page dumped because: nonzero mapcount
[252080.141636] Modules linked in: vhost_net vhost vhost_iotlb tap tun uas usb_storage veth cfg80211 nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter nft_masq wireguard libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel rfcomm snd_seq_dummy snd_hrtimer snd_seq xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc qrtr
overlay cmac algif_hash algif_skcipher af_alg bnep binfmt_misc squashfs snd_hda_codec_hdmi intel_uncore_frequency snd_ctl_led intel_uncore_frequency_common ledtrig_audio x86_pkg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_tgl snd_sof_intel_hda_common kvm_intel soundwire_intel soundwire_generic_allocation btusb snd_sof_intel_hda_mlink sd_mod soundwire_cadence btrtl snd_hda_codec_realtek kvm sg snd_sof_intel_hda btintel snd_sof_pci btbcm snd_hda_codec_generic btmtk
[252080.141656] snd_sof_xtensa_dsp crc32_pclmul bluetooth snd_hda_scodec_component ghash_clmulni_intel snd_sof sha256_ssse3 sha1_ssse3 snd_sof_utils snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress soundwire_bus sha3_generic jitterentropy_rng aesni_intel snd_hda_intel snd_intel_dspcfg crypto_simd sha512_ssse3 snd_intel_sdw_acpi cryptd sha512_generic uvcvideo snd_hda_codec snd_usb_audio videobuf2_vmalloc uvc ctr videobuf2_memops snd_hda_core snd_usbmidi_lib videobuf2_v4l2 snd_rawmidi drbg snd_hwdep dell_wmi snd_seq_device nls_ascii ahci ansi_cprng iTCO_wdt processor_thermal_device_pci videodev nls_cp437 snd_pcm intel_pmc_bxt dell_smbios libahci processor_thermal_device rapl rtsx_pci_sdmmc iTCO_vendor_support ecdh_generic mmc_core mei_hdcp watchdog libata intel_rapl_msr videobuf2_common rfkill vfat processor_thermal_wt_hint pl2303 snd_timer dcdbas dell_wmi_ddv dell_wmi_sysman processor_thermal_rfim ucsi_acpi fat intel_cstate usbserial intel_uncore cdc_acm mc battery ecc
[252080.141670] firmware_attributes_class dell_wmi_descriptor wmi_bmof dell_smm_hwmon processor_thermal_rapl pcspkr scsi_mod mei_me intel_lpss_pci snd typec_ucsi igc e1000e i2c_i801 rtsx_pci intel_rapl_common intel_lpss roles mei soundcore processor_thermal_wt_req i2c_smbus idma64 scsi_common processor_thermal_power_floor typec processor_thermal_mbox button intel_pmc_core int3403_thermal int340x_thermal_zone intel_vsec pmt_telemetry intel_hid int3400_thermal pmt_class
sparse_keymap acpi_tad acpi_pad acpi_thermal_rel msr parport_pc ppdev lp parport fuse loop efi_pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 hid_microsoft joydev ff_memless hid_generic usbhid hid btrfs blake2b_generic libcrc32c crc32c_generic xor raid6_pq evdev dm_mod i915 i2c_algo_bit drm_buddy ttm drm_display_helper xhci_pci xhci_hcd drm_kms_helper nvme nvme_core drm t10_pi usbcore video crc64_rocksoft crc64 crc_t10dif cec crct10dif_generic crct10dif_pclmul crc32c_intel rc_core usb_common crct10dif_common wmi
[252080.141686] pinctrl_alderlake
[252080.141686] CPU: 8 PID: 1819169 Comm: CPU 0/KVM Tainted: G B W 6.9.12-ajb-00008-gfcd4b7efbad0 #17
[252080.141687] Hardware name: Dell Inc. Precision 3660/0PRR48, BIOS 2.8.1 08/14/2023
[252080.141688] Call Trace:
[252080.141688] <TASK>
[252080.141688] dump_stack_lvl+0x60/0x80
[252080.141689] bad_page+0x70/0x100
[252080.141690] free_unref_page_prepare+0x22a/0x370
[252080.141692] free_unref_folios+0xe5/0x340
[252080.141693] ? __mem_cgroup_uncharge_folios+0x7a/0xa0
[252080.141694] folios_put_refs+0x147/0x1e0
[252080.141696] ? __pfx_lru_add_fn+0x10/0x10
[252080.141697] folio_batch_move_lru+0xc8/0x140
[252080.141699] folio_add_lru+0x51/0xa0
[252080.141700] do_wp_page+0x4dd/0xb60
[252080.141701] __handle_mm_fault+0xb2a/0xe30
[252080.141703] handle_mm_fault+0x18c/0x320
[252080.141704] __get_user_pages+0x164/0x6f0
[252080.141705] get_user_pages_unlocked+0xe2/0x370
[252080.141706] hva_to_pfn+0xa0/0x740 [kvm]
[252080.141724] kvm_faultin_pfn+0xf3/0x5f0 [kvm]
[252080.141750] kvm_tdp_page_fault+0x100/0x150 [kvm]
[252080.141774] kvm_mmu_page_fault+0x27e/0x7f0 [kvm]
[252080.141798] ? em_rsm+0xad/0x170 [kvm]
[252080.141823] ? writeback_registers+0x44/0x80 [kvm]
[252080.141848] ? vmx_set_cr0+0xc7/0x1320 [kvm_intel]
[252080.141853] ? x86_emulate_insn+0x484/0xe60 [kvm]
[252080.141877] ? vmx_vmexit+0x6e/0xd0 [kvm_intel]
[252080.141882] ? vmx_vmexit+0x99/0xd0 [kvm_intel]
[252080.141887] vmx_handle_exit+0x129/0x930 [kvm_intel]
[252080.141892] kvm_arch_vcpu_ioctl_run+0x682/0x15b0 [kvm]
[252080.141918] kvm_vcpu_ioctl+0x23d/0x6f0 [kvm]
[252080.141936] ? __seccomp_filter+0x32f/0x500
[252080.141937] ? kvm_io_bus_read+0x42/0xd0 [kvm]
[252080.141956] __x64_sys_ioctl+0x90/0xd0
[252080.141957] do_syscall_64+0x80/0x190
[252080.141958] ? kvm_arch_vcpu_put+0x126/0x160 [kvm]
[252080.141982] ? vcpu_put+0x1e/0x50 [kvm]
[252080.141999] ? kvm_arch_vcpu_ioctl_run+0x757/0x15b0 [kvm]
[252080.142023] ? kvm_vcpu_ioctl+0x29e/0x6f0 [kvm]
[252080.142040] ? __seccomp_filter+0x32f/0x500
[252080.142042] ? kvm_on_user_return+0x60/0x90 [kvm]
[252080.142065] ? fire_user_return_notifiers+0x30/0x60
[252080.142066] ? syscall_exit_to_user_mode+0x73/0x200
[252080.142067] ? do_syscall_64+0x8c/0x190
[252080.142068] ? kvm_on_user_return+0x60/0x90 [kvm]
[252080.142090] ? fire_user_return_notifiers+0x30/0x60
[252080.142091] ? syscall_exit_to_user_mode+0x73/0x200
[252080.142092] ? do_syscall_64+0x8c/0x190
[252080.142093] ? do_syscall_64+0x8c/0x190
[252080.142094] ? do_syscall_64+0x8c/0x190
[252080.142095] ? exc_page_fault+0x72/0x170
[252080.142096] entry_SYSCALL_64_after_hwframe+0x76/0x7e

This backtrace repeats for a large chunk of pfns

--
Alex Bennée
Virtualisation Tech Lead @ Linaro