On 8/1/24 14:41, Will Deacon wrote: > On Wed, Jul 31, 2024 at 09:18:56AM -0700, Sean Christopherson wrote: >> [...] >> Ya, the basic problem is that we have two ways of publishing the vCPU, fd and >> vcpu_array, with no way of setting both atomically. Given that xa_store() should >> never fail, I vote we do the simple thing and deliberately leak the memory. > > I'm inclined to agree. This conversation did momentarily get me worried > about the window between the successful create_vcpu_fd() and the > xa_store(), but it looks like 'kvm->online_vcpus' protects that. > > I'll spin a v2 leaking the vCPU, then. But perhaps you're right. The window you've described may be an issue. For example: static u64 get_time_ref_counter(struct kvm *kvm) { ... vcpu = kvm_get_vcpu(kvm, 0); // may still be NULL tsc = kvm_read_l1_tsc(vcpu, rdtsc()); return mul_u64_u64_shr(tsc, hv->tsc_ref.tsc_scale, 64) + hv->tsc_ref.tsc_offset; } u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc) { return vcpu->arch.l1_tsc_offset + kvm_scale_tsc(host_tsc, vcpu->arch.l1_tsc_scaling_ratio); } After stuffing msleep() between fd install and vcpu_array store: [ 125.296110] BUG: kernel NULL pointer dereference, address: 0000000000000b38 [ 125.296203] #PF: supervisor read access in kernel mode [ 125.296266] #PF: error_code(0x0000) - not-present page [ 125.296327] PGD 12539e067 P4D 12539e067 PUD 12539d067 PMD 0 [ 125.296392] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI [ 125.296454] CPU: 12 UID: 1000 PID: 1179 Comm: a.out Not tainted 6.11.0-rc1nokasan+ #19 [ 125.296521] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 125.296585] RIP: 0010:kvm_read_l1_tsc+0x6/0x50 [kvm] [ 125.297376] Call Trace: [ 125.297430] <TASK> [ 125.297919] get_time_ref_counter+0x70/0x90 [kvm] [ 125.298039] kvm_hv_get_msr_common+0xc1/0x7d0 [kvm] [ 125.298150] __kvm_get_msr+0x72/0xf0 [kvm] [ 125.298421] do_get_msr+0x16/0x50 [kvm] [ 125.298531] msr_io+0x9d/0x110 [kvm] [ 125.298626] kvm_arch_vcpu_ioctl+0xdc5/0x19c0 [kvm] [ 125.299345] kvm_vcpu_ioctl+0x6cc/0x920 [kvm] [ 125.299540] __x64_sys_ioctl+0x90/0xd0 [ 125.299582] do_syscall_64+0x93/0x180 [ 125.300206] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 125.300243] RIP: 0033:0x7f2d64aded2d So, is get_time_ref_counter() broken (with a trivial fix) or should it be considered a regression after commit afb2acb2e3a3 ("KVM: Fix vcpu_array[0] races")? Note that KASAN build, after null ptr oops, reports: [ 3528.449742] BUG: KASAN: slab-use-after-free in mutex_can_spin_on_owner+0x18b/0x1b0 [ 3528.449884] Read of size 4 at addr ffff88814a040034 by task a.out/1240 [ 3528.450135] CPU: 6 UID: 1000 PID: 1240 Comm: a.out Tainted: G D 6.11.0-rc1+ #20 [ 3528.450289] Tainted: [D]=DIE [ 3528.450412] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014 [ 3528.450551] Call Trace: [ 3528.450677] <TASK> [ 3528.450802] dump_stack_lvl+0x68/0x90 [ 3528.450940] print_report+0x174/0x4f6 [ 3528.451074] ? __virt_addr_valid+0x208/0x410 [ 3528.451205] ? mutex_can_spin_on_owner+0x18b/0x1b0 [ 3528.451337] kasan_report+0xb9/0x190 [ 3528.451469] ? mutex_can_spin_on_owner+0x18b/0x1b0 [ 3528.451606] mutex_can_spin_on_owner+0x18b/0x1b0 [ 3528.451737] __mutex_lock+0x1e3/0x1010 [ 3528.451871] ? kvm_arch_vcpu_postcreate+0x3c/0x150 [kvm] [ 3528.452321] ? __pfx___mutex_lock+0x10/0x10 [ 3528.452456] ? __pfx_lock_release+0x10/0x10 [ 3528.452642] ? __pfx___mutex_unlock_slowpath+0x10/0x10 [ 3528.452794] ? __pfx_do_raw_spin_lock+0x10/0x10 [ 3528.452928] ? kvm_arch_vcpu_postcreate+0x3c/0x150 [kvm] [ 3528.453303] kvm_arch_vcpu_postcreate+0x3c/0x150 [kvm] [ 3528.453663] kvm_vm_ioctl+0x1b73/0x21b0 [kvm] [ 3528.454025] ? mark_lock+0xe2/0x1530 [ 3528.454160] ? __pfx_kvm_vm_ioctl+0x10/0x10 [kvm] [ 3528.454543] ? __pfx_mark_lock+0x10/0x10 [ 3528.454686] ? __pfx_lock_release+0x10/0x10 [ 3528.454826] ? schedule+0xe8/0x3b0 [ 3528.454970] ? __lock_acquire+0xd68/0x5e20 [ 3528.455114] ? futex_wait_setup+0xb2/0x190 [ 3528.455252] ? __entry_text_end+0x1543/0x10260d [ 3528.455385] ? __pfx___lock_acquire+0x10/0x10 [ 3528.455542] ? __pfx_futex_wake_mark+0x10/0x10 [ 3528.455676] ? __pfx_do_vfs_ioctl+0x10/0x10 [ 3528.455826] ? find_held_lock+0x2d/0x110 [ 3528.455959] ? lock_release+0x44b/0x770 [ 3528.456090] ? __pfx_futex_wait+0x10/0x10 [ 3528.456222] ? __pfx_ioctl_has_perm.constprop.0.isra.0+0x10/0x10 [ 3528.456368] ? __fget_files+0x1d6/0x340 [ 3528.456508] __x64_sys_ioctl+0x130/0x1a0 [ 3528.456641] do_syscall_64+0x93/0x180 [ 3528.456772] ? lockdep_hardirqs_on_prepare+0x16d/0x400 [ 3528.456907] ? do_syscall_64+0x9f/0x180 [ 3528.457034] ? lockdep_hardirqs_on+0x78/0x100 [ 3528.457164] ? do_syscall_64+0x9f/0x180 [ 3528.457292] ? lock_release+0x44b/0x770 [ 3528.457425] ? __pfx_lock_release+0x10/0x10 [ 3528.457560] ? lockdep_hardirqs_on_prepare+0x16d/0x400 [ 3528.457694] ? do_syscall_64+0x9f/0x180 [ 3528.457821] ? lockdep_hardirqs_on+0x78/0x100 [ 3528.457950] ? do_syscall_64+0x9f/0x180 [ 3528.458081] ? clear_bhb_loop+0x45/0xa0 [ 3528.458210] ? clear_bhb_loop+0x45/0xa0 [ 3528.458341] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3528.458475] RIP: 0033:0x7f2457d4ed2d [ 3528.458609] Code: 04 25 28 00 00 00 48 89 45 c8 31 c0 48 8d 45 10 c7 45 b0 10 00 00 00 48 89 45 b8 48 8d 45 d0 48 89 45 c0 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 1a 48 8b 45 c8 64 48 2b 04 25 28 00 00 00 [ 3528.458783] RSP: 002b:00007ffd616de5d0 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 3528.458927] RAX: ffffffffffffffda RBX: 00007ffd616de778 RCX: 00007f2457d4ed2d [ 3528.459062] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 0000000000000004 [ 3528.459195] RBP: 00007ffd616de620 R08: 00000000004040c0 R09: 0000000000000001 [ 3528.459327] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 [ 3528.459459] R13: 0000000000000000 R14: 00007f2457e77000 R15: 0000000000403e00 [ 3528.459604] </TASK> [ 3528.459843] Allocated by task 1240: [ 3528.459968] kasan_save_stack+0x1e/0x40 [ 3528.460100] kasan_save_track+0x10/0x30 [ 3528.460230] __kasan_slab_alloc+0x85/0x90 [ 3528.460357] kmem_cache_alloc_node_noprof+0x12c/0x360 [ 3528.460488] copy_process+0x372/0x8470 [ 3528.460617] kernel_clone+0xa6/0x620 [ 3528.460744] __do_sys_clone3+0x109/0x140 [ 3528.460873] do_syscall_64+0x93/0x180 [ 3528.460998] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 3528.461246] Freed by task 0: [ 3528.461368] kasan_save_stack+0x1e/0x40 [ 3528.461499] kasan_save_track+0x10/0x30 [ 3528.461628] kasan_save_free_info+0x37/0x70 [ 3528.461757] poison_slab_object+0x109/0x180 [ 3528.461875] __kasan_slab_free+0x2e/0x50 [ 3528.461988] kmem_cache_free+0x17d/0x450 [ 3528.462108] delayed_put_task_struct+0x16a/0x1f0 [ 3528.462226] rcu_do_batch+0x368/0xd50 [ 3528.462342] rcu_core+0x6d5/0xb60 [ 3528.462458] handle_softirqs+0x1b4/0x770 [ 3528.462572] __irq_exit_rcu+0xbb/0x1c0 [ 3528.462683] irq_exit_rcu+0xa/0x30 [ 3528.462786] sysvec_apic_timer_interrupt+0x9d/0xc0 [ 3528.462891] asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 3528.463095] Last potentially related work creation: [ 3528.463198] kasan_save_stack+0x1e/0x40 [ 3528.463305] __kasan_record_aux_stack+0xad/0xc0 [ 3528.463411] __call_rcu_common.constprop.0+0xae/0xe80 [ 3528.463519] __schedule+0xfd8/0x5ee0 [ 3528.463622] schedule_idle+0x52/0x80 [ 3528.463725] do_idle+0x25e/0x3d0 [ 3528.463828] cpu_startup_entry+0x50/0x60 [ 3528.463925] start_secondary+0x201/0x280 [ 3528.464023] common_startup_64+0x13e/0x141 [ 3528.464209] Second to last potentially related work creation: [ 3528.464306] kasan_save_stack+0x1e/0x40 [ 3528.464396] __kasan_record_aux_stack+0xad/0xc0 [ 3528.464485] task_work_add+0x1bd/0x270 [ 3528.464574] sched_tick+0x2c0/0x9d0 [ 3528.464662] update_process_times+0xd5/0x130 [ 3528.464753] tick_nohz_handler+0x1ae/0x4a0 [ 3528.464839] __hrtimer_run_queues+0x164/0x880 [ 3528.464923] hrtimer_interrupt+0x2f1/0x7f0 [ 3528.465007] __sysvec_apic_timer_interrupt+0xfd/0x3d0 [ 3528.465091] sysvec_apic_timer_interrupt+0x98/0xc0 [ 3528.465173] asm_sysvec_apic_timer_interrupt+0x16/0x20 [ 3528.465352] The buggy address belongs to the object at ffff88814a040000 which belongs to the cache task_struct of size 13128 [ 3528.465457] The buggy address is located 52 bytes inside of freed 13128-byte region [ffff88814a040000, ffff88814a043348) [ 3528.465629] The buggy address belongs to the physical page: [ 3528.465712] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x14a040 [ 3528.465802] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 3528.465888] memcg:ffff88812c785601 [ 3528.465963] flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff) [ 3528.466045] page_type: 0xfdffffff(slab) [ 3528.466120] raw: 0017ffffc0000040 ffff8881002cea00 ffffea00041e4000 dead000000000004 [ 3528.466198] raw: 0000000000000000 0000000080020002 00000001fdffffff ffff88812c785601 [ 3528.466275] head: 0017ffffc0000040 ffff8881002cea00 ffffea00041e4000 dead000000000004 [ 3528.466351] head: 0000000000000000 0000000080020002 00000001fdffffff ffff88812c785601 [ 3528.466428] head: 0017ffffc0000003 ffffea0005281001 ffffffffffffffff 0000000000000000 [ 3528.466504] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ 3528.466579] page dumped because: kasan: bad access detected [ 3528.466717] Memory state around the buggy address: [ 3528.466789] ffff88814a03ff00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 3528.466860] ffff88814a03ff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 3528.466931] >ffff88814a040000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 3528.467001] ^ [ 3528.467069] ffff88814a040080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 3528.467135] ffff88814a040100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 3528.467201] ==================================================================