Hi Tao,

On 1/2/24 10:20, Tao Lyu wrote:
>>>
>>> Hi Dongli,
>>>
>>>> On 1/1/24 14:06, Tao Lyu wrote:
>>>>> Hello Arnabjyoti, Sean, and everyone,
>>>>>
>>>>> I'm having a similar but slightly different issue with rdtsc in KVM.
>>>>>
>>>>> I want to obtain the timestamp counter of the physical/host machine inside the VMs.
>>>>>
>>>>> According to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
>>>>> I specify the corresponding parameters in the qemu arguments.
>>>>> The boot command is listed below:
>>>>>
>>>>> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
>>>>>
>>>>>
>>>>> But rdtsc still returns the adjusted tsc.
>>>>> The vmxcap script shows the TSC settings as below:
>>>>>
>>>>> Use TSC offsetting    no
>>>>> RDTSC exiting         no
>>>>> Enable RDTSCP         no
>>>>> TSC scaling           yes
>>>>>
>>>>>
>>>>> I would really appreciate it if anyone can tell me whether and how I can get the tsc of the physical machine inside the VM.
>>>
>>>> If the objective is to obtain the same tsc at both VM and host side (that is, to
>>>> avoid any offset or scaling), I can obtain quite close tsc at both VM and host
>>>> side with the below linux-6.6 change.
>>>
>>>> My env does not use tsc scaling.
>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 41cce50..b102dcd 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>>>>         bool synchronizing = false;
>>>>
>>>>         raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>>>> -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
>>>> +       offset = 0;
>>>>         ns = get_kvmclock_base_ns();
>>>>         elapsed = ns - kvm->arch.last_tsc_nsec;
>>>>
>>>> Dongli Zhang
>>>
>>>
>>> Hi Dongli,
>>>
>>> Thank you so much for the explanation and for providing a patch.
>>> It works for me now.
>>
>> Yeah, during vCPU creation KVM sets a target guest TSC of '0', i.e. sets the TSC
>> offset to "0 - HOST_TSC". As of commit 828ca89628bf ("KVM: x86: Expose TSC offset
>> controls to userspace"), userspace can explicitly set an offset of '0' via
>> KVM_VCPU_TSC_CTRL+KVM_VCPU_TSC_OFFSET, but AFAIK QEMU doesn't support that API.
>>
>> All the other methods for setting the TSC offset are indirect, i.e. userspace
>> provides the target TSC and KVM computes the offset. So even if QEMU provides a
>> way to specify an explicit TSC (or offset), there will be a healthy amount of slop.
>
>
> Hi Sean and Dongli,
>
> Thank you so much for the replies.
>
> Unfortunately, after adding the following patch to reset the TSC OFFSET forcefully,
> I can get the host TSC value from the guest.
>
> However, when booting the host kernel, it has the following WARNINGS:

My test patch will not impact the host time when booting the host kernel. It
will not take effect until the VM is created. Therefore, I guess the below is
due to other reasons in your host kernel.
>
>
> [ 113.033750] ------------[ cut here ]------------
> [ 113.033768] NETDEV WATCHDOG: enxb03af61ad78a (rndis_host): transmit queue 0 timed out
> [ 113.033802] WARNING: CPU: 42 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x264/0x270
> [ 113.033829] Modules linked in: nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio socwatch2_15(OE) vtsspp(OE) vhost_net vhost vhost_iotlb tap sep5(OE) socperf3(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc overlay cuse pax(OE) ipmi_ssif zram intel_rapl_msr intel_rapl_common i10nm_edac x86_pkg_temp_thermal intel_powerclamp coretemp joydev input_leds nls_iso8859_1 kvm_intel ast hid_generic drm_vram_helper drm_ttm_helper kvm rndis_host ttm usbhid cdc_ether usbnet hid drm_kms_helper mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd cec i2c_algo_bit rapl fb_sys_fops syscopyarea intel_cstate sysfillrect i40e sysimgblt isst_if_mbox_pci mei_me ioatdma ahci
> [ 113.034307] isst_if_mmio i2c_i801 mei intel_pch_thermal isst_if_common acpi_ipmi libahci dca i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler nfit acpi_pad acpi_power_meter mac_hid sch_fq_codel binfmt_misc ramoops drm reed_solomon efi_pstore sunrpc ip_tables x_tables autofs4
> [ 113.034473] CPU: 42 PID: 0 Comm: swapper/42 Kdump: loaded Tainted: G OE 5.15.0+ #4
> [ 113.034486] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
> [ 113.034495] RIP: 0010:dev_watchdog+0x264/0x270
> [ 113.034511] Code: eb a6 48 8b 5d d0 c6 05 e6 47 0a 01 01 48 89 df e8 91 c4 f9 ff 44 89 e1 48 89 de 48 c7 c7 68 c2 a8 a9 48 89 c2 e8 90 3e 16 00 <0f> 0b eb 83 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
> [ 113.034522] RSP: 0018:ffffa6e88d79ce78 EFLAGS: 00010286
> [ 113.034541] RAX: 0000000000000000 RBX: ffff959763ddb000 RCX: 000000000000083f
> [ 113.034551] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
> [ 113.034559] RBP: ffffa6e88d79ceb0 R08: 0000000000000000 R09: ffffa6e88d79cc60
> [ 113.034565] R10: ffffa6e88d79cc58 R11: ffff96163ff26c28 R12: 0000000000000000
> [ 113.034572] R13: ffff95976488ac80 R14: 0000000000000001 R15: ffff959763ddb4c0
> [ 113.034579] FS: 0000000000000000(0000) GS:ffff961541980000(0000) knlGS:0000000000000000
> [ 113.034588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 113.034595] CR2: 000055c5819e40f8 CR3: 0000004740c0a006 CR4: 0000000000772ee0
> [ 113.034604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 113.034614] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 113.034620] PKRU: 55555554
> [ 113.034629] Call Trace:
> [ 113.034636] <IRQ>
> [ 113.034662] call_timer_fn+0x29/0x100
> [ 113.034680] __run_timers.part.0+0x1cf/0x240
> [ 113.034757] run_timer_softirq+0x2a/0x50
> [ 113.034768] __do_softirq+0xcb/0x274
> [ 113.034790] irq_exit_rcu+0x8c/0xb0
> [ 113.034807] sysvec_apic_timer_interrupt+0x7c/0x90
> [ 113.034823] </IRQ>
> [ 113.034828] asm_sysvec_apic_timer_interrupt+0x12/0x20
> [ 113.034841] RIP: 0010:cpuidle_enter_state+0xcc/0x360
> [ 113.034861] Code: 3d c1 f7 26 57 e8 c4 09 74 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 35 15 74 ff 80 7d d7 00 0f 85 01 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 0d 01 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89
> [ 113.034870] RSP: 0018:ffffa6e8809b7e68 EFLAGS: 00000246
> [ 113.034884] RAX: ffff9615419a8dc0 RBX: 0000000000000002 RCX: 000000000000001f
> [ 113.034892] RDX: 0000000000000000 RSI: 000000003158cc4a RDI: 0000000000000000
> [ 113.034900] RBP: ffffa6e8809b7ea0 R08: 0000001a5155ea28 R09: 0000000000000018
> [ 113.034909] R10: 000000000006e15c R11: ffffffffa9e4b960 R12: ffffc6e87a991800
> [ 113.034917] R13: ffffffffa9e4b960 R14: 0000001a5155ea28 R15: 0000000000000002
> [ 113.034942] cpuidle_enter+0x2e/0x40
> [ 113.034953] do_idle+0x1ff/0x2a0
> [ 113.034966] cpu_startup_entry+0x20/0x30
> [ 113.034979] start_secondary+0x11a/0x150
> [ 113.034991] secondary_startup_64_no_verify+0xb0/0xbb
> [ 113.035008] ---[ end trace f39ffcbabd5dfe2e ]---
>
> [ 533.511262] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 89916ns, attempt 4, marking unstable
> [ 533.511295] tsc: Marking TSC unstable due to clocksource watchdog
> [ 533.511336] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> [ 533.511339] sched_clock: Marking unstable (533409196195, 102131418)<-(533549406780, -38078705)
> [ 533.512368] clocksource: Checking clocksource tsc synchronization from CPU 35 to CPUs 0,3,21-22,36,50,54.
> [ 533.513146] clocksource: Switched to clocksource hpet
>
>
> And after a while, the guest kernel reports the following error, and then the network doesn't work anymore.
> If I reboot the guest VM, it gets stuck and cannot be rebooted successfully.
>
> rcu: INFO: rcu_sched self-detected stall on CPU
> [ 336.374152] rcu: 3-...!: (1 GPs behind) idle=bb3/0/0x1 softirq=3087/3087 fqs=0
> [ 336.379018] rcu: rcu_sched kthread timer wakeup didn't happen for 39086 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [ 336.386045] rcu: Possible timer handling issue on cpu=1 timer-softirq=871
> [ 336.390353] rcu: rcu_sched kthread starved for 39089 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> [ 336.395881] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 336.400375] rcu: RCU grace-period kthread stack dump:
> [ 336.404091] rcu: Stack dump where RCU GP kthread last ran:
> [ 566.795685] rcu: INFO: rcu_sched self-detected stall on CPU
> [ 566.799315] rcu: 3-...!: (1 ticks this GP) idle=c65/0/0x1 softirq=3088/3088 fqs=1
> [ 566.804170] rcu: rcu_sched kthread timer wakeup didn't happen for 229687 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [ 566.811259] rcu: Possible timer handling issue on cpu=1 timer-softirq=872
> [ 566.815579] rcu: rcu_sched kthread starved for 229690 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> [ 566.821190] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [ 566.825813] rcu: RCU grace-period kthread stack dump:
> [ 566.829513] rcu: Stack dump where RCU GP kthread last ran:
>
>
> Looks like it leads to kernel misbehavior if we don't adjust the guest TSC value.
>
> Our goal is to get almost synchronized TSC values among KVM VMs on the same host.
>
> Now I fix the host CPU frequency. Then the TSC OFFSET, which can be read under "/sys/kernel/debug/kvm/qemu-PID/vcpu0/tsc-offset", always stays constant.
>
> Every time I execute rdtsc inside the guest, I subtract the offset from the result to get a TSC value that is close to the host TSC value.
>
> Do you think this makes sense?

I assume you do not use TSC scaling or any nested virtualization. Therefore, the
value in debugfs should be the same as the one computed by
"kvm_compute_l1_tsc_offset(vcpu, data)". You may use printk to double-check.

I think it makes sense.

However, I do not think the patch causes the issue. It works in my
environment.

Thank you very much!

Dongli Zhang

>
> Thank you in advance
>
> Best,
> Tao
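
For reference, the offset arithmetic discussed in this thread can be sketched as below. This is only a minimal illustration, written in Python (the language of the vmxcap script mentioned earlier), and it assumes no TSC scaling and no nested virtualization, so that guest TSC = host TSC + offset modulo 2^64. All function names are hypothetical. The assumption that the debugfs tsc-offset file holds a single signed decimal number should be verified on your kernel, and the debugfs path is passed in by the caller because its exact directory name depends on the running VM.

```python
MASK64 = (1 << 64) - 1  # the TSC is a 64-bit counter, so arithmetic wraps


def to_signed64(value: int) -> int:
    """Interpret a 64-bit pattern as a signed (two's complement) integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def offset_for_target(target_guest_tsc: int, host_tsc: int) -> int:
    """Offset that makes the guest read target_guest_tsc when the host reads host_tsc.

    With a target of 0 this is the "0 - HOST_TSC" case described in the thread.
    """
    return to_signed64(target_guest_tsc - host_tsc)


def guest_to_host_tsc(guest_tsc: int, offset: int) -> int:
    """Recover the host TSC from a guest rdtsc value by subtracting the offset."""
    return (guest_tsc - offset) & MASK64


def parse_tsc_offset(text: str) -> int:
    """Parse the debugfs file content (assumed: one signed decimal number)."""
    return int(text.strip())


def read_tsc_offset(path: str) -> int:
    """Read the offset from a debugfs file such as .../vcpu0/tsc-offset (needs root)."""
    with open(path) as f:
        return parse_tsc_offset(f.read())
```

In use, the offset would be read once on the host with read_tsc_offset() and handed to the guest, which then applies guest_to_host_tsc() to each rdtsc result. As noted above, this only holds while the offset stays constant, and the recovered value is close to, not identical with, the host TSC.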