>>
>> Hi Dongli,
>>
>> > On 1/1/24 14:06, Tao Lyu wrote:
>> >> Hello Arnabjyoti, Sean, and everyone,
>> >>
>> >> I'm having a similar but slightly different issue about the rdtsc in KVM.
>> >>
>> >> I want to obtain the timestamp counter of physical/host machine inside the VMs.
>> >>
>> >> According to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
>> >> I specify the corresponding parameters in the qemu arguments.
>> >> The booting command is listed below:
>> >>
>> >> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
>> >>
>> >>
>> >> But the rdtsc still returns the adjusted tsc.
>> >> The vmxcap script shows the TSC settings as below:
>> >>
>> >>   Use TSC offsetting                       no
>> >>   RDTSC exiting                            no
>> >>   Enable RDTSCP                            no
>> >>   TSC scaling                              yes
>> >>
>> >>
>> >> I would really appreciate it if anyone can tell me whether and how I can get the tsc of physical machine inside the VM.
>>
>> > If the objective is to obtain the same tsc at both VM and host side (that is, to
>> > avoid any offset or scaling), I can obtain quite close tsc at both VM and host
>> > side with the below linux-6.6 change.
>>
>> > My env does not use tsc scaling.
>>
>> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> > index 41cce50..b102dcd 100644
>> > --- a/arch/x86/kvm/x86.c
>> > +++ b/arch/x86/kvm/x86.c
>> > @@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>> >         bool synchronizing = false;
>> >
>> >         raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>> > -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
>> > +       offset = 0;
>> >         ns = get_kvmclock_base_ns();
>> >         elapsed = ns - kvm->arch.last_tsc_nsec;
>> >
>> > Dongli Zhang
>>
>>
>> Hi Dongli,
>>
>> Thank you so much for the explanation and for providing a patch.
>> It works for me now.
>
>Yeah, during vCPU creation KVM sets a target guest TSC of '0', i.e. sets the TSC
>offset to "0 - HOST_TSC".  As of commit 828ca89628bf ("KVM: x86: Expose TSC offset
>controls to userspace"), userspace can explicitly set an offset of '0' via
>KVM_VCPU_TSC_CTRL+KVM_VCPU_TSC_OFFSET, but AFAIK QEMU doesn't support that API.
>
>All the other methods for setting the TSC offset are indirect, i.e. userspace
>provides the target TSC and KVM computes the offset.  So even if QEMU provides
>a way to specify an explicit TSC (or offset), there will be a healthy amount of slop.

Hi Sean and Dongli,

Thank you so much for the replies.

After applying the patch above to force the TSC offset to zero, I can get the host TSC value from the guest.
However, when booting the host kernel, I get the following warnings:

[ 113.033750] ------------[ cut here ]------------
[ 113.033768] NETDEV WATCHDOG: enxb03af61ad78a (rndis_host): transmit queue 0 timed out
[ 113.033802] WARNING: CPU: 42 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x264/0x270
[ 113.033829] Modules linked in: nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio socwatch2_15(OE) vtsspp(OE) vhost_net vhost vhost_iotlb tap sep5(OE) socperf3(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc overlay cuse pax(OE) ipmi_ssif zram intel_rapl_msr intel_rapl_common i10nm_edac x86_pkg_temp_thermal intel_powerclamp coretemp joydev input_leds nls_iso8859_1 kvm_intel ast hid_generic drm_vram_helper drm_ttm_helper kvm rndis_host ttm usbhid cdc_ether usbnet hid drm_kms_helper mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd cec i2c_algo_bit rapl fb_sys_fops syscopyarea intel_cstate sysfillrect i40e sysimgblt isst_if_mbox_pci mei_me ioatdma ahci
[ 113.034307] isst_if_mmio i2c_i801 mei intel_pch_thermal isst_if_common acpi_ipmi libahci dca i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler nfit acpi_pad acpi_power_meter mac_hid sch_fq_codel binfmt_misc ramoops drm reed_solomon efi_pstore sunrpc ip_tables x_tables autofs4
[ 113.034473] CPU: 42 PID: 0 Comm: swapper/42 Kdump: loaded Tainted: G OE 5.15.0+ #4
[ 113.034486] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
[ 113.034495] RIP: 0010:dev_watchdog+0x264/0x270
[ 113.034511] Code: eb a6 48 8b 5d d0 c6 05 e6 47 0a 01 01 48 89 df e8 91 c4 f9 ff 44 89 e1 48 89 de 48 c7 c7 68 c2 a8 a9 48 89 c2 e8 90 3e 16 00 <0f> 0b eb 83 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
[ 113.034522] RSP: 0018:ffffa6e88d79ce78 EFLAGS: 00010286
[ 113.034541] RAX: 0000000000000000 RBX: ffff959763ddb000 RCX: 000000000000083f
[ 113.034551] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
[ 113.034559] RBP: ffffa6e88d79ceb0 R08: 0000000000000000 R09: ffffa6e88d79cc60
[ 113.034565] R10: ffffa6e88d79cc58 R11: ffff96163ff26c28 R12: 0000000000000000
[ 113.034572] R13: ffff95976488ac80 R14: 0000000000000001 R15: ffff959763ddb4c0
[ 113.034579] FS: 0000000000000000(0000) GS:ffff961541980000(0000) knlGS:0000000000000000
[ 113.034588] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 113.034595] CR2: 000055c5819e40f8 CR3: 0000004740c0a006 CR4: 0000000000772ee0
[ 113.034604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 113.034614] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 113.034620] PKRU: 55555554
[ 113.034629] Call Trace:
[ 113.034636] <IRQ>
[ 113.034662] call_timer_fn+0x29/0x100
[ 113.034680] __run_timers.part.0+0x1cf/0x240
[ 113.034757] run_timer_softirq+0x2a/0x50
[ 113.034768] __do_softirq+0xcb/0x274
[ 113.034790] irq_exit_rcu+0x8c/0xb0
[ 113.034807] sysvec_apic_timer_interrupt+0x7c/0x90
[ 113.034823] </IRQ>
[ 113.034828] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 113.034841] RIP: 0010:cpuidle_enter_state+0xcc/0x360
[ 113.034861] Code: 3d c1 f7 26 57 e8 c4 09 74 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 35 15 74 ff 80 7d d7 00 0f 85 01 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 0d 01 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89
[ 113.034870] RSP: 0018:ffffa6e8809b7e68 EFLAGS: 00000246
[ 113.034884] RAX: ffff9615419a8dc0 RBX: 0000000000000002 RCX: 000000000000001f
[ 113.034892] RDX: 0000000000000000 RSI: 000000003158cc4a RDI: 0000000000000000
[ 113.034900] RBP: ffffa6e8809b7ea0 R08: 0000001a5155ea28 R09: 0000000000000018
[ 113.034909] R10: 000000000006e15c R11: ffffffffa9e4b960 R12: ffffc6e87a991800
[ 113.034917] R13: ffffffffa9e4b960 R14: 0000001a5155ea28 R15: 0000000000000002
[ 113.034942] cpuidle_enter+0x2e/0x40
[ 113.034953] do_idle+0x1ff/0x2a0
[ 113.034966] cpu_startup_entry+0x20/0x30
[ 113.034979] start_secondary+0x11a/0x150
[ 113.034991] secondary_startup_64_no_verify+0xb0/0xbb
[ 113.035008] ---[ end trace f39ffcbabd5dfe2e ]---
[ 533.511262] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 89916ns, attempt 4, marking unstable
[ 533.511295] tsc: Marking TSC unstable due to clocksource watchdog
[ 533.511336] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 533.511339] sched_clock: Marking unstable (533409196195, 102131418)<-(533549406780, -38078705)
[ 533.512368] clocksource: Checking clocksource tsc synchronization from CPU 35 to CPUs 0,3,21-22,36,50,54.
[ 533.513146] clocksource: Switched to clocksource hpet

After a while, the guest kernel reports the following errors, and then the network stops working.
If I reboot the guest VM, it gets stuck and never finishes rebooting.

rcu: INFO: rcu_sched self-detected stall on CPU
[ 336.374152] rcu: 3-...!: (1 GPs behind) idle=bb3/0/0x1 softirq=3087/3087 fqs=0
[ 336.379018] rcu: rcu_sched kthread timer wakeup didn't happen for 39086 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 336.386045] rcu: Possible timer handling issue on cpu=1 timer-softirq=871
[ 336.390353] rcu: rcu_sched kthread starved for 39089 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 336.395881] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 336.400375] rcu: RCU grace-period kthread stack dump:
[ 336.404091] rcu: Stack dump where RCU GP kthread last ran:
[ 566.795685] rcu: INFO: rcu_sched self-detected stall on CPU
[ 566.799315] rcu: 3-...!: (1 ticks this GP) idle=c65/0/0x1 softirq=3088/3088 fqs=1
[ 566.804170] rcu: rcu_sched kthread timer wakeup didn't happen for 229687 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 566.811259] rcu: Possible timer handling issue on cpu=1 timer-softirq=872
[ 566.815579] rcu: rcu_sched kthread starved for 229690 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
[ 566.821190] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 566.825813] rcu: RCU grace-period kthread stack dump:
[ 566.829513] rcu: Stack dump where RCU GP kthread last ran:

It looks like not adjusting the guest TSC value leads to kernel misbehavior.

Our goal is to get nearly synchronized TSC values among KVM VMs on the same host.

I now fix the host CPU frequency, so the TSC offset, which can be read from
"/sys/kernel/debug/kvm/qemu-PID/vcpu0/tsc-offset", stays constant.
Every time I execute rdtsc inside the guest, I subtract the offset from the
result to get a value that should be close to the host TSC.

Do you think this makes sense?

Thank you in advance.

Best,
Tao
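
For reference, a minimal sketch of the subtraction described above. It assumes TSC scaling is not in use, so that the guest reads host TSC plus the signed per-vCPU offset, and that the offset has somehow been passed into the guest (the debugfs file lives on the host). The helper names guest_to_host_tsc(), estimated_host_tsc() and self_check() are made up for illustration and are not part of KVM or QEMU.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumption: without TSC scaling the guest observes
 *     guest_tsc = host_tsc + tsc_offset   (mod 2^64)
 * where tsc_offset is the signed per-vCPU offset. The host value is
 * therefore recovered by subtracting the offset again.
 */
static inline uint64_t guest_to_host_tsc(uint64_t guest_tsc, int64_t tsc_offset)
{
	/* Unsigned arithmetic wraps modulo 2^64, mirroring the addition above. */
	return guest_tsc - (uint64_t)tsc_offset;
}

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>

/* Read the TSC in the guest and undo an offset supplied by the host side. */
static inline uint64_t estimated_host_tsc(int64_t tsc_offset)
{
	return guest_to_host_tsc(__rdtsc(), tsc_offset);
}
#endif

/* Worked example with made-up numbers, following the "0 - HOST_TSC" rule. */
void self_check(void)
{
	uint64_t host_at_creation = 1000000;           /* host TSC when the vCPU was created */
	int64_t offset = -(int64_t)host_at_creation;   /* target guest TSC of 0 */
	uint64_t host_now = 1500000;
	uint64_t guest_now = host_now + (uint64_t)offset;

	assert(guest_now == 500000);
	assert(guest_to_host_tsc(guest_now, offset) == host_now);
}
```

The result is only as good as the offset that was sampled: if KVM rewrites the offset later (for example after a guest TSC write), the value passed to the guest has to be refreshed.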