Re: obtain the timestamp counter of physical/host machine inside the VMs.

Dongli Zhang <dongli.zhang@xxxxxxxxxx> · Tue, 2 Jan 2024 11:36:01 -0800

Hi Tao,

On 1/2/24 10:20, Tao Lyu wrote:
>>>
>>> Hi Dongli,
>>>
>>>> On 1/1/24 14:06, Tao Lyu wrote:
>>>>> Hello Arnabjyoti, Sean, and everyone,
>>>>>
>>>>> I'm having a similar but slightly different issue with rdtsc in KVM.
>>>>>
>>>>> I want to obtain the timestamp counter of physical/host machine inside the VMs.
>>>>>
>>>>> According to the previous threads, I know I need to disable the offsetting, VM exit, and scaling.
>>>>> I specify the corresponding parameters in the qemu arguments.
>>>>> The booting command is listed below:
>>>>>
>>>>> qemu-system-x86_64 -m 10240 -smp 4 -chardev socket,id=SOCKSYZ,server=on,nowait,host=localhost,port=3258 -mon chardev=SOCKSYZ,mode=control -display none -serial stdio -device virtio-rng-pci -enable-kvm -cpu host,migratable=off,tsc=on,rdtscp=on,vmx-tsc-offset=off,vmx-rdtsc-exit=off,tsc-scale=off,tsc-adjust=off,vmx-rdtscp-exit=off   -netdev bridge,id=hn40 -device virtio-net,netdev=hn40,mac=e6:c8:ff:09:76:38 -hda XXX -kernel XXX -append "root=/dev/sda console=ttyS0"
>>>>>
>>>>>
>>>>> But the rdtsc still returns the adjusted tsc.
>>>>> The vmxcap script shows the TSC settings as below:
>>>>>    
>>>>>    Use TSC offsetting                       no
>>>>>    RDTSC exiting                            no
>>>>>    Enable RDTSCP                            no
>>>>>    TSC scaling                              yes
>>>>>
>>>>>
>>>>> I would really appreciate it if anyone can tell me whether and how I can get the tsc of the physical machine inside the VM.
>>>
>>>> If the objective is to obtain the same tsc on both the VM and host side (that
>>>> is, to avoid any offset or scaling), I get very close tsc values on both sides
>>>> with the linux-6.6 change below.
>>>
>>>> My env does not use tsc scaling.
>>>
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index 41cce50..b102dcd 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -2723,7 +2723,7 @@ static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 data)
>>>>         bool synchronizing = false;
>>>>
>>>>        raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
>>>> -       offset = kvm_compute_l1_tsc_offset(vcpu, data);
>>>> +       offset = 0;
>>>>         ns = get_kvmclock_base_ns();
>>>>         elapsed = ns - kvm->arch.last_tsc_nsec;
>>>>
>>>> Dongli Zhang
>>>
>>>
>>> Hi Dongli,
>>>
>>> Thank you so much for the explanation and for providing a patch.
>>> It works for me now.
>>
>> Yeah, during vCPU creation KVM sets a target guest TSC of '0', i.e. sets the TSC
>> offset to "0 - HOST_TSC".  As of commit 828ca89628bf ("KVM: x86: Expose TSC offset
>> controls to userspace"), userspace can explicitly set an offset of '0' via
>> KVM_VCPU_TSC_CTRL+KVM_VCPU_TSC_OFFSET, but AFAIK QEMU doesn't support that API.
>>
>> All the other methods for setting the TSC offset are indirect, i.e. userspace
>> provides the target TSC and KVM computes the offset.  So even if QEMU provides a
>> way to specify an explicit TSC (or offset), there will be a healthy amount of slop.
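
For reference, the KVM_VCPU_TSC_CTRL/KVM_VCPU_TSC_OFFSET attribute mentioned
above is set with KVM_SET_DEVICE_ATTR on the vCPU file descriptor. Below is a
minimal sketch, not taken from QEMU or from this thread. It assumes a vcpu_fd
already obtained from KVM_CREATE_VCPU and x86 kernel headers recent enough to
define the attribute; the helper name set_vcpu_tsc_offset() is hypothetical,
and the fallback #defines are an assumption so that the sketch builds against
older headers.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Assumption: fallback values for headers that predate the attribute.
 * Prefer the definitions from the real x86 uapi header when available.
 */
#ifndef KVM_VCPU_TSC_CTRL
#define KVM_VCPU_TSC_CTRL 0
#endif
#ifndef KVM_VCPU_TSC_OFFSET
#define KVM_VCPU_TSC_OFFSET 0
#endif

/*
 * Hypothetical helper: ask KVM to use 'offset' as the guest TSC offset,
 * i.e. guest_tsc = host_tsc + offset (no scaling assumed).
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_vcpu_tsc_offset(int vcpu_fd, uint64_t offset)
{
    struct kvm_device_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.group = KVM_VCPU_TSC_CTRL;
    attr.attr = KVM_VCPU_TSC_OFFSET;
    /* KVM reads the 64-bit offset from this userspace address. */
    attr.addr = (uint64_t)(uintptr_t)&offset;

    return ioctl(vcpu_fd, KVM_SET_DEVICE_ATTR, &attr);
}
```

Calling set_vcpu_tsc_offset(vcpu_fd, 0) would request the same zero offset
that the test patch hard-codes in kvm_synchronize_tsc(), but the VMM has to
issue the call itself, and as noted above QEMU apparently does not.
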
> 
> 
> Hi Sean and Dongli,
> 
> Thank you so much for the replies.
> 
> Unfortunately, after I added the following patch to forcefully reset the TSC OFFSET,
> I can get the host TSC value from the guest.
> 
> However, when booting the host kernel, it has the following WARNINGS:

My test patch does not affect the host time while the host kernel is booting.
It does not take effect until a VM is created.

Therefore, I guess the below is due to other reasons in your host kernel.

> 
> 
> [  113.033750] ------------[ cut here ]------------
> [  113.033768] NETDEV WATCHDOG: enxb03af61ad78a (rndis_host): transmit queue 0 timed out
> [  113.033802] WARNING: CPU: 42 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x264/0x270
> [  113.033829] Modules linked in: nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio socwatch2_15(OE) vtsspp(OE) vhost_net vhost vhost_iotlb tap sep5(OE) socperf3(OE) xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables libcrc32c nfnetlink ip6table_filter ip6_tables iptable_filter bpfilter bridge stp llc overlay cuse pax(OE) ipmi_ssif zram intel_rapl_msr intel_rapl_common i10nm_edac x86_pkg_temp_thermal intel_powerclamp coretemp joydev input_leds nls_iso8859_1 kvm_intel ast hid_generic drm_vram_helper drm_ttm_helper kvm rndis_host ttm usbhid cdc_ether usbnet hid drm_kms_helper mii crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd cec i2c_algo_bit rapl fb_sys_fops syscopyarea intel_cstate sysfillrect i40e sysimgblt isst_if_mbox_pci mei_me ioatdma ahci
> [  113.034307]  isst_if_mmio i2c_i801 mei intel_pch_thermal isst_if_common acpi_ipmi libahci dca i2c_smbus wmi ipmi_si ipmi_devintf ipmi_msghandler nfit acpi_pad acpi_power_meter mac_hid sch_fq_codel binfmt_misc ramoops drm reed_solomon efi_pstore sunrpc ip_tables x_tables autofs4
> [  113.034473] CPU: 42 PID: 0 Comm: swapper/42 Kdump: loaded Tainted: G           OE     5.15.0+ #4
> [  113.034486] Hardware name: Intel Corporation M50CYP2SB1U/M50CYP2SB1U, BIOS SE5C620.86B.01.01.0004.2110190142 10/19/2021
> [  113.034495] RIP: 0010:dev_watchdog+0x264/0x270
> [  113.034511] Code: eb a6 48 8b 5d d0 c6 05 e6 47 0a 01 01 48 89 df e8 91 c4 f9 ff 44 89 e1 48 89 de 48 c7 c7 68 c2 a8 a9 48 89 c2 e8 90 3e 16 00 <0f> 0b eb 83 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41
> [  113.034522] RSP: 0018:ffffa6e88d79ce78 EFLAGS: 00010286
> [  113.034541] RAX: 0000000000000000 RBX: ffff959763ddb000 RCX: 000000000000083f
> [  113.034551] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
> [  113.034559] RBP: ffffa6e88d79ceb0 R08: 0000000000000000 R09: ffffa6e88d79cc60
> [  113.034565] R10: ffffa6e88d79cc58 R11: ffff96163ff26c28 R12: 0000000000000000
> [  113.034572] R13: ffff95976488ac80 R14: 0000000000000001 R15: ffff959763ddb4c0
> [  113.034579] FS:  0000000000000000(0000) GS:ffff961541980000(0000) knlGS:0000000000000000
> [  113.034588] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  113.034595] CR2: 000055c5819e40f8 CR3: 0000004740c0a006 CR4: 0000000000772ee0
> [  113.034604] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  113.034614] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  113.034620] PKRU: 55555554
> [  113.034629] Call Trace:
> [  113.034636]  <IRQ>
> [  113.034662]  call_timer_fn+0x29/0x100
> [  113.034680]  __run_timers.part.0+0x1cf/0x240
> [  113.034757]  run_timer_softirq+0x2a/0x50
> [  113.034768]  __do_softirq+0xcb/0x274
> [  113.034790]  irq_exit_rcu+0x8c/0xb0
> [  113.034807]  sysvec_apic_timer_interrupt+0x7c/0x90
> [  113.034823]  </IRQ>
> [  113.034828]  asm_sysvec_apic_timer_interrupt+0x12/0x20
> [  113.034841] RIP: 0010:cpuidle_enter_state+0xcc/0x360
> [  113.034861] Code: 3d c1 f7 26 57 e8 c4 09 74 ff 49 89 c6 0f 1f 44 00 00 31 ff e8 35 15 74 ff 80 7d d7 00 0f 85 01 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 ff 0f 88 0d 01 00 00 49 63 cf 4c 2b 75 c8 48 8d 04 49 48 89
> [  113.034870] RSP: 0018:ffffa6e8809b7e68 EFLAGS: 00000246
> [  113.034884] RAX: ffff9615419a8dc0 RBX: 0000000000000002 RCX: 000000000000001f
> [  113.034892] RDX: 0000000000000000 RSI: 000000003158cc4a RDI: 0000000000000000
> [  113.034900] RBP: ffffa6e8809b7ea0 R08: 0000001a5155ea28 R09: 0000000000000018
> [  113.034909] R10: 000000000006e15c R11: ffffffffa9e4b960 R12: ffffc6e87a991800
> [  113.034917] R13: ffffffffa9e4b960 R14: 0000001a5155ea28 R15: 0000000000000002
> [  113.034942]  cpuidle_enter+0x2e/0x40
> [  113.034953]  do_idle+0x1ff/0x2a0
> [  113.034966]  cpu_startup_entry+0x20/0x30
> [  113.034979]  start_secondary+0x11a/0x150
> [  113.034991]  secondary_startup_64_no_verify+0xb0/0xbb
> [  113.035008] ---[ end trace f39ffcbabd5dfe2e ]---
> 
> [  533.511262] clocksource: timekeeping watchdog on CPU53: hpet read-back delay of 89916ns, attempt 4, marking unstable
> [  533.511295] tsc: Marking TSC unstable due to clocksource watchdog
> [  533.511336] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> [  533.511339] sched_clock: Marking unstable (533409196195, 102131418)<-(533549406780, -38078705)
> [  533.512368] clocksource: Checking clocksource tsc synchronization from CPU 35 to CPUs 0,3,21-22,36,50,54.
> [  533.513146] clocksource: Switched to clocksource hpet
> 
> 
> After a while, the guest kernel reports the following error, and then the network stops working.
> If I reboot the guest VM, it gets stuck and cannot reboot successfully.
> 
> rcu: INFO: rcu_sched self-detected stall on CPU
> [  336.374152] rcu: 	3-...!: (1 GPs behind) idle=bb3/0/0x1 softirq=3087/3087 fqs=0 
> [  336.379018] rcu: rcu_sched kthread timer wakeup didn't happen for 39086 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [  336.386045] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=871
> [  336.390353] rcu: rcu_sched kthread starved for 39089 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> [  336.395881] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [  336.400375] rcu: RCU grace-period kthread stack dump:
> [  336.404091] rcu: Stack dump where RCU GP kthread last ran:
> [  566.795685] rcu: INFO: rcu_sched self-detected stall on CPU
> [  566.799315] rcu: 	3-...!: (1 ticks this GP) idle=c65/0/0x1 softirq=3088/3088 fqs=1 
> [  566.804170] rcu: rcu_sched kthread timer wakeup didn't happen for 229687 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402
> [  566.811259] rcu: 	Possible timer handling issue on cpu=1 timer-softirq=872
> [  566.815579] rcu: rcu_sched kthread starved for 229690 jiffies! g3941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=1
> [  566.821190] rcu: 	Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
> [  566.825813] rcu: RCU grace-period kthread stack dump:
> [  566.829513] rcu: Stack dump where RCU GP kthread last ran:
> 
> 
> Looks like it leads to kernel misbehavior if we don't adjust the guest TSC value.
> 
> Our goal is to get almost-synchronized TSC values among KVM VMs on the same host.
> 
> I have now fixed the host CPU frequency. The TSC OFFSET, which can be read from "/sys/kernel/debug/kvm/qemu-PID/vcpu0/tsc-offset", then stays constant.
> 
> Every time I execute rdtsc inside the guest, I subtract the offset from the result to get a TSC value that is close to the host TSC value.
> 
> Do you think this makes sense?

I assume you do not use TSC scaling or any nested virtualization.

Therefore, the value in the debugfs should be the same as the one computed by
"kvm_compute_l1_tsc_offset(vcpu, data)". You may use printk to double-check.

I think it makes sense.
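
As an illustration of the subtraction described above, here is a minimal
userspace sketch. It assumes no TSC scaling and no nested virtualization, so
that guest_tsc = host_tsc + offset, and it assumes the debugfs file prints the
offset as a signed decimal number. The function names are hypothetical;
reading rdtsc in the guest and getting the debugfs value from the host into
the guest are left to the caller.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the text read from the host's
 * /sys/kernel/debug/kvm/<pid>/vcpu0/tsc-offset file.
 * Assumption: the value is a signed decimal number.
 * Returns 0 on success, -1 if no number could be parsed.
 */
int parse_tsc_offset(const char *text, int64_t *offset)
{
    char *end;
    long long val;

    errno = 0;
    val = strtoll(text, &end, 10);
    if (end == text || errno == ERANGE)
        return -1;

    *offset = (int64_t)val;
    return 0;
}

/*
 * Hypothetical helper: recover an approximate host TSC value from a
 * guest rdtsc reading. Without scaling, guest_tsc = host_tsc + offset,
 * so host_tsc = guest_tsc - offset. The arithmetic is done on unsigned
 * 64-bit values so that a negative offset wraps correctly.
 */
uint64_t host_tsc_from_guest(uint64_t guest_tsc, int64_t offset)
{
    return guest_tsc - (uint64_t)offset;
}
```

The offset is typically negative, because KVM starts the guest TSC near zero,
so the subtraction increases the guest reading. The result is only as accurate
as the offset is stable; it becomes wrong if the offset changes, for example
after the guest writes its TSC.
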

However, I do not think the patch is what causes the issue. It works in my environment.

Thank you very much!

Dongli Zhang

> 
> Thank you in advance
> 
> Best,
> Tao