On Thu, Dec 03, 2020 at 01:39:42PM +0200, Maxim Levitsky wrote: > On Tue, 2020-12-01 at 16:48 -0300, Marcelo Tosatti wrote: > > On Tue, Dec 01, 2020 at 02:30:39PM +0200, Maxim Levitsky wrote: > > > On Mon, 2020-11-30 at 16:16 -0300, Marcelo Tosatti wrote: > > > > Hi Maxim, > > > > > > > > On Mon, Nov 30, 2020 at 03:35:57PM +0200, Maxim Levitsky wrote: > > > > > Hi! > > > > > > > > > > This is the first version of the work to make TSC migration more accurate, > > > > > as was defined by Paulo at: > > > > > https://www.spinics.net/lists/kvm/msg225525.html > > > > > > > > Description from Oliver's patch: > > > > > > > > "To date, VMMs have typically restored the guest's TSCs by value using > > > > the KVM_SET_MSRS ioctl for each vCPU. However, restoring the TSCs by > > > > value introduces some challenges with synchronization as the TSCs > > > > continue to tick throughout the restoration process. As such, KVM has > > > > some heuristics around TSC writes to infer whether or not the guest or > > > > host is attempting to synchronize the TSCs." > > > > > > > > Not really. The synchronization logic tries to sync TSCs during > > > > BIOS boot (and CPU hotplug), because the TSC values are loaded > > > > sequentially, say: > > > > > > > > CPU realtime TSC val > > > > vcpu0 0 usec 0 > > > > vcpu1 100 usec 0 > > > > vcpu2 200 usec 0 > > > > ... > > > > > > > > And we'd like to see all vcpus to read the same value at all times. > > > > > > > > Other than that, comment makes sense. The problem with live migration > > > > is as follows: > > > > > > > > We'd like the TSC value to be written, ideally, just before the first > > > > VM-entry a vCPU (because at the moment the TSC_OFFSET has been written, > > > > the vcpus tsc is ticking, which will cause a visible forward jump > > > > in vcpus tsc time). > > > > > > > > Before the first VM-entry is the farthest point in time before guest > > > > entry that one could do that. > > > > > > > > The window (or forward jump) between KVM_SET_TSC and VM-entry was about > > > > 100ms last time i checked (which results in a 100ms time jump forward), > > > > See QEMU's 6053a86fe7bd3d5b07b49dae6c05f2cd0d44e687. > > > > > > > > Have we measured any improvement with this patchset? > > > > > > Its not about this window. > > > It is about time that passes between the point that we read the > > > TSC on source system (and we do it in qemu each time the VM is paused) > > > and the moment that we set the same TSC value on the target. > > > That time is unbounded. > > > > OK. Well, its the same problem: ideally you'd want to do that just > > before VCPU-entry. > > > > > Also this patchset should decrease TSC skew that happens > > > between restoring it on multiple vCPUs as well, since > > > KVM_SET_TSC_STATE doesn't have to happen at the same time, > > > as it accounts for time passed on each vCPU. > > > > > > > > > Speaking of kvmclock, somewhat offtopic since this is a different issue, > > > I found out that qemu reads the kvmclock value on each pause, > > > and then 'restores' on unpause, using > > > KVM_SET_CLOCK (this modifies the global kvmclock offset) > > > > > > This means (and I tested it) that if guest uses kvmclock > > > for time reference, it will not account for time passed in > > > the paused state. > > > > Yes, this is necessary because otherwise there might be an overflow > > in the kernel time accounting code (if the clock delta is too large). > > Could you elaborate on this? Do you mean that guest kernel can crash, > when the time 'jumps' too far forward in one go? It can crash (there will a overflow and time will jump backwards). > If so this will happen with kernel using TSC as well, > since we do let the virtual TSC to 'keep running' while VM is suspended, > and the goal of this patchset is to let it 'run' even while > the VM is migrating. True. For the overflow one, perhaps TSC can be stopped (and restored) similarly to KVMCLOCK. See QEMU's commit 00f4d64ee76e873be881a82d893a591487aa7950. > And if there is an issue, we really should try to fix it in > the guest kernel IMHO. > > > > > > > Then Paolo mentions (with >), i am replying as usual. > > > > > > > > > Ok, after looking more at the code with Maxim I can confidently say that > > > > > it's a total mess. And a lot of the synchronization code is dead > > > > > because 1) as far as we could see no guest synchronizes the TSC using > > > > > MSR_IA32_TSC; > > > > > > > > Well, recent BIOS'es take care of synchronizing the TSC. So when Linux > > > > boots, it does not have to synchronize TSC in software. > > > > > > Do you have an example of such BIOS? I tested OVMF which I compiled > > > from git master a few weeks ago, and I also tested this with seabios > > > from qemu repo, and I have never seen writes to either TSC, or TSC_ADJUST > > > from BIOS. > > > > Oh, well, QEMU then. > > > > > Or do you refer to the native BIOS on the host doing TSC synchronization? > > > > No, virt. > > I also (lightly) tested win10 guest, and win10 guest with Hyper-V enabled, > and in both cases I haven't observed TSC/TSC_ADJUST writes. > > > > > > > However, upon migration (and initialization), the KVM_SET_TSC's do > > > > not happen at exactly the same time (the MSRs for each vCPU are loaded > > > > in sequence). The synchronization code in kvm_set_tsc() is for those cases. > > > > > > I agree with that, and this is one of the issues that KVM_SET_TSC_STATE > > > is going to fix, since it accounts for it. > > > > > > > > > > > and 2) writing to MSR_IA32_TSC_ADJUST does not trigger the > > > > > synchronization code in kvm_write_tsc. > > > > > > > > Not familiar how guests are using MSR_IA32_TSC_ADJUST (or Linux)... > > > > Lets see: > > > > > > > > > > > > /* > > > > * Freshly booted CPUs call into this: > > > > */ > > > > void check_tsc_sync_target(void) > > > > { > > > > struct tsc_adjust *cur = this_cpu_ptr(&tsc_adjust); > > > > unsigned int cpu = smp_processor_id(); > > > > cycles_t cur_max_warp, gbl_max_warp; > > > > int cpus = 2; > > > > > > > > /* Also aborts if there is no TSC. */ > > > > if (unsynchronized_tsc()) > > > > return; > > > > > > > > /* > > > > * Store, verify and sanitize the TSC adjust register. If > > > > * successful skip the test. > > > > * > > > > * The test is also skipped when the TSC is marked reliable. This > > > > * is true for SoCs which have no fallback clocksource. On these > > > > * SoCs the TSC is frequency synchronized, but still the TSC ADJUST > > > > * register might have been wreckaged by the BIOS.. > > > > */ > > > > if (tsc_store_and_check_tsc_adjust(false) || tsc_clocksource_reliable) { > > > > atomic_inc(&skip_test); > > > > return; > > > > } > > > > > > > > retry: > > > > > > > > I'd force that synchronization path to be taken as a test-case. > > > > > > Or even better as I suggested, we might tell the guest kernel > > > to avoid this synchronization path when KVM is detected > > > (regardless of invtsc flag) > > > > > > > > > > > > I have a few thoughts about the kvm masterclock synchronization, > > > > > which relate to the Paulo's proposal that I implemented. > > > > > > > > > > The idea of masterclock is that when the host TSC is synchronized > > > > > (or as kernel call it, stable), and the guest TSC is synchronized as well, > > > > > then we can base the kvmclock, on the same pair of > > > > > (host time in nsec, host tsc value), for all vCPUs. > > > > > > > > We _have_ to base. See the comment which starts with > > > > > > > > "Assuming a stable TSC across physical CPUS, and a stable TSC" > > > > > > > > at x86.c. > > > > > > > > > This makes the random error in calculation of this value invariant > > > > > across vCPUS, and allows the guest to do kvmclock calculation in userspace > > > > > (vDSO) since kvmclock parameters are vCPU invariant. > > > > > > > > Actually, without synchronized host TSCs (and the masterclock scheme, > > > > with a single base read from a vCPU), kvmclock in kernel is buggy as > > > > well: > > > > > > > > u64 pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) > > > > { > > > > unsigned version; > > > > u64 ret; > > > > u64 last; > > > > u8 flags; > > > > > > > > do { > > > > version = pvclock_read_begin(src); > > > > ret = __pvclock_read_cycles(src, rdtsc_ordered()); > > > > flags = src->flags; > > > > } while (pvclock_read_retry(src, version)); > > > > > > > > if (unlikely((flags & PVCLOCK_GUEST_STOPPED) != 0)) { > > > > src->flags &= ~PVCLOCK_GUEST_STOPPED; > > > > pvclock_touch_watchdogs(); > > > > } > > > > > > > > if ((valid_flags & PVCLOCK_TSC_STABLE_BIT) && > > > > (flags & PVCLOCK_TSC_STABLE_BIT)) > > > > return ret; > > > > > > > > The code that follows this (including cmpxchg) is a workaround for that > > > > bug. > > > > > > I understand that. I am not arguing that we shoudn't use the masterclock! > > > I am just saying the facts about the condition when it works. > > > > Sure. > > > > > > Workaround would require each vCPU to write to a "global clock", on > > > > every clock read. > > > > > > > > > To ensure that the guest tsc is synchronized we currently track host/guest tsc > > > > > writes, and enable the master clock only when roughly the same guest's TSC value > > > > > was written across all vCPUs. > > > > > > > > Yes, because then you can do: > > > > > > > > vcpu0 vcpu1 > > > > > > > > A = read TSC > > > > ... elapsed time ... > > > > > > > > B = read TSC > > > > > > > > delta = B - A > > > > > > > > > Recently this was disabled by Paulo > > > > > > > > What was disabled exactly? > > > > > > The running of tsc synchronization code when the _guest_ writes the TSC. > > > > > > Which changes two things: > > > 1. If the guest de-synchronizes its TSC, we won't disable master clock. > > > 2. If the guest writes similar TSC values on each vCPU we won't detect > > > this as synchronization attempt, replace this with exactly the same > > > value and finally re-enable the master clock. > > > > > > I argue that this change is OK, because Linux guests don't write to TSC at all, > > > the virtual BIOSes seems not to write there either, and the only case in which > > > the Linux guest tries to change its TSC is on CPU hotplug as you mention and > > > it uses TSC_ADJUST, that currently doesn't trigger TSC synchronization code in > > > KVM anyway, so it is broken already. > > > > > > However I also argue that we should mention this in documentation just in case, > > > and we might also want (also just in case) to make Linux guests avoid even trying to > > > touch TSC_ADJUST register when running under KVM. > > > > > > To rehash my own words, the KVM_CLOCK_TSC_STABLE should be defined as: > > > 'kvmclock is vCPU invariant, as long as the guest doesn't mess with its TSC'. > > > > > > Having said all that, now that I know tsc sync code, and the > > > reasons why it is there, I wouldn't be arguing about putting it back either. > > > > Agree. > > > > > > > and I agree with this, because I think > > > > > that we indeed should only make the guest TSC synchronized by default > > > > > (including new hotplugged vCPUs) and not do any tsc synchronization beyond that. > > > > > (Trying to guess when the guest syncs the TSC can cause more harm that good). > > > > > > > > > > Besides, Linux guests don't sync the TSC via IA32_TSC write, > > > > > but rather use IA32_TSC_ADJUST which currently doesn't participate > > > > > in the tsc sync heruistics. > > > > > > > > Linux should not try to sync the TSC with IA32_TSC_ADJUST. It expects > > > > the BIOS to boot with synced TSCs. > > > > > > > > So i wonder what is making it attempt TSC sync in the first place? > > > > > > CPU hotplug. And the guest doesn't really write to TSC_ADJUST > > > since it's measurement code doesn't detect any tsc warps. > > > > > > I was just thinking that in theory since, this is a VM, and it can be > > > interrupted at any point, the measurement code should sometimes fall, > > > and cause trouble. > > > I didn't do much homework on this so I might be overreacting. > > > > That is true (and you can see it with a CPU starved guest). > > > > > As far as I see X86_FEATURE_TSC_RELIABLE was done mostly to support > > > running under Hyper-V and VMWARE, and these should be prone to similar > > > issues, supporting my theory. > > > > > > > (one might also want to have Linux's synchronization via IA32_TSC_ADJUST > > > > working, but it should not need to happen in the first place, as long as > > > > QEMU and KVM are behaving properly). > > > > > > > > > And as far as I know, Linux guest is the primary (only?) user of the kvmclock. > > > > > > > > Only AFAIK. > > > > > > > > > I *do think* however that we should redefine KVM_CLOCK_TSC_STABLE > > > > > in the documentation to state that it only guarantees invariance if the guest > > > > > doesn't mess with its own TSC. > > > > > > > > > > Also I think we should consider enabling the X86_FEATURE_TSC_RELIABLE > > > > > in the guest kernel, when kvm is detected to avoid the guest even from trying > > > > > to sync TSC on newly hotplugged vCPUs. > > > > > > > > See 7539b174aef405d9d57db48c58390ba360c91312. > > > > > > I know about this, and I personally always use invtsc > > > with my VMs. > > > > Well, we can't make it (-cpu xxx,+invtsc) the default if vm-stop/vm-cont are unstable > > with TSC! > > Could you elaborate on this too? Are you referring to the same issue you > had mentioned about the overflow in the kernel time accounting? Well, any issue that could show up. > > > > Was hoping to make that (-cpu xxx,+invtsc) the default in QEMU once invariant TSC code > > > > becomes stable. Should be tested enough by now? > > > > > > The issue is that Qemu blocks migration when invtsc is set, based on the > > > fact that the target machine might have different TSC frequency and no > > > support for TSC scaling. > > > There was a long debate on this long ago. > > > > Oh right. > > > > > It is possible though to override this by specifying the exact frequency > > > you want the guest TSC to run at, by using something like > > > (tsc-frequency=3500000000) > > > I haven't checked if libvirt does this or not. > > > > It does. > Cool. > > > > > I do think that as long as the user uses modern CPUs (which have stable TSC > > > and support TSC scaling), there is no reason to disable invtsc, and > > > therefore no reason to use kvmclock. > > > > Yep. TSC is faster. > > Also this bit is sometimes used by userspace tools. Yep! SAP HANA as well. > Some time ago I found out that fio uses it to decide whether > to use TSC for measurements. > > I didn't know this and was running fio in a guest without 'invtsc'. > Fio switched to plain gettimeofday behind my back > and totally screwed up the results. > > > > > > > > (The guest doesn't end up touching TSC_ADJUST usually, but it still might > > > > > in some cases due to scheduling of guest vCPUs) > > > > > > > > > > (X86_FEATURE_TSC_RELIABLE short circuits tsc synchronization on CPU hotplug, > > > > > and TSC clocksource watchdog, and the later we might want to keep). > > > > > > > > The latter we want to keep. > > > > > > > > > For host TSC writes, just as Paulo proposed we can still do the tsc sync, > > > > > unless the new code that I implemented is in use. > > > > > > > > So Paolo's proposal is to > > > > > > > > "- for live migration, userspace is expected to use the new > > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a > > > > (nanosecond, TSC, TSC_ADJUST) tuple." > > > > > > > > Makes sense, so that no time between KVM_SET_TSC and > > > > MSR_WRITE(TSC_ADJUST) elapses, which would cause the TSC to go out > > > > of what is desired by the user. > > > > > > > > Since you are proposing this new ioctl, perhaps its useful to also > > > > reduce the 100ms jump? > > > > > > Yep. As long as target and destantion clocks are synchronized, > > > it should make it better. > > > > > > > "- for live migration, userspace is expected to use the new > > > > KVM_GET/SET_TSC_PRECISE (or whatever the name will be) to get/set a > > > > (nanosecond, TSC, TSC_ADJUST) tuple. This value will be written > > > > to the guest before the first VM-entry" > > > > > > > > Sounds like a good idea (to integrate the values in a tuple). > > > > > > > > > Few more random notes: > > > > > > > > > > I have a weird feeling about using 'nsec since 1 January 1970'. > > > > > Common sense is telling me that a 64 bit value can hold about 580 years, > > > > > but still I see that it is more common to use timespec which is a (sec,nsec) pair. > > > > > > > > struct timespec { > > > > time_t tv_sec; /* seconds */ > > > > long tv_nsec; /* nanoseconds */ > > > > }; > > > > > > > > > I feel that 'kvm_get_walltime' that I added is a bit of a hack. > > > > > Some refactoring might improve things here. > > > > > > > > Haven't read the patchset yet... > > > > > > > > > For example making kvm_get_walltime_and_clockread work in non tsc case as well > > > > > might make the code cleaner. > > > > > > > > > > Patches to enable this feature in qemu are in process of being sent to > > > > > qemu-devel mailing list. > > > > > > > > > > Best regards, > > > > > Maxim Levitsky > > > > > > > > > > Maxim Levitsky (2): > > > > > KVM: x86: implement KVM_SET_TSC_PRECISE/KVM_GET_TSC_PRECISE > > > > > KVM: x86: introduce KVM_X86_QUIRK_TSC_HOST_ACCESS > > > > > > > > > > Documentation/virt/kvm/api.rst | 56 +++++++++++++++++++++ > > > > > arch/x86/include/uapi/asm/kvm.h | 1 + > > > > > arch/x86/kvm/x86.c | 88 +++++++++++++++++++++++++++++++-- > > > > > include/uapi/linux/kvm.h | 14 ++++++ > > > > > 4 files changed, 154 insertions(+), 5 deletions(-) > > > > > > > > > > -- > > > > > 2.26.2 > > > > > > > > > > > Best regards, > > > Maxim Levitsky > > > > > > Best regards, > Maxim Levitsky