[ kvm-Bugs-2627272 ] Fix preempt in kvm_write_guest_time and kvm_write_wall_clock

"SourceForge.net" <noreply@xxxxxxxxxxxxxxx> · Sun, 22 Feb 2009 09:55:53 +0000

Bugs item #2627272, was opened at 2009-02-22 04:55
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: kernel
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Matt T. Yourst (yourst)
Assigned to: Nobody/Anonymous (nobody)
Summary: Fix preempt in kvm_write_guest_time and kvm_write_wall_clock

Initial Comment:
This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64) with PREEMPT enabled.

We're getting syslog warnings like this many (but not all) times qemu tells KVM to run the VCPU:

BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-x86/28938
caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit
Call Trace:
  debug_smp_processor_id+0xf7/0x100
  kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
  ? __wake_up+0x4e/0x70
  ? wake_futex+0x27/0x40
  kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm]
  enqueue_hrtimer+0x8a/0x110
  _spin_unlock_irqrestore+0x27/0x50
  vfs_ioctl+0x31/0xa0
  do_vfs_ioctl+0x74/0x480
  sys_futex+0xb4/0x140
  sys_ioctl+0x99/0xa0
  system_call_fastpath+0x16/0x1b

As it turns out, the call trace is messed up due to gcc's inlining, but I isolated the problem anyway: kvm_write_guest_time() is being used in a non-thread-safe manner on preemptable kernels.

Basically kvm_write_guest_time()'s body needs to be surrounded by preempt_disable() and preempt_enable(), since the kernel won't let us query any per-CPU data (indirectly using smp_processor_id()) without preemption disabled. The attached patch fixes this issue by disabling preemption inside
kvm_write_guest_time().

Unfortunately, that's not an ideal solution from a correctness standpoint - it might satisfy the kernel's sanity checks, but if the thread running this code gets preempted and rescheduled onto a different host CPU with a different cpu_tsc_khz between when we read cpu_tsc_khz and when we actually enter the
guest via VT/SVM (i.e. after re-enabling preemption beofre leaving kvm_write_guest_time()), the guest could be using the wrong value of cpu_tsc_khz for the CPU it's actually running on.

This situation can only occur on CPUs without constant_tsc support, where the TSC doesn't always increment at the maximum cpu_tsc_khz, and/or where different cores have asymmetric values of cpu_tsc_khz.

Fortunately all of the current Intel/AMD VT/SVM capable chips implement constant_tsc and all cores always have the TSC frequency, so this is a moot point in practice.

Above and beyond this problem, there's still a related race condition in kvm_write_wall_clock(): why was "static int version" used in something that has to be thread safe if multiple VCPUs do a wall clock update exactly at the same time? We've actually seen this race happen a few times on heavily loaded
systems running kvm-84.

The "version" counter needs to be updated with atomic_add_and_return() for this to work reliably. The attached patch also incorporates this fix.

The solution I used may be overly complex for some real guests, but AFAIK Linux (at least) expects to see two adjacent version sequence numbers, so "atomic_add_and_return(2, &wall_clock_version)) - 2" is required to get the math right, instead of the simpler approach of just doing two separate atomic
increments.

There may still be some similar problems with the new kvm-84 paravirt time related code not being thread safe, but these were the most obvious ones.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html