[ kvm-Bugs-2627272 ] Fix preempt in kvm_write_guest_time and kvm_write_wall_clock

"SourceForge.net" <noreply@xxxxxxxxxxxxxxx> · Mon, 18 May 2009 22:40:30 +0000

Bugs item #2627272, was opened at 2009-02-22 05:55
Message generated for change (Comment added) made by mtosatti
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: kernel
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: Matt T. Yourst (yourst)
Assigned to: Nobody/Anonymous (nobody)
Summary: Fix preempt in kvm_write_guest_time and kvm_write_wall_clock

Initial Comment:
This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64) with PREEMPT enabled.

We're getting syslog warnings like this many (but not all) times qemu tells KVM to run the VCPU:

BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-x86/28938
caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit
Call Trace:
  debug_smp_processor_id+0xf7/0x100
  kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm]
  ? __wake_up+0x4e/0x70
  ? wake_futex+0x27/0x40
  kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm]
  enqueue_hrtimer+0x8a/0x110
  _spin_unlock_irqrestore+0x27/0x50
  vfs_ioctl+0x31/0xa0
  do_vfs_ioctl+0x74/0x480
  sys_futex+0xb4/0x140
  sys_ioctl+0x99/0xa0
  system_call_fastpath+0x16/0x1b

As it turns out, the call trace is messed up due to gcc's inlining, but I isolated the problem anyway: kvm_write_guest_time() is being used in a non-thread-safe manner on preemptable kernels.

Basically kvm_write_guest_time()'s body needs to be surrounded by preempt_disable() and preempt_enable(), since the kernel won't let us query any per-CPU data (indirectly using smp_processor_id()) without preemption disabled. The attached patch fixes this issue by disabling preemption inside
kvm_write_guest_time().

Unfortunately, that's not an ideal solution from a correctness standpoint - it might satisfy the kernel's sanity checks, but if the thread running this code gets preempted and rescheduled onto a different host CPU with a different cpu_tsc_khz between when we read cpu_tsc_khz and when we actually enter the
guest via VT/SVM (i.e. after re-enabling preemption beofre leaving kvm_write_guest_time()), the guest could be using the wrong value of cpu_tsc_khz for the CPU it's actually running on.

This situation can only occur on CPUs without constant_tsc support, where the TSC doesn't always increment at the maximum cpu_tsc_khz, and/or where different cores have asymmetric values of cpu_tsc_khz.

Fortunately all of the current Intel/AMD VT/SVM capable chips implement constant_tsc and all cores always have the TSC frequency, so this is a moot point in practice.

Above and beyond this problem, there's still a related race condition in kvm_write_wall_clock(): why was "static int version" used in something that has to be thread safe if multiple VCPUs do a wall clock update exactly at the same time? We've actually seen this race happen a few times on heavily loaded
systems running kvm-84.

The "version" counter needs to be updated with atomic_add_and_return() for this to work reliably. The attached patch also incorporates this fix.

The solution I used may be overly complex for some real guests, but AFAIK Linux (at least) expects to see two adjacent version sequence numbers, so "atomic_add_and_return(2, &wall_clock_version)) - 2" is required to get the math right, instead of the simpler approach of just doing two separate atomic
increments.

There may still be some similar problems with the new kvm-84 paravirt time related code not being thread safe, but these were the most obvious ones.

----------------------------------------------------------------------

>Comment By: Marcelo Tosatti (mtosatti)
Date: 2009-05-18 18:40

Message:
Fixed, thanks for the report.

----------------------------------------------------------------------

Comment By: Glauber de Oliveira Costa (glommer)
Date: 2009-02-25 17:40

Message:
It does not matter if the version is increasing or decreasing. Only if it's
odd/even.

Also, note that wall_clock happens ONCE at boot time. So it should not be
prone to the 
race condition that you described. Are you sure you've seen this race, and
not something else?
In kvm_write_guest_time, we have to be much more careful.

----------------------------------------------------------------------

Comment By: Marcelo Tosatti (mtosatti)
Date: 2009-02-22 15:20

Message:
Hi Matt,

> Basically kvm_write_guest_time()'s body needs to be surrounded by
> preempt_disable() and preempt_enable(), since the kernel won't let us
query
> any per-CPU data (indirectly using smp_processor_id()) without
preemption
> disabled. The attached patch fixes this issue by disabling preemption
> inside kvm_write_guest_time().

Can you submit a separate patch to kvm-devel@xxxxxxxxxxxxxxx to surround
kvm_write_guest_time with preempt_disable/preempt_enable ? See below.

> Unfortunately, that's not an ideal solution from a correctness
standpoint -
> it might satisfy the kernel's sanity checks, but if the thread running
this
> code gets preempted and rescheduled onto a different host CPU with a
> different cpu_tsc_khz between when we read cpu_tsc_khz and when we
actually
> enter the guest via VT/SVM (i.e. after re-enabling preemption beofre
leaving
> kvm_write_guest_time()), the guest could be using the wrong value of
> cpu_tsc_khz for the CPU it's actually running on.

The KVM_REQ_KVMCLOCK_UPDATE bit will be set in vcpu->requests if the
frequency
changes during this window, so its safe (or at least supposed to be).

> Above and beyond this problem, there's still a related race condition
in
> kvm_write_wall_clock(): why was "static int version" used in something
that
> has to be thread safe if multiple VCPUs do a wall clock update exactly
at
> the same time? We've actually seen this race happen a few times on
heavily
> loaded systems running kvm-84.

I guess that all that matters to the guest are increasing values. But it
does looks racy.
In case it is not racy, it should at least be explicit on _why_. Glauber,
can you comment 
on this?

Matt, what was the symptom you experienced with kvm-84? Can you please
provide
more details?

Thanks 

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html