Bugs item #2627272, was opened at 2009-02-22 09:55 Message generated for change (Comment added) made by glommer You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: kernel Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Matt T. Yourst (yourst) Assigned to: Nobody/Anonymous (nobody) Summary: Fix preempt in kvm_write_guest_time and kvm_write_wall_clock Initial Comment: This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64) with PREEMPT enabled. We're getting syslog warnings like this many (but not all) times qemu tells KVM to run the VCPU: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-x86/28938 caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit Call Trace: debug_smp_processor_id+0xf7/0x100 kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] ? __wake_up+0x4e/0x70 ? wake_futex+0x27/0x40 kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm] enqueue_hrtimer+0x8a/0x110 _spin_unlock_irqrestore+0x27/0x50 vfs_ioctl+0x31/0xa0 do_vfs_ioctl+0x74/0x480 sys_futex+0xb4/0x140 sys_ioctl+0x99/0xa0 system_call_fastpath+0x16/0x1b As it turns out, the call trace is messed up due to gcc's inlining, but I isolated the problem anyway: kvm_write_guest_time() is being used in a non-thread-safe manner on preemptable kernels. Basically kvm_write_guest_time()'s body needs to be surrounded by preempt_disable() and preempt_enable(), since the kernel won't let us query any per-CPU data (indirectly using smp_processor_id()) without preemption disabled. The attached patch fixes this issue by disabling preemption inside kvm_write_guest_time(). Unfortunately, that's not an ideal solution from a correctness standpoint - it might satisfy the kernel's sanity checks, but if the thread running this code gets preempted and rescheduled onto a different host CPU with a different cpu_tsc_khz between when we read cpu_tsc_khz and when we actually enter the guest via VT/SVM (i.e. after re-enabling preemption beofre leaving kvm_write_guest_time()), the guest could be using the wrong value of cpu_tsc_khz for the CPU it's actually running on. This situation can only occur on CPUs without constant_tsc support, where the TSC doesn't always increment at the maximum cpu_tsc_khz, and/or where different cores have asymmetric values of cpu_tsc_khz. Fortunately all of the current Intel/AMD VT/SVM capable chips implement constant_tsc and all cores always have the TSC frequency, so this is a moot point in practice. Above and beyond this problem, there's still a related race condition in kvm_write_wall_clock(): why was "static int version" used in something that has to be thread safe if multiple VCPUs do a wall clock update exactly at the same time? We've actually seen this race happen a few times on heavily loaded systems running kvm-84. The "version" counter needs to be updated with atomic_add_and_return() for this to work reliably. The attached patch also incorporates this fix. The solution I used may be overly complex for some real guests, but AFAIK Linux (at least) expects to see two adjacent version sequence numbers, so "atomic_add_and_return(2, &wall_clock_version)) - 2" is required to get the math right, instead of the simpler approach of just doing two separate atomic increments. There may still be some similar problems with the new kvm-84 paravirt time related code not being thread safe, but these were the most obvious ones. ---------------------------------------------------------------------- Comment By: Glauber de Oliveira Costa (glommer) Date: 2009-02-25 21:40 Message: It does not matter if the version is increasing or decreasing. Only if it's odd/even. Also, note that wall_clock happens ONCE at boot time. So it should not be prone to the race condition that you described. Are you sure you've seen this race, and not something else? In kvm_write_guest_time, we have to be much more careful. ---------------------------------------------------------------------- Comment By: Marcelo Tosatti (mtosatti) Date: 2009-02-22 19:20 Message: Hi Matt, > Basically kvm_write_guest_time()'s body needs to be surrounded by > preempt_disable() and preempt_enable(), since the kernel won't let us query > any per-CPU data (indirectly using smp_processor_id()) without preemption > disabled. The attached patch fixes this issue by disabling preemption > inside kvm_write_guest_time(). Can you submit a separate patch to kvm-devel@xxxxxxxxxxxxxxx to surround kvm_write_guest_time with preempt_disable/preempt_enable ? See below. > Unfortunately, that's not an ideal solution from a correctness standpoint - > it might satisfy the kernel's sanity checks, but if the thread running this > code gets preempted and rescheduled onto a different host CPU with a > different cpu_tsc_khz between when we read cpu_tsc_khz and when we actually > enter the guest via VT/SVM (i.e. after re-enabling preemption beofre leaving > kvm_write_guest_time()), the guest could be using the wrong value of > cpu_tsc_khz for the CPU it's actually running on. The KVM_REQ_KVMCLOCK_UPDATE bit will be set in vcpu->requests if the frequency changes during this window, so its safe (or at least supposed to be). > Above and beyond this problem, there's still a related race condition in > kvm_write_wall_clock(): why was "static int version" used in something that > has to be thread safe if multiple VCPUs do a wall clock update exactly at > the same time? We've actually seen this race happen a few times on heavily > loaded systems running kvm-84. I guess that all that matters to the guest are increasing values. But it does looks racy. In case it is not racy, it should at least be explicit on _why_. Glauber, can you comment on this? Matt, what was the symptom you experienced with kvm-84? Can you please provide more details? Thanks ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html