Bugs item #2627272, was opened at 2009-02-22 04:55 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: kernel Group: None Status: Open Resolution: None Priority: 5 Private: No Submitted By: Matt T. Yourst (yourst) Assigned to: Nobody/Anonymous (nobody) Summary: Fix preempt in kvm_write_guest_time and kvm_write_wall_clock Initial Comment: This issue just appeared in kvm-84 when running on 2.6.28.7 (x86-64) with PREEMPT enabled. We're getting syslog warnings like this many (but not all) times qemu tells KVM to run the VCPU: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-system-x86/28938 caller is kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] Pid: 28938, comm: qemu-system-x86 2.6.28.7-mtyrel-64bit Call Trace: debug_smp_processor_id+0xf7/0x100 kvm_arch_vcpu_ioctl_run+0x5d1/0xc70 [kvm] ? __wake_up+0x4e/0x70 ? wake_futex+0x27/0x40 kvm_vcpu_ioctl+0x2e9/0x5a0 [kvm] enqueue_hrtimer+0x8a/0x110 _spin_unlock_irqrestore+0x27/0x50 vfs_ioctl+0x31/0xa0 do_vfs_ioctl+0x74/0x480 sys_futex+0xb4/0x140 sys_ioctl+0x99/0xa0 system_call_fastpath+0x16/0x1b As it turns out, the call trace is messed up due to gcc's inlining, but I isolated the problem anyway: kvm_write_guest_time() is being used in a non-thread-safe manner on preemptable kernels. Basically kvm_write_guest_time()'s body needs to be surrounded by preempt_disable() and preempt_enable(), since the kernel won't let us query any per-CPU data (indirectly using smp_processor_id()) without preemption disabled. The attached patch fixes this issue by disabling preemption inside kvm_write_guest_time(). Unfortunately, that's not an ideal solution from a correctness standpoint - it might satisfy the kernel's sanity checks, but if the thread running this code gets preempted and rescheduled onto a different host CPU with a different cpu_tsc_khz between when we read cpu_tsc_khz and when we actually enter the guest via VT/SVM (i.e. after re-enabling preemption beofre leaving kvm_write_guest_time()), the guest could be using the wrong value of cpu_tsc_khz for the CPU it's actually running on. This situation can only occur on CPUs without constant_tsc support, where the TSC doesn't always increment at the maximum cpu_tsc_khz, and/or where different cores have asymmetric values of cpu_tsc_khz. Fortunately all of the current Intel/AMD VT/SVM capable chips implement constant_tsc and all cores always have the TSC frequency, so this is a moot point in practice. Above and beyond this problem, there's still a related race condition in kvm_write_wall_clock(): why was "static int version" used in something that has to be thread safe if multiple VCPUs do a wall clock update exactly at the same time? We've actually seen this race happen a few times on heavily loaded systems running kvm-84. The "version" counter needs to be updated with atomic_add_and_return() for this to work reliably. The attached patch also incorporates this fix. The solution I used may be overly complex for some real guests, but AFAIK Linux (at least) expects to see two adjacent version sequence numbers, so "atomic_add_and_return(2, &wall_clock_version)) - 2" is required to get the math right, instead of the simpler approach of just doing two separate atomic increments. There may still be some similar problems with the new kvm-84 paravirt time related code not being thread safe, but these were the most obvious ones. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2627272&group_id=180599 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html