I'd prefer a slight change in subject and topic:

------- 8< --------
Subject: [PATCH] kvm/x86: Avoid async PF preempting the kernel incorrectly

Currently, in a PREEMPT_COUNT=n kernel, kvm_async_pf_task_wait() could
call schedule() to reschedule in some cases.  This could result in
accidentally ending the current RCU read-side critical section early,
causing random memory corruption in the guest, or otherwise preempting
the currently running task between preempt_disable() and
preempt_enable().

This is difficult to handle well because, with PREEMPT_COUNT=n, we
don't know whether an async PF was delivered in a preemptible section
or in an RCU read-side critical section, since
preempt_disable()/preempt_enable() and
rcu_read_lock()/rcu_read_unlock() are both no-ops in that case.

To cure this, we treat any async PF interrupting a kernel context as
one that cannot be preempted, preventing kvm_async_pf_task_wait() from
choosing the schedule() path in that case.

To do so, a second parameter for kvm_async_pf_task_wait() is
introduced, so that we know whether it's called from a context
interrupting the kernel, and the parameter is set properly at all the
call sites.

Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
Cc: stable@xxxxxxxxxxxxxxx
Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
------- 8< --------

I'll let Radim pick up the patch.

Paolo

On 03/10/2017 15:36, Boqun Feng wrote:
> Currently, in PREEMPT=n kernel, kvm_async_pf_task_wait() could call
> schedule() to reschedule in some cases, which could result in
> accidentally ending the current RCU read-side critical section early.
> And this could end up with random memory corruption in the guest.
>
> The difficulty to handle this well is because we don't know whether an
> async PF delivered in a RCU read-side critical section for
> PREEMPT_COUNT=n kernel, since rcu_read_lock/unlock() are just no-ops in
> that case.
>
> To cure this, we treat any async PF interrupting a kernel context as one
> delivered in a RCU read-side critical section, and we don't allow
> kvm_async_pf_task_wait() to choose schedule path in that case for
> PREEMPT_COUNT=n kernel, because that will introduce unvolunteerly
> context switches and break the assumption for RCU to work properly.
>
> To do so, a second parameter for kvm_async_pf_task_wait() is introduced,
> so that we know whether it's called from a context interrupting the
> kernel, and we set that parameter properly in all the callsites.
>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
> ---
>  arch/x86/include/asm/kvm_para.h |  4 ++--
>  arch/x86/kernel/kvm.c           | 14 ++++++++++----
>  arch/x86/kvm/mmu.c              |  2 +-
>  3 files changed, 13 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index bc62e7cbf1b1..59ad3d132353 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -88,7 +88,7 @@ static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
>  bool kvm_para_available(void);
>  unsigned int kvm_arch_para_features(void);
>  void __init kvm_guest_init(void);
> -void kvm_async_pf_task_wait(u32 token);
> +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel);
>  void kvm_async_pf_task_wake(u32 token);
>  u32 kvm_read_and_reset_pf_reason(void);
>  extern void kvm_disable_steal_time(void);
> @@ -103,7 +103,7 @@ static inline void kvm_spinlock_init(void)
>
>  #else /* CONFIG_KVM_GUEST */
>  #define kvm_guest_init() do {} while (0)
> -#define kvm_async_pf_task_wait(T) do {} while(0)
> +#define kvm_async_pf_task_wait(T, I) do {} while(0)
>  #define kvm_async_pf_task_wake(T) do {} while(0)
>
>  static inline bool kvm_para_available(void)
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index e675704fa6f7..8bb9594d0761 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -117,7 +117,11 @@ static struct kvm_task_sleep_node *_find_apf_task(struct kvm_task_sleep_head *b,
>  	return NULL;
>  }
>
> -void kvm_async_pf_task_wait(u32 token)
> +/*
> + * @interrupt_kernel: Is this called from a routine which interrupts the kernel
> + * 		      (other than user space)?
> + */
> +void kvm_async_pf_task_wait(u32 token, int interrupt_kernel)
>  {
>  	u32 key = hash_32(token, KVM_TASK_SLEEP_HASHBITS);
>  	struct kvm_task_sleep_head *b = &async_pf_sleepers[key];
> @@ -140,8 +144,10 @@ void kvm_async_pf_task_wait(u32 token)
>
>  	n.token = token;
>  	n.cpu = smp_processor_id();
> -	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> -		   rcu_preempt_depth();
> +	n.halted = is_idle_task(current) ||
> +		   (IS_ENABLED(CONFIG_PREEMPT_COUNT)
> +		    ? preempt_count() > 1 || rcu_preempt_depth()
> +		    : interrupt_kernel);
>  	init_swait_queue_head(&n.wq);
>  	hlist_add_head(&n.link, &b->list);
>  	raw_spin_unlock(&b->lock);
> @@ -269,7 +275,7 @@ do_async_page_fault(struct pt_regs *regs, unsigned long error_code)
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		/* page is swapped out by the host. */
>  		prev_state = exception_enter();
> -		kvm_async_pf_task_wait((u32)read_cr2());
> +		kvm_async_pf_task_wait((u32)read_cr2(), !user_mode(regs));
>  		exception_exit(prev_state);
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index eca30c1eb1d9..106d4a029a8a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -3837,7 +3837,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>  	case KVM_PV_REASON_PAGE_NOT_PRESENT:
>  		vcpu->arch.apf.host_apf_reason = 0;
>  		local_irq_disable();
> -		kvm_async_pf_task_wait(fault_address);
> +		kvm_async_pf_task_wait(fault_address, 0);
>  		local_irq_enable();
>  		break;
>  	case KVM_PV_REASON_PAGE_READY:
>
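
For illustration only (this is not part of the patch), the new n.halted
decision in the kvm.c hunk above can be modelled in user space roughly as
follows.  The function apf_must_halt() and its parameters are hypothetical
names; each parameter merely stands in for the kernel expression named in
its comment, and the model assumes nothing beyond what the hunk shows.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space model of the n.halted expression in the patch.
 * None of these names exist in the kernel.
 *
 * idle_task        stands in for is_idle_task(current)
 * preempt_count_on stands in for IS_ENABLED(CONFIG_PREEMPT_COUNT)
 * preempt_count    stands in for preempt_count()
 * rcu_depth        stands in for rcu_preempt_depth()
 * interrupt_kernel stands in for the new second argument,
 *                  e.g. !user_mode(regs) in do_async_page_fault()
 *
 * Returns true when the waiter has to halt instead of calling schedule().
 */
static bool apf_must_halt(bool idle_task, bool preempt_count_on,
			  int preempt_count, int rcu_depth,
			  bool interrupt_kernel)
{
	/* The idle task can never be scheduled away. */
	if (idle_task)
		return true;

	/*
	 * With PREEMPT_COUNT=y the counters are reliable, so they alone
	 * decide, exactly as before the patch.
	 */
	if (preempt_count_on)
		return preempt_count > 1 || rcu_depth != 0;

	/*
	 * With PREEMPT_COUNT=n the counters carry no information, so any
	 * async PF that interrupted the kernel is treated as not
	 * preemptible.
	 */
	return interrupt_kernel;
}
```

In this model, a PREEMPT_COUNT=n fault that interrupted user space may
still take the schedule() path, while one that interrupted the kernel
always halts.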