Re: [PATCH] kvm/x86: Handle async PF in RCU read-side critical sections

Boqun Feng <boqun.feng@xxxxxxxxx> · Sat, 30 Sep 2017 07:41:56 +0800

On Fri, Sep 29, 2017 at 04:43:39PM +0000, Paul E. McKenney wrote:
> On Fri, Sep 29, 2017 at 04:53:57PM +0200, Paolo Bonzini wrote:
> > On 29/09/2017 13:01, Boqun Feng wrote:
> > > Sasha Levin reported a WARNING:
> > > 
> > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > | rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > | WARNING: CPU: 0 PID: 6974 at kernel/rcu/tree_plugin.h:329
> > > | rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > ...
> > > | CPU: 0 PID: 6974 Comm: syz-fuzzer Not tainted 4.13.0-next-20170908+ #246
> > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > | 1.10.1-1ubuntu1 04/01/2014
> > > | Call Trace:
> > > ...
> > > | RIP: 0010:rcu_preempt_note_context_switch kernel/rcu/tree_plugin.h:329 [inline]
> > > | RIP: 0010:rcu_note_context_switch+0x16c/0x2210 kernel/rcu/tree.c:458
> > > | RSP: 0018:ffff88003b2debc8 EFLAGS: 00010002
> > > | RAX: 0000000000000001 RBX: 1ffff1000765bd85 RCX: 0000000000000000
> > > | RDX: 1ffff100075d7882 RSI: ffffffffb5c7da20 RDI: ffff88003aebc410
> > > | RBP: ffff88003b2def30 R08: dffffc0000000000 R09: 0000000000000001
> > > | R10: 0000000000000000 R11: 0000000000000000 R12: ffff88003b2def08
> > > | R13: 0000000000000000 R14: ffff88003aebc040 R15: ffff88003aebc040
> > > | __schedule+0x201/0x2240 kernel/sched/core.c:3292
> > > | schedule+0x113/0x460 kernel/sched/core.c:3421
> > > | kvm_async_pf_task_wait+0x43f/0x940 arch/x86/kernel/kvm.c:158
> > > | do_async_page_fault+0x72/0x90 arch/x86/kernel/kvm.c:271
> > > | async_page_fault+0x22/0x30 arch/x86/entry/entry_64.S:1069
> > > | RIP: 0010:format_decode+0x240/0x830 lib/vsprintf.c:1996
> > > | RSP: 0018:ffff88003b2df520 EFLAGS: 00010283
> > > | RAX: 000000000000003f RBX: ffffffffb5d1e141 RCX: ffff88003b2df670
> > > | RDX: 0000000000000001 RSI: dffffc0000000000 RDI: ffffffffb5d1e140
> > > | RBP: ffff88003b2df560 R08: dffffc0000000000 R09: 0000000000000000
> > > | R10: ffff88003b2df718 R11: 0000000000000000 R12: ffff88003b2df5d8
> > > | R13: 0000000000000064 R14: ffffffffb5d1e140 R15: 0000000000000000
> > > | vsnprintf+0x173/0x1700 lib/vsprintf.c:2136
> > > | sprintf+0xbe/0xf0 lib/vsprintf.c:2386
> > > | proc_self_get_link+0xfb/0x1c0 fs/proc/self.c:23
> > > | get_link fs/namei.c:1047 [inline]
> > > | link_path_walk+0x1041/0x1490 fs/namei.c:2127
> > > ...
> > > 
> > > And this happened when we hit a page fault in an RCU read-side critical
> > > section and then we tried to reschedule in kvm_async_pf_task_wait(),
> > > this reschedule would hit the WARN in rcu_preempt_note_context_switch(),
> > > and be treated as a sleep in RCU read-side critical section, which is
> > > not allowed(even in preemptible RCU).
> > 
> > Just a small fix to the commit message:
> > 
> > This happened when the host hit a page fault, and delivered it as in an
> > async page fault, while the guest was in an RCU read-side critical
> > section.  The guest then tries to reschedule in kvm_async_pf_task_wait(),
> > but rcu_preempt_note_context_switch() would treat the reschedule as a
> > sleep in RCU read-side critical section, which is not allowed (even in
> > preemptible RCU).  Thus the WARN.
> > 
> > Queued with that change, thanks.
> 
> Not to be repetitive, but if the schedule() is on the guest, this change
> really does silently break up an RCU read-side critical section on
> guests built with PREEMPT=n.  (Yes, they were already being broken,
> but it would be good to avoid this breakage in PREEMPT=n as well as
> in PREEMPT=y.)
> 

Then probably adding !IS_ENABLED(CONFIG_PREEMPT) as one of the reason we
choose the halt path? Like:

	n.halted = is_idle_task(current) || preempt_count() > 1 ||
		   !IS_ENABLED(CONFIG_PREEMPT) || rcu_preempt_depth();

But I think async PF could also happen while a user program is running?
Then maybe add a second parameter @user for kvm_async_pf_task_wait(),
like:

	kvm_async_pf_task_wait((u32)read_cr2(), user_mode(regs));

and the halt condition becomes:

	n.halted = is_idle_task(current) || preempt_count() > 1 ||
		   (!IS_ENABLED(CONFIG_PREEMPT) && !user) || rcu_preempt_depth();

Thoughts?

A side thing is being broken already for PREEMPT=n means we maybe fail
to detect this in rcutorture? Then should we add a config with
KVM_GUEST=y and try to run some memory consuming things(e.g. stress
--vm) in the rcutorture kvm script simultaneously? Paolo, do you have
any test workload that could trigger async PF quickly?

Regards,
Boqun

> 							Thanx, Paul
> 
> > Paolo
> > 
> > > To cure this, make kvm_async_pf_task_wait() go to the halt path if the
> > > PF happens in a RCU read-side critical section.
> > > 
> > > Reported-by: Sasha Levin <levinsasha928@xxxxxxxxx>
> > > Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxxxxxxxxxx>
> > > Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> > > Signed-off-by: Boqun Feng <boqun.feng@xxxxxxxxx>
> > > ---
> > >  arch/x86/kernel/kvm.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > > index aa60a08b65b1..e675704fa6f7 100644
> > > --- a/arch/x86/kernel/kvm.c
> > > +++ b/arch/x86/kernel/kvm.c
> > > @@ -140,7 +140,8 @@ void kvm_async_pf_task_wait(u32 token)
> > >  
> > >  	n.token = token;
> > >  	n.cpu = smp_processor_id();
> > > -	n.halted = is_idle_task(current) || preempt_count() > 1;
> > > +	n.halted = is_idle_task(current) || preempt_count() > 1 ||
> > > +		   rcu_preempt_depth();
> > >  	init_swait_queue_head(&n.wq);
> > >  	hlist_add_head(&n.link, &b->list);
> > >  	raw_spin_unlock(&b->lock);
> > > 
> > 
> 
Attachment:
signature.asc

Description: PGP signature