Re: [PATCH] ALSA: seq: Fix RCU stall in snd_seq_write()

Zqiang <qiang.zhang1211@xxxxxxxxx> · Tue, 2 Nov 2021 17:41:57 +0800

On 2021/11/2 4:33 PM, Takashi Iwai wrote:
On Tue, 02 Nov 2021 04:32:22 +0100,
Zqiang wrote:
If we have a lot of cell objects, this loop may take a long time and
trigger an RCU stall. Insert a conditional reschedule point to fix it.

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 	1-....: (1 GPs behind) idle=9f5/1/0x4000000000000000
	softirq=16474/16475 fqs=4916
	(t=10500 jiffies g=19249 q=192515)
NMI backtrace for cpu 1
......
asm_sysvec_apic_timer_interrupt
RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70
spin_unlock_irqrestore
snd_seq_prioq_cell_out+0x1dc/0x360
snd_seq_check_queue+0x1a6/0x3f0
snd_seq_enqueue_event+0x1ed/0x3e0
snd_seq_client_enqueue_event.constprop.0+0x19a/0x3c0
snd_seq_write+0x2db/0x510
vfs_write+0x1c4/0x900
ksys_write+0x171/0x1d0
do_syscall_64+0x35/0xb0

Reported-by: syzbot+bb950e68b400ab4f65f8@xxxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Zqiang <qiang.zhang1211@xxxxxxxxx>
---
  sound/core/seq/seq_queue.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/sound/core/seq/seq_queue.c b/sound/core/seq/seq_queue.c
index d6c02dea976c..f5b1e4562a64 100644
--- a/sound/core/seq/seq_queue.c
+++ b/sound/core/seq/seq_queue.c
@@ -263,6 +263,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
  		if (!cell)
  			break;
  		snd_seq_dispatch_event(cell, atomic, hop);
+		cond_resched();
  	}
  
  	/* Process time queue... */
@@ -272,6 +273,7 @@ void snd_seq_check_queue(struct snd_seq_queue *q, int atomic, int hop)
  		if (!cell)
  			break;
  		snd_seq_dispatch_event(cell, atomic, hop);
+		cond_resched();

It's good to have cond_resched() in those places but it must be done
more carefully, as the code path may be called from atomic context as
well as non-atomic context.  That is, it must have a check of the atomic
argument, and cond_resched() is applied only when atomic==false.

But I still wonder how this gets an RCU stall all of a sudden.  Looking
through https://syzkaller.appspot.com/bug?extid=bb950e68b400ab4f65f8
it's triggered by many cases since the end of September...

I did not find useful information in the log. Judging from the call trace,
I guess it may be triggered by the long loop time, which keeps the RCU
quiescent state from being reported in time.

I overlooked the atomic parameter check; I will resend v2. In non-atomic
context we can insert cond_resched() to avoid this situation, but in
atomic context the RCU stall may still be triggered.
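
For illustration only, here is a rough userspace sketch of the guard
discussed above: the loop offers a reschedule point only when the caller
is not in atomic context. It is not the v2 patch. The names struct cell,
struct queue, cell_out(), check_queue() and cond_resched_stub() are
hypothetical stand-ins, and sched_yield() merely imitates what
cond_resched() does in the kernel.

```c
#include <assert.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are not the kernel's definitions. */
struct cell { struct cell *next; };
struct queue { struct cell *head; };

static int resched_points;	/* how many times we offered to yield */

/* Userspace imitation of cond_resched(): just yield the CPU. */
static void cond_resched_stub(void)
{
	resched_points++;
	sched_yield();
}

/* Pop the first cell from the queue, or return NULL when it is empty. */
static struct cell *cell_out(struct queue *q)
{
	struct cell *c = q->head;

	if (c)
		q->head = c->next;
	return c;
}

/* Drain the queue; yield between cells only when the caller may sleep. */
static int check_queue(struct queue *q, bool atomic)
{
	int dispatched = 0;

	assert(q != NULL);
	for (;;) {
		struct cell *c = cell_out(q);

		if (!c)
			break;
		dispatched++;	/* stands in for snd_seq_dispatch_event() */
		if (!atomic)
			cond_resched_stub();
	}
	return dispatched;
}
```

With atomic set to true the loop never yields, so this guard alone does
not help a long loop in atomic context.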

thanks
Zqiang



thanks,

Takashi