On Tue, Aug 17, 2021 at 06:47:45PM +0000, David Chen wrote:
>
>
> > -----Original Message-----
> > From: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> > Sent: Monday, August 16, 2021 11:16 PM
> > To: David Chen <david.chen@xxxxxxxxxxx>
> > Cc: stable@xxxxxxxxxxxxxxx; Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>; neeraju@xxxxxxxxxxxxxx
> > Subject: Re: Request for backport fd6bc19d7676 to 4.14 and 4.19 branch
> >
> > On Mon, Aug 16, 2021 at 10:02:28PM +0000, David Chen wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> > > > Sent: Monday, August 16, 2021 12:31 PM
> > > > To: David Chen <david.chen@xxxxxxxxxxx>
> > > > Cc: stable@xxxxxxxxxxxxxxx; Paul E. McKenney
> > > > <paulmck@xxxxxxxxxxxxxxxxxx>; neeraju@xxxxxxxxxxxxxx
> > > > Subject: Re: Request for backport fd6bc19d7676 to 4.14 and 4.19 branch
> > > >
> > > > On Mon, Aug 16, 2021 at 07:19:34PM +0000, David Chen wrote:
> > > > > Hi Greg,
> > > > >
> > > > > We recently hit a hung task timeout issue in synchronize_rcu_expedited on
> > > > > 4.14 branch.
> > > > > The issue seems to be identical to the one described in `fd6bc19d7676
> > > > > rcu: Fix missed wakeup of exp_wq waiters` Can we backport it to 4.14 and
> > > > > 4.19 branch?
> > > > > The patch doesn't apply cleanly, but it should be trivial to resolve,
> > > > > just do this
> > > > >
> > > > > -	wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rsp->expedited_sequence) & 0x3]);
> > > > > +	wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);
> > > > >
> > > > > I don't know if we should do it for 4.9, because the handling of sequence
> > > > > number is a bit different.
> > > >
> > > > Please provide a working backport, me hand-editing patches does not scale,
> > > > and this way you get the proper credit for backporting it (after testing it).
> > >
> > > Sure, appended at the end.
> > >
> > > >
> > > > You have tested, this, right?
> > >
> > > I don't have a good repro for the original issue, so I only ran rcutorture and
> > > some basic work load test to see if anything obvious went wrong.
> >
> > Ideally you would be able to also hit this without the patch on the
> > older kernels, is this the case?
> >
> So far we've only seen this once. I was able to figure out the issue from the vmcore,
> but I haven't been able to reproduce this. I think the nature of the bug makes it
> very difficult to hit. It requires a race with synchronize_rcu_expedited but once
> the thread hangs, you can't call it again, because it might rescue the hung thread.

I would like a bit more verification that this is really needed, and
some acks from the developers/maintainers involved, before accepting
this change.

thanks,

greg k-h