Hi,

John Stultz <john.stultz@xxxxxxxxxx> wrote on Tue, Jun 2, 2020 at 4:39 AM:
>
> On Sat, May 30, 2020 at 3:30 AM Jun Li <lijun.kernel@xxxxxxxxx> wrote:
> >
> > Hi John,
> >
> > John Stultz <john.stultz@xxxxxxxxxx> wrote on Sat, May 30, 2020 at 12:02 PM:
> > >
> > > I've recently (since 5.7-rc1) started noticing very rare hangs
> > > pretty early in bootup on my HiKey960 board.
> > >
> > > They have been particularly difficult to debug, as the system
> > > seems to not respond at all to sysrq- commands. However, the
> > > system is alive as I'll occasionally see firmware loading timeout
> > > errors after a while. Adding changes like initcall_debug and
> > > lockdep wasn't informative, as they tended to cause the problem
> > > to hide.
> > >
> > > I finally tried to dig in a bit more on this today, and noticed
> > > that the last dmesg output before the hang was usually:
> > >   "random: crng init done"
> > >
> > > So I dumped the stack at that point, and saw it was being called
> > > from the pl061 gpio irq, and the hang always occurred when the
> > > crng init finished on cpu 0. Instrumenting that more I could see
> > > that when the issue triggered, we were getting a stream of irqs.
> > >
> > > Chasing further, I found the screaming irq was for the rt1711h,
> > > and narrowed down that we were hitting the !chip->tcpci check
> > > which immediately returns IRQ_HANDLED, but does not stop the
> > > irq from triggering immediately afterwards.
> > >
> > > This patch slightly reworks the logic, so if we hit the irq
> > > before the chip->tcpci has been assigned, we still read and
> > > write the alert register, but just skip calling tcpci_irq().
> > >
> > > With this change, I haven't managed to trip over the problem
> > > (though it hasn't been super long - but I did confirm I hit
> > > the error case and it didn't hang the system).
> > >
> > > I still have some concern that I don't know why this cropped
> > > up since 5.7-rc, as there haven't been any changes to the
> > > driver since 5.4 (or before). It may just be the initialization
> > > timing has changed due to something else, and it's just exposed
> > > this issue? I'm not sure, and that's not super reassuring.
> > >
> > > Anyway, I'd love to hear your thoughts if this looks like a sane
> > > fix or not.
> >
> > I think a better solution may be to move the irq request after the
> > port is registered; we should fire the irq after everything is set up.
> > Does the change below work for you?
>
> Unfortunately the patch didn't seem to apply, but I recreated it by
> hand. I agree this looks like it should address the issue and I've not
> managed to trigger the problem in my (admittedly somewhat brief)
> attempts at testing.
>
> Thanks for sending it out. Do you want to submit the patch and I'll
> provide a Tested-by tag, or would it help for me to submit your
> suggested change?

OK, I will send out a patch against Greg's tree.

Li Jun

>
> thanks
> -john