Hi, On Mon, Jun 24, 2024 at 5:12 AM Johan Hovold <johan@xxxxxxxxxx> wrote: > > On Mon, Jun 10, 2024 at 03:24:25PM -0700, Douglas Anderson wrote: > > On devices using Qualcomm's GENI UART it is possible to get the UART > > stuck such that it no longer outputs data. Specifically, logging in > > via an agetty on the debug serial port (which was _not_ used for > > kernel console) and running: > > cat /var/log/messages > > ...and then (via an SSH session) forcing a few suspend/resume cycles > > causes the UART to stop transmitting. > > An easier way to trigger this old bug is to just run a command like > dmesg and hit ctrl-s in a serial console to stop tx. Interrupting the > command or hitting ctrl-q to restart tx then triggers the soft lockup. > > > The root of the problems was with qcom_geni_serial_stop_tx_fifo() > > which is called as part of the suspend process. Specific problems with > > that function: > > - When an in-progress "tx" command is cancelled it doesn't appear to > > fully drain the FIFO. That meant qcom_geni_serial_tx_empty() > > continued to report that the FIFO wasn't empty. The > > qcom_geni_serial_start_tx_fifo() function didn't re-enable > > interrupts in this case so the driver would never start transferring > > again. > > - When the driver cancelled the current "tx" command but it forgot to > > zero out "tx_remaining". This confused logic elsewhere in the > > driver. > > - From experimentation, it appears that cancelling the "tx" command > > could drop some of the queued up bytes. > > > > While qcom_geni_serial_stop_tx_fifo() could be fixed to drain the FIFO > > and shut things down properly, stop_tx() isn't supposed to be a slow > > function. It is run with local interrupts off and is documented to > > stop transmitting "as soon as possible". Change the function to just > > stop new bytes from being queued. In order to make this work, change > > qcom_geni_serial_start_tx_fifo() to remove some conditions. It's > > always safe to enable the watermark interrupt and the IRQ handler will > > disable it if it's not needed. > > > > For system suspend the queue still needs to be drained. Failure to do > > so means that the hardware won't provide new interrupts until a > > "cancel" command is sent. Add draining logic (fixing the issues noted > > above) at suspend time. > > So I spent the better part of the weekend looking at this driver and > this is one of the bits I worry about with your approach as relying on > draining anything won't work with hardware flow control. > > Cancelling commands can result stalled TX in a number of ways and > there's still at least one that you don't handle. If you end up with > data in in the FIFO, the watermark interrupt may never fire when you try > to restart tx. Ah, that's a good call. Right now it doesn't really happen since people tend to hook up the debug UART without flow control lines (as far as I've seen), but it's good to make sure it works. > I'm leaning towards fixing the immediate hard lockup regression > separately and then we can address the older bugs and rework driver > without having to rush things. Yeah, that's fair. I've responded to your patch with a counter-proposal to fix the hard lockup regression, but I agree that should take priority. > I've prepared a minimal three patch series which fixes most of the > discussed issues (hard and soft lockup and garbage characters) and that > should be backportable as well. > > Currently, the diffstat is just: > > drivers/tty/serial/qcom_geni_serial.c | 36 +++++++++++++++++++++++++----------- > 1 file changed, 25 insertions(+), 11 deletions(-) I'll respond more in dept to your patches, but I suspect that your patch series won't fix the issues that Nícolas reported [1]. I also tested and your patch series doesn't fix the kdb issue talked about in my patch #8. Part of my reworking of stuff also changed the way that the console and the polling commands worked since they were pretty broken. Your series doesn't touch them. We'll probably need something in-between taking advantage of some of the stuff you figured out with "cancel" but also doing a bigger rework than you did. [1] https://lore.kernel.org/r/46f57349-1217-4594-85b2-84fa3a365c0c@notapiano