On Fri, Jan 12, 2024 at 03:03:08PM -0800, Douglas Anderson wrote: > As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and > IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI > enabled then we'll use it to stop CPUs at panic time. This is nice, > but it does mean that there's a pretty good chance that we'll end up > stopping a CPU while it holds the port lock for the console > UART. Specifically, I see a CPU get stopped while holding the port > lock nearly 100% of the time on my sc7180-trogdor based Chromebook by > enabling the "buddy" hardlockup detector and then doing: > > sysctl -w kernel.hardlockup_all_cpu_backtrace=1 > sysctl -w kernel.hardlockup_panic=1 > echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT > > UART drivers are _supposed_ to handle this case OK and this is why > UART drivers check "oops_in_progress" and only do a "trylock" in that > case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a > very well-tested situation. > > Now that we're testing the situation a lot, it can be seen that the > Qualcomm GENI UART driver is pretty broken. Specifically, when I run > my test case and look at the console output I just see a bunch of > garbled output like: > > [ 201.069084] NMI backtrace[ 201.069084] NM[ 201.069087] CPU: 6 > PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12 > #1 01112b9f14923cbd0b[ 201.069090] Hardware name: Google Lazor > ([ 201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[ > 201.069095] pc : smp_call_function_man[ 201.069099] > > That's obviously not so great. This happens because each call to the > console driver exits after the data has been written to the FIFO but > before it's actually been flushed out of the serial port. When we have > multiple calls into the console one after the other then (if we can't > get the lock) each call tells the UART to throw away any data in the > FIFO that hadn't been transferred yet. > > I've posted up a patch to change the arm64 core to avoid this > situation most of the time [1] much like x86 seems to do, but even if > that patch lands the GENI driver should still be fixed. > > From testing, it appears that we can just delete the cancel/abort in > the case where we weren't able to get the UART lock and the output > looks good. It makes sense that we'd be able to do this since that > means we'll just call into __qcom_geni_serial_console_write() and > __qcom_geni_serial_console_write() looks much like > qcom_geni_serial_poll_put_char() but with a loop. However, it seems > safest to poll the FIFO and make sure it's empty before our > transfer. This should reliably make sure that we're not > interrupting/clobbering any existing transfers. > > As part of this change, we'll also avoid re-setting up a TX at the end > of the console write function if we weren't able to get the lock, > since accessing "port->tx_remaining" without the lock is not > safe. This is only needed to re-start userspace initiated transfers. > > [1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid > > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx> Reviewed-by: Bjorn Andersson <andersson@xxxxxxxxxx> Regards, Bjorn > --- > > drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++-------------- > 1 file changed, 13 insertions(+), 14 deletions(-) > > diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c > index 7e78f97e8f43..06ebe62f99bc 100644 > --- a/drivers/tty/serial/qcom_geni_serial.c > +++ b/drivers/tty/serial/qcom_geni_serial.c > @@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s, > > geni_status = readl(uport->membase + SE_GENI_STATUS); > > - /* Cancel the current write to log the fault */ > if (!locked) { > - geni_se_cancel_m_cmd(&port->se); > - if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS, > - M_CMD_CANCEL_EN, true)) { > - geni_se_abort_m_cmd(&port->se); > - qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS, > - M_CMD_ABORT_EN, true); > - writel(M_CMD_ABORT_EN, uport->membase + > - SE_GENI_M_IRQ_CLEAR); > - } > - writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR); > + /* > + * We can only get here if an oops is in progress then we were > + * unable to get the lock. This means we can't safely access > + * our state variables like tx_remaining. About the best we > + * can do is wait for the FIFO to be empty before we start our > + * transfer, so we'll do that. > + */ > + qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS, > + M_TX_FIFO_NOT_EMPTY_EN, false); > } else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) { > /* > * It seems we can't interrupt existing transfers if all data > @@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s, > > __qcom_geni_serial_console_write(uport, s, count); > > - if (port->tx_remaining) > - qcom_geni_serial_setup_tx(uport, port->tx_remaining); > > - if (locked) > + if (locked) { > + if (port->tx_remaining) > + qcom_geni_serial_setup_tx(uport, port->tx_remaining); > uart_port_unlock_irqrestore(uport, flags); > + } > } > > static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop) > -- > 2.43.0.275.g3460e3d667-goog >