Re: [PATCH 2/2] serial: qcom-geni: Don't cancel/abort if we can't get the port lock

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jan 12, 2024 at 03:03:08PM -0800, Douglas Anderson wrote:
> As of commit d7402513c935 ("arm64: smp: IPI_CPU_STOP and
> IPI_CPU_CRASH_STOP should try for NMI"), if we've got pseudo-NMI
> enabled then we'll use it to stop CPUs at panic time. This is nice,
> but it does mean that there's a pretty good chance that we'll end up
> stopping a CPU while it holds the port lock for the console
> UART. Specifically, I see a CPU get stopped while holding the port
> lock nearly 100% of the time on my sc7180-trogdor based Chromebook by
> enabling the "buddy" hardlockup detector and then doing:
> 
>   sysctl -w kernel.hardlockup_all_cpu_backtrace=1
>   sysctl -w kernel.hardlockup_panic=1
>   echo HARDLOCKUP > /sys/kernel/debug/provoke-crash/DIRECT
> 
> UART drivers are _supposed_ to handle this case OK and this is why
> UART drivers check "oops_in_progress" and only do a "trylock" in that
> case. However, before we enabled pseudo-NMI to stop CPUs it wasn't a
> very well-tested situation.
> 
> Now that we're testing the situation a lot, it can be seen that the
> Qualcomm GENI UART driver is pretty broken. Specifically, when I run
> my test case and look at the console output I just see a bunch of
> garbled output like:
> 
>   [  201.069084] NMI backtrace[  201.069084] NM[  201.069087] CPU: 6
>   PID: 10296 Comm: dnsproxyd Not tainted 6.7.0-06265-gb13e8c0ede12
>   #1 01112b9f14923cbd0b[  201.069090] Hardware name: Google Lazor
>   ([  201.069092] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DI[
>   201.069095] pc : smp_call_function_man[  201.069099]
> 
> That's obviously not so great. This happens because each call to the
> console driver exits after the data has been written to the FIFO but
> before it's actually been flushed out of the serial port. When we have
> multiple calls into the console one after the other then (if we can't
> get the lock) each call tells the UART to throw away any data in the
> FIFO that hadn't been transferred yet.
> 
> I've posted up a patch to change the arm64 core to avoid this
> situation most of the time [1] much like x86 seems to do, but even if
> that patch lands the GENI driver should still be fixed.
> 
> From testing, it appears that we can just delete the cancel/abort in
> the case where we weren't able to get the UART lock and the output
> looks good. It makes sense that we'd be able to do this since that
> means we'll just call into __qcom_geni_serial_console_write() and
> __qcom_geni_serial_console_write() looks much like
> qcom_geni_serial_poll_put_char() but with a loop. However, it seems
> safest to poll the FIFO and make sure it's empty before our
> transfer. This should reliably make sure that we're not
> interrupting/clobbering any existing transfers.
> 
> As part of this change, we'll also avoid re-setting up a TX at the end
> of the console write function if we weren't able to get the lock,
> since accessing "port->tx_remaining" without the lock is not
> safe. This is only needed to re-start userspace initiated transfers.
> 
> [1] https://lore.kernel.org/r/20231207170251.1.Id4817adef610302554b8aa42b090d57270dc119c@changeid
> 
> Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>

Reviewed-by: Bjorn Andersson <andersson@xxxxxxxxxx>

Regards,
Bjorn

> ---
> 
>  drivers/tty/serial/qcom_geni_serial.c | 27 +++++++++++++--------------
>  1 file changed, 13 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 7e78f97e8f43..06ebe62f99bc 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -488,18 +488,16 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>  
>  	geni_status = readl(uport->membase + SE_GENI_STATUS);
>  
> -	/* Cancel the current write to log the fault */
>  	if (!locked) {
> -		geni_se_cancel_m_cmd(&port->se);
> -		if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> -						M_CMD_CANCEL_EN, true)) {
> -			geni_se_abort_m_cmd(&port->se);
> -			qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> -							M_CMD_ABORT_EN, true);
> -			writel(M_CMD_ABORT_EN, uport->membase +
> -							SE_GENI_M_IRQ_CLEAR);
> -		}
> -		writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
> +		/*
> +		 * We can only get here if an oops is in progress then we were
> +		 * unable to get the lock. This means we can't safely access
> +		 * our state variables like tx_remaining. About the best we
> +		 * can do is wait for the FIFO to be empty before we start our
> +		 * transfer, so we'll do that.
> +		 */
> +		qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> +					  M_TX_FIFO_NOT_EMPTY_EN, false);
>  	} else if ((geni_status & M_GENI_CMD_ACTIVE) && !port->tx_remaining) {
>  		/*
>  		 * It seems we can't interrupt existing transfers if all data
> @@ -516,11 +514,12 @@ static void qcom_geni_serial_console_write(struct console *co, const char *s,
>  
>  	__qcom_geni_serial_console_write(uport, s, count);
>  
> -	if (port->tx_remaining)
> -		qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>  
> -	if (locked)
> +	if (locked) {
> +		if (port->tx_remaining)
> +			qcom_geni_serial_setup_tx(uport, port->tx_remaining);
>  		uart_port_unlock_irqrestore(uport, flags);
> +	}
>  }
>  
>  static void handle_rx_console(struct uart_port *uport, u32 bytes, bool drop)
> -- 
> 2.43.0.275.g3460e3d667-goog
> 




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [Linux for Sparc]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux