xilinx_uartps: transmitter race condition

Helmut Grohne <h.grohne@xxxxxxxxxx> · Thu, 4 May 2017 13:59:11 +0200

Last year, I sent a couple of patches trying to improve the
xilinx_uartps driver. Now I was allocated to work on this driver again.
Unfortunately, I don't have ready to merge patches at this point.

The original symptom that made me investigate the xilinx_uartps driver
was occasional kernel hangs and garbage coming out of the console. While
trying to reproduce the issue on a current kernel, I no longer
encountered any hangs, but I was able to reproduce the garbage.

The driver has an interrupt-driven mode (used when opening the console
device) and a polling mode (used by kmsg). The issue is easily
reproducible by rapidly writing to /dev/kmsg when having commit
e3538c37ee38 ("tty: xuartps: Beautify read-modify writes") reverted and
nothing has opened the console. In that case, the transmitter is
initially disabled and cdns_uart_console_write disables the transmitter
after writing (due to the reversion). Without the reversion, it still is
reproducible much less reliably by writing to kmsg and then opening the
port (which temporarily disables the transmitter in cdns_uart_startup
for reconfiguration). From all that I have seen, it seems that disabling
the transmitter too early after writing data to the fifo is the cause.
Whatever "too early" means.

On a typical remote USB uart, the symptom looks as if 1 to 6 characters
before finishing a cdns_uart_console_write (e.g. writing a line to
kmsg), garbage characters (often with high bits) are received. This
continues for 10 to 80 characters into the next write. Upon inspection
with an oscilloscope, there is a different picture. At the start of
where the USB uart sees garbage, a character takes slightly longer than
expected.  After the erroneous character, transmission continues
normally. It seems that the USB uart temporarily desynchronizes and
reads overlapping characters, which I perceive as garbage.

In 2016, I sent a patch[1] for waiting longer after a
cdns_uart_console_write. It not only waits for the transmitter fifo to
empty (as the current driver does), but also waits for the TACTIVE bit
in the channel status register to clear. The current driver does not
look at the TACTIVE bit at all. Unfortunately, I am still able to rarely
reproduce the symptoms with this patch applied. It seems as if the
device is still not finished transmitting even after indicating TXEMPTY
and clearing TACTIVE. Increasing the wait beyond makes the symptoms
disappear entirely. Figuring how long to wait is the difficult part
though. It seems that after TXEMPTY is indicated, waiting three times as
long as it takes to clear TACTIVE is sufficient at a baud rate of
115200.

What is a good way to move on from here? I can send a patch that
increases the wait time using the aforementioned heuristic, but that
still doesn't tell at all why the issue happens and how long one
actually needs to wait. Any ideas?

I also sent a patch[2] that fixes a typo in cdns_uart_startup, that
causes the transmitter to be disabled. It still is applicable today and
I think it should be applied.

I also proposed making the logic for enabling/disabling the transmitter
more consistent (after commit e3538c37ee38 changed it in a non-obvious
way). In a discussing with Sören Brinkmann, we agreed that the
transmitter should be remain enabled after a write. Cleaning up other
logic to be consistent with that decision is what another earlier
patch[3] does.

Should I try resending any of these patches?

Helmut

[1] http://marc.info/?l=linux-serial&m=146977837419015
[2] http://marc.info/?l=linux-serial&m=146977916219251
[3] http://marc.info/?l=linux-serial&m=147012929717986 (includes patch [2])
--
To unsubscribe from this list: send the line "unsubscribe linux-serial" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html