Hi, On Fri, May 24, 2024 at 5:17 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote: > > 1. It appears that M_GP_LENGTH can still advance after the FIFO > becomes 0, which is extra proof that the transfer is still happening > even though the FIFO says it's done. Presumably we could keep track of > how many bytes we have enqueued into the FIFO for this command and > then compare. As I was trying to do this, though, I noticed another > option... > > 2. It appears that instead of "cancelling" the current command we can > just issue a new 0-byte transfer and wait for the 0-byte transfer to > be "done". This causes geni to give us back a "M_CMD_OVERRUN" > interrupt, but that's fine and we can ignore it. That interrupt just > says "hey, you gave me a command before the previous one was done" but > it does seem to properly accept the new command and it doesn't drop > any bytes. > > ...it turns out that we (apparently) already have been using option #2 > to interrupt a transfer without dropping bytes. When the UART is > shared between an agetty and the kernel console this happens all the > time. In qcom_geni_serial_console_write() we'll issue a new command > before finishing a current one and then re-issue the current command > with any remaining bytes. So not only should this be safe but it's > already tested to work. > > I'll need to spend a little more time on this to really confirm it > works as I expect and I'll send up a v2 using approach #2. Just to connect the dots more, I did more testing and option #2 totally didn't work. The console/kdb stuff was working mostly through ugly fragile hacks. Polling GP_LENGTH seemed like the only option, though when I dug more I found that this wasn't the right place to do it anyway... > Also note that while spending more time on this I found _yet another_ > bug, this one more serious. My original testing was done on kernel 6.6 > (with stable backports) and I just did confirmation on mainline. > That's why I didn't see this new bug originally. ...but this time I > spent more time testing on mainline. It turns out that the recent > patches for kfifo on mainline have badly broken geni serial. > Specifically, if you just do "cat /var/log/messages" and then "Ctrl-C" > the machine will hard lockup! Yikes! This is yet another side effect > of the geni "packet"-based protocol biting us (so related to the > problems in ${SUBJECT}, but not identical). Whenever we setup a TX > transfer we set it up for the number of bytes in the queue at the > time. If that number goes down then we're in trouble. Specifically, it > can be noted that: > * When we start transmitting we look at the current queue size, setup > a transfer, and store "tx_remaining". > * Whenever there's space in the FIFO we add bytes and remove them from > the queue and "tx_remaining". > * We don't ever expect bytes to disappear from the queue. I think in > the old code if this happened we're just transfer some bogus bytes. > Now we'll loop in qcom_geni_serial_send_chunk_fifo() because > uart_fifo_out() will keep returning 0. > > I'll try to take a gander at that, too... I spent _far_ too long poking at this and I've finally come up with a v2 that _I think_ fixes everything. For reference: https://lore.kernel.org/r/20240530224603.730042-1-dianders@xxxxxxxxxxxx