Hi,

On Mon, Jun 24, 2024 at 5:43 AM Johan Hovold <johan@xxxxxxxxxx> wrote:
>
> On Mon, Jun 10, 2024 at 03:24:26PM -0700, Douglas Anderson wrote:
> > The fact that the Qualcomm GENI hardware interface is based around
> > "packets" is really awkward to fit into Linux's UART design.
> > Specifically, in order to send bytes you need to start up a new
> > "command" saying how many bytes you want to send and then you need
> > to send all those bytes. Once you've committed to sending that
> > number of bytes it's very awkward to change your mind and send
> > fewer, especially if you want to do so without dropping bytes on
> > the ground.
> >
> > There may be a few cases where you might want to send fewer bytes
> > than you originally expected:
> > 1. You might want to interrupt the transfer with something higher
> >    priority, like the kernel console or kdb.
> > 2. You might want to enter system suspend.
> > 3. The user might have killed the program that had queued bytes for
> >    sending over the UART.
> >
> > Despite this awkwardness the Linux driver has still tried to send
> > bytes using large transfers. Whenever the driver started a new
> > transfer it would look at the number of bytes in the OS's queue and
> > start a transfer for that many. The idea of using larger transfers
> > is that it should be more efficient. When you're in the middle of a
> > large transfer you can get interrupted when the hardware FIFO is
> > close to empty and add more bytes in. Whenever you get to the end
> > of a transfer you have to wait until the transfer is totally done
> > before you can add more bytes and, depending on interrupt latency,
> > that can cause the UART to idle a bit.
>
> As I mentioned last week, the slowdown from this is quite noticeable
> (e.g. 25% slowdown at 115200), but this may be the price we need to
> pay for correctness, at least temporarily.
>
> An alternative might be to switch to using a 16 byte fifo. This
> should reduce console latency even further, and may be able to avoid
> the idling UART penalty by continuing to use the watermark interrupt
> for refilling the FIFO.

I'm a bit confused. Right now we're using (effectively) a 64-byte
FIFO. The FIFO is 16 words deep and we have 4 bytes per word. ...so
I'm not sure what you mean by switching to a 16-byte FIFO. Do you mean
to make less use of the FIFO, or something else?

Overall, the big problem I found in all my testing was that I needed
to wait for a "command done" before kicking off a new command. When
the "command done" arrives the UART has stopped transmitting and
you've got to suffer an interrupt latency before you can start
transferring again. Essentially:

1. Pick a transfer size.

2. You can keep sending bytes / using the FIFO efficiently as long as
   there are still bytes left in the transfer.

3. When you get to the end of the transfer, you have to wait for the
   UART to stop, report that it's done, and then suffer an interrupt
   latency to start a new transfer.

So to be efficient you want to pick a big transfer size, but if
there's any chance that you might not need to transfer that many bytes
then you need to figure out what to do. If you can handle that
properly then that's great. If not then we have to make sure we never
kick off a transfer that we might not finish.
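In case it helps to picture it, here's a very rough sketch of the loop
I'm describing. It's not the real qcom_geni_serial code; all of the
helper names (start_tx_command(), wait_for_watermark_irq(), etc.) are
made up for illustration:

/*
 * Illustrative pseudocode only -- not the actual driver. Helper names
 * are invented; the point is the shape of the sequence.
 */
void send_chunk(const u8 *buf, size_t n)
{
	size_t bytes_left = n;

	/* Kick off a GENI TX "command" that promises exactly n bytes. */
	start_tx_command(n);

	/*
	 * While the command is still in flight we get watermark
	 * interrupts and can keep topping up the FIFO, so the line
	 * never goes idle.
	 */
	while (bytes_left) {
		wait_for_watermark_irq();
		while (bytes_left && fifo_has_space()) {
			size_t written = write_fifo_word(buf, bytes_left);

			buf += written;
			bytes_left -= written;
		}
	}

	/*
	 * Once all n bytes are queued we must wait for "command done".
	 * From the last byte going out on the wire until we take this
	 * interrupt and start the next command the UART sits idle --
	 * that's the per-transfer cost.
	 */
	wait_for_cmd_done_irq();
}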
I'd also mention that, as talked about in my response to your other
patch [1], I'm not seeing a 25% slowdown. I tested both with my simple
proposal and with this whole series applied and my slowdown is less
than 2%. I guess there must be something different with your setup?

Trying to think about what kind of slowdown would be reasonable for my
patch series at 115200:

a) We send 64 bytes efficiently, which takes 5.6 ms (64 * 1000 / 11520).

b) We stop transferring and wait for an interrupt.

c) We start transferring 64 bytes again.

Let's say that your interrupt latency is 1 ms, which would be really
terrible. In that case you'll essentially transfer 64 bytes in 6.6 ms
instead of 5.6 ms, right? That would be an 18% hit. Let's imagine
something more sensible and say that most of the time you can handle
an interrupt in 100 us. That would be about a 1.7% slowdown, which
actually matches what I was seeing.

For reference, even an old arm32 rk3288-veyron device I worked with
years ago could usually handle interrupts in ~100-200 us since dwc2
needs you to handle at least one (sometimes more) interrupt per USB
uFrame (125 us).

...so I'm confused about where your 25% number is coming from...

[1] https://lore.kernel.org/r/CAD=FV=UwyzA614tDoq7BntW1DWmic=DOszr+iRJVafVEYrXhpw@xxxxxxxxxxxxxx
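And in case it's useful, here's the arithmetic above as a tiny
standalone toy program (nothing to do with the driver) so you can plug
in your own interrupt latency:

/*
 * Back-of-the-envelope only: extra time from one interrupt-latency
 * stall per 64-byte chunk at 115200 baud (10 bits per byte on the
 * wire). Not driver code.
 */
#include <stdio.h>

int main(void)
{
	const double bytes_per_sec = 115200.0 / 10;           /* 11520 */
	const double chunk_ms = 64 * 1000.0 / bytes_per_sec;  /* ~5.6 ms */
	const double latency_ms[] = { 1.0, 0.2, 0.1 };

	for (int i = 0; i < 3; i++)
		printf("latency %.1f ms -> ~%.1f%% slowdown\n",
		       latency_ms[i], 100.0 * latency_ms[i] / chunk_ms);

	return 0;
}

With 1 ms of latency that prints ~18% and with 100 us it prints ~1.8%,
which lines up with the numbers above.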