Re: [PATCH v4 8/8] serial: qcom-geni: Rework TX in FIFO mode to fix hangs/lockups

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

On Tue, Jun 25, 2024 at 4:21 AM Johan Hovold <johan@xxxxxxxxxx> wrote:
>
> On Mon, Jun 24, 2024 at 02:15:07PM -0700, Doug Anderson wrote:
> > On Mon, Jun 24, 2024 at 5:43 AM Johan Hovold <johan@xxxxxxxxxx> wrote:
>
> > > As I mentioned last week, the slowdown from this is quite noticeable
> > > (e.g. 25% slowdown at @115200), but this may be the price we need to pay
> > > for correctness, at least temporarily.
> > >
> > > An alternative might be to switch to using a 16 byte fifo. This should
> > > reduce console latency even further, and may be able avoid the idling
> > > UART penalty by continuing to use the watermark interrupt for refilling
> > > the FIFO.
> >
> > I'm a bit confused. Right now we're using (effectively) a 64-byte
> > FIFO. The FIFO is 16-words deep and we have 4 bytes per word. ...so
> > I'm not sure what you mean by switching to a 16-byte FIFO. Do you mean
> > to make less use of the FIFO, or something else?
>
> I meant switching to using one-byte words so that we end up with a
> 16-byte FIFO where we don't have the issue of adding more data when the
> last word is not a full four-byte one.

Ah, I get it! I guess I would have described it as 1-byte per FIFO word.

Certainly that seems like something that's worth trying but, at least
in the past, I remember getting noticeably worse performance with it.
We used to be in that mode when kdb was enabled which I run with most
of the time. Depending on what you set the watermark level to you may
either end up spending a lot more resources servicing interrupts or
you might end up back in the case where you're stalling the transfer
because you couldn't service the interrupt fast enough. At 115.2, each
byte is about 87 microseconds, and draining a 16-byte FIFO is about
1.4ms. If you set the watermark at halfway then you'll get an
interrupt every 8 bytes or ~8x as many interrupts as with my patch
series. You'll also stall any time your interrupt latency is worse
than 694 microseconds. Hopefully that's not too often, though the
slowdowns you measured below make me worried.


> > Overall the big problem I found in all my testing was that I needed to
> > wait for a "command done" before kicking off a new command. When the
> > "command done" arrives then the UART has stopped transmitting and
> > you've got to suffer an interrupt latency before you can start
> > transferring again. Essentially:
> >
> > 1. Pick a transfer size.
> > 2. You can keep sending bytes / using the FIFO efficiently as long as
> > there are still bytes left in the transfer.
> > 3. When you get to the end of the transfer, you have to wait for the
> > UART to stop, report that it's done, and then suffer an interrupt
> > latency to start a new transfer.
> >
> > So to be efficient you want to pick a big transfer size but if there's
> > any chance that you might not need to transfer that many bytes then
> > you need to figure out what to do. If you can handle that properly
> > then that's great. If not then we have to make sure we never kick off
> > a transfer that we might not finish.
>
> Right. But with a 16 1-byte word FIFO, we may be able to kick of a
> really long transfer and just keep it running until it needs to be
> kicked again (cf. enabling TX). The console code can easily insert
> characters in the FIFO while the transfer is running (and would only
> have to wait for 16 characters to drain in the worst case).
>
> Effectively, most of the identified issues would just go away, as
> there's basically never any need to cancel anything except at port
> shutdown.

Yeah, though you'd still have to make sure that the corner cases
worked OK. You'll have to pick _some_ sort of fixed transfer size and
make sure that all the special cases / console / kdb work if they show
up right at the end of the transfer.

I was also a bit curious if there could be power implications with
leaving an active TX command always in place. Perhaps geni wouldn't be
able to drop some resources? Do you happen to know?


> > I'd also mention that, as talked about in my response to your other
> > patch [1], I'm not seeing a 25% slowdown. I tested both with my simple
> > proposal and with this whole series applied and my slowdown is less
> > than 2%. I guess there must be something different with your setup?
> > Trying to think about what kind of slowdown would be reasonable for my
> > patch series at 115200:
> >
> > a) We send 64 bytes efficiently, which takes 5.6ms (64 * 1000 / 11520)
> >
> > b) We stop transferring and wait for an interrupt.
> >
> > c) We start transferring 64 bytes again.
> >
> > Let's say that your interrupt latency is 1 ms, which would be really
> > terrible. In that case you'll essentially transfer 64 bytes in 6.6ms
> > instead of 5.6 ms, right? That would be an 18% hit. Let's imagine
> > something more sensible and say that most of the time you can handle
> > an interrupt in 100 ms. That would be about a 1.7% slowdown, which
> > actually matches what I was seeing. For reference, even an old arm32
> > rk3288-veyron device I worked with years ago could usually handle
> > interrupts in ~100-200 ms since dwc2 needs you to handle at least one
> > (sometimes more) interrupt per USB uFrame (250ms).
> >
> > ...so I'm confused about where your 25% number is coming from...
>
> I didn't do an in-depth analysis of the slowdown, but I did rerun the
> tests now and I'm still seeing a 22-24% slowdown on x1e80100 with rc5.
> This is a new platform so I compared with sc8280xp, which shows similar
> numbers even if it's slightly faster to begin with:
>
>                                         sc8280xp        x1e80100
>
>         rc5 full series                 61 s            67 s
>         rc5 last patch reverted         50 s            54 s
>
> I have a getty running and cat a 10x dmesg file of 543950 bytes to
> /dev/ttyMSM0 from an ssh session (just catting in a serial console gives
> similar numbers).

That's really weird / unexpected. Your hardware should be fancier than
mine so, if anything, I'd expect it to be faster. Is there something
causing you really bad interrupt latency or something? ...or is some
clock misconfigured and "geni" is behaving sub-optimally?

...although it wouldn't explain the slowness, I'd at least be a little
curious if you've confirmed that you're running with a 16-word FIFO
depth. See the function geni_se_get_tx_fifo_depth() where newer
hardware can actually have larger FIFO depths.

Just in case it matters, I'd be curious if you have
`CONFIG_IRQ_TIME_ACCOUNTING=y`

Oh: one last thing to confirm: do you have kernel console output
disabled for your tests? I've been doing tests with the kernel console
_not_ enabled over the serial port and just an agetty there. I could
believe things might be different if the kernel console was sending
messages over the same port.

-Doug





[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [Linux for Sparc]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux