On 01/29/2016 08:35 AM, John Ogness wrote:
> Hi Peter,
>
> On 2016-01-25, Peter Hurley <peter@xxxxxxxxxxxxxxxxxx> wrote:
>>> The DMA-enabled OMAP UART driver in its current form queues 48 bytes
>>> for a DMA-RX transfer. After the transfer is complete, a new transfer
>>> of 48 bytes is queued. The DMA completion callback runs in tasklet
>>> context, so a reschedule with context switch is required for the
>>> completion to be processed and the next 48 bytes to be queued.
>>>
>>> When running at a high speed such as 3Mbit, the CPU has 128us between
>>> when the DMA hardware transfer completes and when the DMA hardware
>>> must be fully prepared for the next transfer. For an embedded board
>>> running applications, this does not give the CPU much time. If the
>>> UART is using hardware flow control, this situation results in a
>>> dramatic decrease in real transfer speeds. If flow control is not
>>> used, the CPU will almost certainly be forced to drop data.
>>
>> I'm not convinced by this logic at all.
>> Tasklets are not affected by the scheduler because they run in softirq.
>> Or is this -RT?
>
> Softirq runs as SCHED_OTHER. It is quite easy to create a scenario where
> DMA completion tasklets for this driver are not being serviced fast
> enough.
>
>> I'm not seeing this problem on other platforms at this baud rate,
>
> Do you run 3Mbit on other platforms without hardware flow control?

Yes, but only unidirectionally.

> I mention this because turning on hardware flow control can cover up the
> driver shortcomings by slowing down the transfers. What good is 3Mbit
> hardware if the driver never lets it get above 500Kbit on bulk
> transfers?

That's interesting. I wonder why the 6x hit when using h/w flow control.
Any thoughts on that?

>> and on this platform, all I see is lockups with DMA.
>
> I have seen (and fixed) interesting issues with the AM335x eDMA, but I
> have not experienced lockups in any of my testing. I'd be curious how
> you trigger that.

I haven't tested it since 4.1. I'll go back and re-enable DMA and retest.

>> What is the test setup to reproduce these results?
>
> Two Beaglebone boards connected via ttyS1. ttyS1's are set to raw mode
> at 3Mbit.
>
> sender: cat bigfile > /dev/ttyS1
> receiver: cat /dev/ttyS1 > bigfile

Ok, I can repro something similar.

> I am working on creating concrete examples that demonstrate not only
> that this patch series reduces system load (and thus can increase
> throughput on heavy system loads with hardware flow control), but also
> that it is able to handle baud rates without data loss well beyond the
> current implementation when no flow control is used.
>
> I wanted to wait until I had all the results before answering your
> email. But I'm getting caught up in other tasks right now, so it may
> take a few more weeks.

Ok. So just to be clear here: this patchset is really all about
performance improvement and not correct operation?

>>> This patch series modifies the UART driver to use cyclic DMA transfers
>>> with a growable ring buffer to accommodate baud rates. The ring buffer is
>>> large enough to hold at least 1s of RX-data.
>>> (At 3Mbit that is 367KiB.)
>>
>> Math slightly off because the frame is typically 10 bits, not 8.
>
> I was thinking 8 was the minimal frame size. Thanks for pointing that
> out. A frame can contain 7-12 bits so I will modify the code to create a
> buffer appropriate for the UART settings. At 3Mbit with 5n1 the driver
> would require a 419KiB ring buffer (8929 DMA periods of 48 bytes).

More about this below.
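For the record, the sizing arithmetic itself checks out. Here is the quick
userspace check I used (not taken from the patches); it assumes a frame of
1 start bit + data bits + 1 stop bit, no parity, and the fixed 48-byte DMA
period:

/* Back-of-the-envelope check of the ring buffer sizing:
 * bytes/s = baud / frame_bits, rounded up to whole 48-byte DMA
 * periods to hold 1s of RX data.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int baud = 3000000;
	const unsigned int frame_bits = 1 + 5 + 1;	/* 5n1; 8n1 would be 10 */
	const unsigned int dma_period = 48;		/* bytes per DMA period */

	unsigned int bytes_per_sec = baud / frame_bits;
	unsigned int periods = (bytes_per_sec + dma_period - 1) / dma_period;
	unsigned int ring_kib = (periods * dma_period + 1023) / 1024;

	/* 5n1: 428571 bytes/s -> 8929 periods -> 419 KiB
	 * 8n1: 300000 bytes/s -> 6250 periods -> 293 KiB
	 */
	printf("%u bytes/s -> %u periods -> %u KiB ring\n",
	       bytes_per_sec, periods, ring_kib);
	return 0;
}

The 8n1 case is also where the 6250 completions/sec figure further down
comes from.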
>>> In order to ensure that data in the ring buffer is not overwritten before
>>> being processed by the tty layer, a hrtimer is used as a watchdog.
>>
>> How'd it go from "We're just missing 128us window" to "This holds 1s
>> of data"?
>
> First, you need to recognize that DMA completion tasklets can be delayed
> significantly due to interrupt loads or rtprio processes (even on non-RT
> systems). And at 3Mbit we are talking about >12000 interrupts per
> second!

Not sure I see 12000 ints/sec. unless you're talking about full-duplex
at max rate in both directions?

3Mbit/sec / 10-bit frame / 48 bytes/dma = 6250 ints/sec.

But again, interrupt load is not going to result in 100ms service
intervals. So I think we're really talking about a (misbehaved) rtprio
process that's starving i/o.

> When using cyclic transfers, the only real concern is that the DMA
> overwrites data in the ring buffer before the CPU has processed it due
> to tasklet delays. That is what the hrtimer watchdog is for.
>
> Assuming the DMA is zooming along at full speed, the watchdog must be
> able to trigger before the ring buffer can fill up. If the watchdog sees
> the ring buffer is getting full, it pauses the DMA engine. But with
> cyclic DMA we never know if the DMA is zooming or sitting idle. So even
> on an idle system, the watchdog must assume DMA zooming and continually
> fire to check the status.
>
> I chose 1 second buffer sizes and set the watchdog to fire at half
> that. On an idle system you will see at most 2 new interrupts per second
> due to this patch series. I thought that would be an acceptable trade
> off. Whether the watchdog should fire at 50% buffer full or say 90%
> buffer full is something that could be debated. But to answer your
> question, the big ring buffer is really to keep the watchdog interrupts
> low frequency.

Ok, but your fundamental premise is that all of this is an acceptable
space-time tradeoff for everyone using this platform, when it's not.

So I'm trying to understand the actual use case you're trying to
address. I doubt that's 5n1, full-duplex.

>> And with a latency hit this bad, you'll never get the data to the
>> process because the tty buffer kworker will buffer-overflow too and
>> its much more susceptible to timing latency (although not as bad now
>> that it's exclusively on the unbounded workqueue).
>
> Yes, you are correct. But I think that is a problem that should be
> addressed at the tty layer.

I disagree. I think you should fix the source of 500ms latency.

Regards,
Peter Hurley
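P.S. For anyone skimming the thread, below is roughly the shape of the
watchdog being described, as I understand it. This is only a sketch for
discussion, not code from the series: the struct and field names
(my_uart_rx, wdt, wdt_period, etc.) are invented, and a real
implementation would get the DMA write position from the dmaengine
residue rather than a cached head index.

/* Sketch of the proposed hrtimer watchdog; names are illustrative only. */
#include <linux/dmaengine.h>
#include <linux/hrtimer.h>
#include <linux/kernel.h>

struct my_uart_rx {
	struct hrtimer	wdt;		/* fires every ring-duration / 2   */
	ktime_t		wdt_period;
	struct dma_chan	*chan;		/* cyclic DMA-RX channel           */
	unsigned int	head;		/* DMA write position in the ring  */
	unsigned int	tail;		/* consumer (tasklet/tty) position */
	unsigned int	size;		/* e.g. 8929 * 48 bytes at 3M 5n1  */
};

static enum hrtimer_restart my_uart_rx_wdt(struct hrtimer *t)
{
	struct my_uart_rx *rx = container_of(t, struct my_uart_rx, wdt);
	unsigned int used = (rx->head >= rx->tail) ?
			    rx->head - rx->tail :
			    rx->size - rx->tail + rx->head;

	/*
	 * With cyclic DMA we can't tell from here whether the engine is
	 * "zooming" or idle, so the timer always re-arms. If the unconsumed
	 * part of the ring has crossed the threshold (50% here; 50% vs. 90%
	 * is the debatable point above), pause the channel so nothing is
	 * overwritten before the completion tasklet catches up; the tasklet
	 * would resume the channel once the data has reached the tty layer.
	 */
	if (used >= rx->size / 2)
		dmaengine_pause(rx->chan);

	hrtimer_forward_now(t, rx->wdt_period);
	return HRTIMER_RESTART;
}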