AM3517 DaVinci EMAC Ethernet performance issues

(( Attempting to re-post this since Yahoo! shipped the previous one as HTML... ))

All,

My team is presently seeing *extremely poor* (on the order of single-digit Mbps) Ethernet performance out of an AM3517-based COM (Technexion's TAM-3517 in this case) when it _transmits TCP_. Receiving TCP works fine, and UDP transmit and receive both appear solid. Is anyone else seeing anything like this on an AM3517-based platform? (I have a CompuLab CM-T3517 that I'll try to get to by the end of this week for comparison.)

I reported a similar, perhaps related, issue nearly a year ago at http://thread.gmane.org/gmane.linux.ports.arm.omap/78647 & http://e2e.ti.com/support/arm/sitara_arm/f/416/t/195442.aspx, and never heard much in response. Though the performance of the EMAC port has never been stellar (others have admitted as much), we continued working with the COM because the network performance our tests were seeing at the time was more than adequate for the tasks at hand. Recently, however, while testing our latest hardware we hit this nasty performance snag, which forced us to revisit the issue entirely. Frustratingly, these tests show that performance is now far worse than anything we previously saw, on both our custom hardware and the dev kit systems.

The behavior is easily characterized using 'iperf'. If the TAM hosts the iperf server (i.e. receives TCP using 'iperf -s'), a client can connect to it and run at ~90 Mbps forever. That's perfect. If those roles are reversed, however, and the TAM plays client (i.e. transmits TCP using 'iperf -i 10 -t 60 -c <server_ip>'), the data rate becomes sporadic and often plummets or even times out. Please see the captures below. Although it misbehaves dramatically, the driver never registers a single error, xrun, nothing...

*** EMAC running server (receiving TCP) ***
$ iperf -i 10 -t 60 -c 10.22.0.17
------------------------------------------------------------
Client connecting to 10.22.0.17, TCP port 5001
TCP window size: 23.5 KByte (default)
------------------------------------------------------------
[ 3] local 10.22.255.5 port 60936 connected with 10.22.0.17 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 113 MBytes 94.5 Mbits/sec
[ 3] 10.0-20.0 sec 112 MBytes 94.3 Mbits/sec
[ 3] 20.0-30.0 sec 112 MBytes 94.0 Mbits/sec
[ 3] 30.0-40.0 sec 112 MBytes 94.2 Mbits/sec
[ 3] 40.0-50.0 sec 112 MBytes 94.3 Mbits/sec
[ 3] 50.0-60.0 sec 112 MBytes 94.0 Mbits/sec
[ 3] 0.0-60.0 sec 674 MBytes 94.2 Mbits/sec

*** EMAC running client (transmitting TCP) ***
# iperf -i 10 -t 60 -c 10.22.255.5
------------------------------------------------------------
Client connecting to 10.22.255.5, TCP port 5001
TCP window size: 19.6 KByte (default)
------------------------------------------------------------
[ 3] local 10.22.0.17 port 43185 connected with 10.22.255.5 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 92.5 MBytes 77.6 Mbits/sec
[ 3] 10.0-20.0 sec 102 MBytes 85.3 Mbits/sec
[ 3] 20.0-30.0 sec 98.6 MBytes 82.7 Mbits/sec
[ 3] 30.0-40.0 sec 55.4 MBytes 46.5 Mbits/sec
[ 3] 40.0-50.0 sec 2.73 MBytes 2.29 Mbits/sec
[ 3] 50.0-60.0 sec 1.26 MBytes 1.06 Mbits/sec
[ 3] 0.0-64.5 sec 352 MBytes 45.8 Mbits/sec
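
In case it helps anyone reproduce, the "never registers a single error" observation above is based on watching counters along these lines after each run. The interface name eth0 is specific to our setup, and 'ethtool -S' only reports anything if the driver implements its statistics hook:

# ip -s link show dev eth0    (per-interface RX/TX packet, error and drop counters)
# ethtool -S eth0             (driver-specific statistics, where implemented)
# cat /proc/net/snmp          (stack-level TCP counters, e.g. RetransSegs)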

Since discovering this behavior at the end of last week, I have systematically gone back through the generations of our custom carrier boards as well as the TAM's Twister dev kit and confirmed that the issue is now present on everything we have. Since the behavior appears to have changed since we last aggressively tested this nearly a year ago, I'm assuming a software alteration somewhere is largely to blame. So I walked back through all of my recorded boot logs and retried our main previous kernels (l-o 3.4-rc6 and l-o 3.5-rc4) as well as older versions of the bootloaders. In every case, the problem remained.

The latest software we're running is still based on linux-omap's 3.5-rc4. We locked the kernel down there several months ago in order to stage for release, and until we discovered this last week it had been running _very_ stably. I have, however, continued to monitor the lists and major patch sites to see whether any major bug fixes were released for the drivers we're using, etc. Since discovering this issue, I've also gone ahead and backported many of the patches released by the folks I CC'd on this message - at least those I could easily pull in without upgrading the kernel. Unless I'm overlooking something, it now looks like I have everything but the DT and OF work incorporated into our kernel. (I'm assuming the DT and OF changes really do not impact performance. Is that a safe assumption?) Unfortunately, pulling in those changes has not corrected this issue.

We've done network captures on our link, and the problem is very strange. The iperf client transmits data quickly and steadily for a while, but then all of a sudden just stops. In the captures you can see an ACK come back from the server for the frame that was just sent, but then, instead of immediately sending the next segment, the client just sits there, sometimes for several seconds. Then it suddenly picks back up and starts running again. It's as though it simply paused due to lack of data. Again, no errors or xruns are ever triggered, and even with full NETIF_MSG debugging on, we're getting nothing.
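
If anyone wants to look at the stall themselves, a capture along the lines of the following (taken on the TAM side; the interface name, peer IP and output file name are just from/for our setup) shows the gap clearly when opened in Wireshark, and the NETIF_MSG debugging mentioned above can be cranked up via ethtool, assuming the driver honors the msglvl setting:

# tcpdump -i eth0 -s 0 -w emac_tx_stall.pcap host 10.22.255.5 and tcp port 5001
# ethtool -s eth0 msglvl 0xffff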

One other note: the more I play around with this, the more I notice that manually increasing the TCP window size helps things dramatically.

*** EMAC running client (transmitting TCP) WITH larger window ***
# iperf -i 10 -t 60 -c 10.22.255.5 -w 85K
------------------------------------------------------------
Client connecting to 10.22.255.5, TCP port 5001
TCP window size: 170 KByte (WARNING: requested 85.0 KByte)
------------------------------------------------------------
[ 3] local 10.22.0.17 port 43189 connected with 10.22.255.5 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 105 MBytes 88.3 Mbits/sec
[ 3] 10.0-20.0 sec 92.0 MBytes 77.2 Mbits/sec
[ 3] 20.0-30.0 sec 94.8 MBytes 79.6 Mbits/sec
[ 3] 30.0-40.0 sec 88.3 MBytes 74.1 Mbits/sec
[ 3] 40.0-50.0 sec 95.8 MBytes 80.3 Mbits/sec
[ 3] 50.0-60.0 sec 105 MBytes 87.9 Mbits/sec
[ 3] 0.0-60.0 sec 581 MBytes 81.2 Mbits/sec

While that is encouraging, I know I should not have to do this; it feels like it's simply masking the real problem.
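
For anyone who wants to try the same workaround outside of iperf, the equivalent system-wide knobs are the standard send-buffer sysctls. The values below are purely illustrative, not a recommendation:

# sysctl -w net.core.wmem_max=262144
# sysctl -w net.ipv4.tcp_wmem="4096 65536 262144"

(The three tcp_wmem values are the min/default/max TCP send buffer sizes in bytes.)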

I've looked closely at the sysctl parameters associated with the ipv4 stack (we're not using ipv6) and contrasted them against the parameters on several other systems around here. Again, I'm not finding anything obvious.
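
In case anyone wants to compare settings against their own boards, the comparison is essentially just dumping and diffing, roughly like this (file names are placeholders):

# sysctl -a 2>/dev/null | grep '^net\.ipv4\.' | sort > tam_ipv4_sysctls.txt
# diff tam_ipv4_sysctls.txt pc_ipv4_sysctls.txt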

Has anyone seen anything like this out of TI's DaVinci EMAC before and/or does anyone have any idea what could be causing this? Any and all help in tracking this down would be greatly appreciated! To anyone willing to help, I'll happily provide as much info as I can. Please just ask.

If necessary, I can look into pushing our kernel forward toward the leading edge of the l-o series, but I would prefer to exhaust other avenues first, since upgrading would void the months of testing we've already done.

Thanks in advance and thanks to all who contribute to these excellent open source tools and products.