> I also tried running cachestat but didn't get anything interesting:
>
> Counting cache functions... Output every 1 seconds.
> TIME        HITS   MISSES  DIRTIES  RATIO    BUFFERS_MB  CACHE_MB
> 10:06:59    1020   5       0        99.5%    0           2
> 10:07:00    1029   0       0        100.0%   0           2
> 10:07:01    1013   0       0        100.0%   0           2
> 10:07:02    1029   0       0        100.0%   0           2
> 10:07:03    1029   0       0        100.0%   0           2
> 10:07:04    997    0       0        100.0%   0           2
> 10:07:05    1013   0       0        100.0%   0           2
>
> (I started iperf at 10:07:00).

cachestat only counts page-cache hits and misses, so it won't tell you much here. Try looking at the L1 cache performance instead. For this class of device, the L1 instruction cache is probably too small to contain the active parts of the network stack, and the less cache thrashing you have, the faster the stack will go. Maybe try compiling with -Os so the kernel is optimised for size, and build a custom kernel with everything you don't need turned off. (There is a small example of reading the L1 miss counter after my signature.)

Also look at the work being done to batch-process packets. Rather than passing one packet at a time through the network stack, it passes a linked list of packets to each stage in the stack, which should result in fewer cache misses per packet. But not all layers in the stack support this batching. See if you can find out where the list is being unbatched, and why. Can you influence this, either by disabling build options or by working on the code to pass batches further along the stack? (There is a rough sketch of the batching idea after the counter example.)

Andrew
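P.S. To look at the L1 directly you need the hardware performance counters. The easiest route is probably "perf stat -e L1-icache-load-misses" around your iperf run, if perf is available on the box. Failing that, here is a minimal sketch of reading the same counter from C via perf_event_open(2). It assumes your CPU's PMU actually exposes the L1I miss event (check with "perf list" first), and run_workload() is just a stand-in for whatever you want to measure.

/*
 * Minimal sketch: count L1 instruction-cache read misses around a chunk of
 * work using perf_event_open(2).  Counting kernel-side misses (which is
 * what you care about for the network stack) may require lowering
 * /proc/sys/kernel/perf_event_paranoid or running as root.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    /* There is no glibc wrapper for this syscall, so call it directly. */
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

static void run_workload(void)
{
    /* Placeholder: in reality you would drive iperf (or similar) here. */
    for (volatile int i = 0; i < 1000000; i++)
        ;
}

int main(void)
{
    struct perf_event_attr attr;
    uint64_t count;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    /* L1 instruction cache, read accesses, count misses. */
    attr.config = PERF_COUNT_HW_CACHE_L1I |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_hv = 1;

    /* Measure this process, on any CPU it runs on. */
    fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) {
        perror("perf_event_open (L1I miss event not supported?)");
        return 1;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run_workload();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    if (read(fd, &count, sizeof(count)) != sizeof(count)) {
        perror("read");
        return 1;
    }
    printf("L1 i-cache read misses: %llu\n", (unsigned long long)count);

    close(fd);
    return 0;
}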
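P.P.S. On the batching: the mainline work I had in mind is the listified receive path (netif_receive_skb_list() and the list variants of the per-protocol receive handlers), which only exists if your kernel is new enough to carry it. The sketch below is not kernel code; the struct and the stage functions are invented purely to show the shape of the idea. The thing to notice is that a layer which only accepts single packets forces the caller back into the one-at-a-time loop, and that is where the i-cache benefit stops, so grep your tree for where the list-receive paths hand off to the single-packet functions.

/*
 * Illustration only, not kernel code.  Made-up types and stage functions
 * to show why handing a whole list of packets to each layer is kinder to
 * the L1 i-cache than walking the full stack per packet.
 */
#include <stdio.h>
#include <stddef.h>

struct pkt {
    struct pkt *next;          /* singly linked batch of packets */
    unsigned int len;          /* here just counts how many stages ran */
};

/* Stand-ins for the per-layer work (link, network, transport). */
static void eth_stage(struct pkt *p) { p->len += 1; }
static void ip_stage(struct pkt *p)  { p->len += 1; }
static void udp_stage(struct pkt *p) { p->len += 1; }

/*
 * Per-packet model: every packet walks the whole stack on its own, so in
 * the real stack each layer's code gets pulled back into the L1 i-cache
 * for every single packet.
 */
static void deliver_one_at_a_time(struct pkt *batch)
{
    for (struct pkt *p = batch; p; p = p->next) {
        eth_stage(p);
        ip_stage(p);
        udp_stage(p);
    }
}

/*
 * Batched ("listified") model: each layer processes the whole list before
 * handing it on, so that layer's instructions stay hot for the whole batch
 * and any per-call overhead is paid once per batch, not once per packet.
 * A layer that only takes single packets forces a fallback to the loop
 * above, and the benefit ends there.
 */
static void deliver_as_list(struct pkt *batch)
{
    for (struct pkt *p = batch; p; p = p->next)
        eth_stage(p);
    for (struct pkt *p = batch; p; p = p->next)
        ip_stage(p);
    for (struct pkt *p = batch; p; p = p->next)
        udp_stage(p);
}

int main(void)
{
    struct pkt pkts[4];

    /* Chain four packets into a singly linked batch. */
    for (int i = 0; i < 4; i++) {
        pkts[i].next = (i + 1 < 4) ? &pkts[i + 1] : NULL;
        pkts[i].len = 0;
    }

    deliver_one_at_a_time(pkts);
    deliver_as_list(pkts);
    printf("packet 0 passed through %u stage calls in total\n", pkts[0].len);
    return 0;
}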