Re: Question regarding CDC NCM and VNC performance issue

Oliver Neukum <oneukum@xxxxxxxx> · Tue, 12 Dec 2023 10:48:01 +0100

On 11.12.23 21:44, Maciej Żenczykowski wrote:
On Mon, Dec 11, 2023 at 12:29 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote:

On Thu, Dec 07, 2023 at 08:37:09PM +0100, Maciej Żenczykowski wrote:
On Thu, Dec 7, 2023 at 7:57 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote:

Hi,

On Thu, Dec 07, 2023 at 12:07:25PM +0100, Oliver Neukum wrote:
That suggests, but does not prove that the issue is on the host side.
Could you post the result of "ethtool -S" after a test run? We should
get statistics on the reasons for transmissions that way.

Finally, I changed from 8192 to 4096, and the performance was
better:

$ sudo ethtool -S enx3a601e306de1
NIC statistics:
      tx_reason_ntb_full: 0
      tx_reason_ndp_full: 0
      tx_reason_timeout: 56067

This has grown two orders of magnitude.

      tx_reason_max_datagram: 0
      tx_overhead: 83630876
      tx_ntbs: 56064
      rx_overhead: 25437595
      rx_ntbs: 847920

At 4096 I can use the VNC with my app, click on buttons and see the mouse
moving smoothly. Please note the device name changes because we're using
random MAC addresses. 'ethtool' was running on my Debian host PC. I tested
for 1min30s and then got the statistics with ethtool for all 3 tests.

As you are testing for a constant time, the increase in transmissions
due to timeouts also decreases latency by two orders of magnitude,
though this does not ultimately tell us which side is responsible.
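
As a rough cross-check of the counters quoted above, the arithmetic can be sketched in Python. The counter values are the ones from the 4096 run and the 90 s duration is the "1min30s" stated by the reporter; the function names are made up for illustration.

```python
# Counters copied from the "ethtool -S" output of the 4096 run in this thread.
TX_REASON_TIMEOUT = 56067
TX_NTBS = 56064
TEST_DURATION_S = 90  # "1min30s" as stated by the reporter


def ntbs_per_second(tx_ntbs: int, duration_s: float) -> float:
    """Average number of NTBs sent per second over the test run."""
    return tx_ntbs / duration_s


def mean_ntb_interval_ms(tx_ntbs: int, duration_s: float) -> float:
    """Average spacing between two transmitted NTBs, in milliseconds."""
    return duration_s * 1000.0 / tx_ntbs


if __name__ == "__main__":
    print(f"NTBs per second: {ntbs_per_second(TX_NTBS, TEST_DURATION_S):.1f}")
    print(f"mean interval:   {mean_ntb_interval_ms(TX_NTBS, TEST_DURATION_S):.2f} ms")
    print(f"timeout / ntbs:  {TX_REASON_TIMEOUT / TX_NTBS:.4f}")
```

With these numbers practically every NTB in that run went out because the timer expired, at an average of roughly one NTB every 1.6 ms.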
While the flood is happening in parallel, the VNC runs very smoothly,
and, again, as soon as it stops, it's back to slow/frozen.

I believe here the ping command is helping to fill the buffer, that's
why running it in parallel makes the VNC work...

Indeed. You can confirm this by running "ethtool -S" before and after the ping.
During the ping tx_reason_timeout should stagnate and probably tx_reason_ndp_full
will go up.
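
The before/after comparison can be sketched with a small Python helper. This is only an illustration: the function names and the file names before.txt and after.txt are hypothetical, and the parser assumes the plain "name: value" layout shown in the "ethtool -S" output earlier in this thread.

```python
import sys


def parse_ethtool_stats(text: str) -> dict:
    """Turn the output of "ethtool -S <iface>" into a {counter: value} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon; header lines such as "NIC statistics:"
        # have nothing numeric after it and are skipped.
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def stats_delta(before: dict, after: dict) -> dict:
    """Per-counter difference for counters present in both snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python3 ncm_delta.py before.txt after.txt
    with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
        delta = stats_delta(parse_ethtool_stats(f_before.read()),
                            parse_ethtool_stats(f_after.read()))
    for name, diff in delta.items():
        print(f"{name}: {diff:+d}")
```

Save one "ethtool -S" snapshot to a file before starting the ping and one after stopping it, then compare the deltas of tx_reason_timeout and tx_reason_ndp_full.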

#define TX_TIMEOUT_NSECS 300000
300 us is too small to be noticeable by VNC imho, so I think something
*must* be misbehaving.
Perhaps timer resolution is bad and this 300us ends up being much larger...???

Now that you mention it, and having taken a closer look, I suspect this piece of code:

        } else if ((n < ctx->tx_max_datagrams) && (ready2send == 0) && (ctx->timer_interval > 0)) {
                /* wait for more frames */
                /* push variables */
                ctx->tx_curr_skb = skb_out;
                /* set the pending count */
                if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT)
                        ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT;
                goto exit_no_skb;

Hiago, could you try lowering CDC_NCM_TIMER_PENDING_CNT, if need be all the way to 1?
It is currently defined as 3 in include/linux/usb/cdc_ncm.h.
This applies to the host side.
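
Why the pending count could matter can be sketched with a toy model in Python. This is a simplification, not the driver code: it assumes that each timer expiry either decrements the pending count and re-arms the timer or, once the count has reached zero, flushes the NTB. The 400 us interval is only an assumed example value; the real one is whatever ctx->timer_interval holds on the host. All function names are hypothetical.

```python
def timer_expiries_before_flush(pending_cnt: int) -> int:
    """Count timer expiries until the simplified model flushes the NTB."""
    expiries = 0
    pending = pending_cnt
    while True:
        expiries += 1      # the timer fires
        if pending == 0:   # nothing left to wait for: transmit now
            return expiries
        pending -= 1       # otherwise count down and re-arm the timer


def worst_case_flush_delay_us(interval_us: int, pending_cnt: int) -> int:
    """Upper bound on the wait after the last queued frame, in microseconds."""
    return timer_expiries_before_flush(pending_cnt) * interval_us


if __name__ == "__main__":
    ASSUMED_INTERVAL_US = 400  # assumption for illustration only
    for cnt in (3, 2, 1):
        delay = worst_case_flush_delay_us(ASSUMED_INTERVAL_US, cnt)
        print(f"pending count {cnt}: up to {delay} us before the NTB is sent")
```

In this model the wait scales with (pending count + 1), so lowering the count from 3 to 1 would halve the upper bound.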

	Regards
		Oliver