Re: Question regarding CDC NCM and VNC performance issue

Hiago De Franco <hiagofranco@xxxxxxxxx> · Thu, 7 Dec 2023 15:57:12 -0300

Hello Oliver and Maciej,

On Thu, Dec 07, 2023 at 10:41:51AM +0100, Oliver Neukum wrote:
> Can you tell us which side in your test case produces many small packets?
> 
> Furthermore, for testing purposes, could you decrease TX_TIMEOUT_NSECS in
> f_ncm.c by an order of magnitude.

The device side, Apalis iMX6, is the one producing the small packages
and sending to my host PC. The VNC server is running on iMX6 device and,
the client, on my host PC.

I've decreased TX_TIMEOUT_NSECS from 300000 to 30000, but nothing
changed, the behaviour is the same.

On Thu, Dec 07, 2023 at 12:07:25PM +0100, Oliver Neukum wrote:
> That suggests, but does not prove that the issue is on the host side.
> Could you post the result of "ethtool -S" after a test run? We should
> get statistics on the reasons for transmissions that way.

Sure, for some reason I couldn't run ethtool on the iMX6 device:

/ # ethtool -S usb0
no stats available

So I ran everything on my Debian host PC. First, without any changes on
the device's kernel, this is the result (where the VNC is really
slow/frozen):

$ sudo ethtool -S enx3e5dcdead75e
NIC statistics:
     tx_reason_ntb_full: 0
     tx_reason_ndp_full: 0
     tx_reason_timeout: 222
     tx_reason_max_datagram: 0
     tx_overhead: 42387
     tx_ntbs: 222
     rx_overhead: 38375
     rx_ntbs: 529

Next, I decreased NTB_DEFAULT_IN_SIZE and NTB_OUT_SIZE from 16384 to
8192. The performance improved a bit, and this is the result:

$ sudo ethtool -S enx42ff68c1000a
NIC statistics:
     tx_reason_ntb_full: 0
     tx_reason_ndp_full: 0
     tx_reason_timeout: 321
     tx_reason_max_datagram: 0
     tx_overhead: 61617
     tx_ntbs: 321
     rx_overhead: 59341
     rx_ntbs: 759

Finally, I changed from 8192 to 4096, and the perfomance was 
better:

$ sudo ethtool -S enx3a601e306de1
NIC statistics:
     tx_reason_ntb_full: 0
     tx_reason_ndp_full: 0
     tx_reason_timeout: 56067
     tx_reason_max_datagram: 0
     tx_overhead: 83630876
     tx_ntbs: 56064
     rx_overhead: 25437595
     rx_ntbs: 847920

At 4096 I can use the VNC with my app, click on buttons and see the mouse
moving smoothly. Please note the device name changes because we're using
random MAC addresses. 'ethtool' was running on my Debian host PC. I tested
for 1min30s and then got the statics with ethtool for all 3 tests.

On Thu, Dec 07, 2023 at 12:38:18PM +0100, Maciej Żenczykowski wrote:
> 
> An every 1s (the default) ping is too rare to be of help I'd assume...
> Try ping with various intervals (-i).  All the way down to a flood ping (-f).
> Most likely -i 0.01 would be enough to make things work better...

This one is interesting, backing to the default value of 16384, I
launched the VNC client which now is back to be slow/frozen. In
parallel, I started the ping command. The first one with 1 second
results the following (I'm pingging my host PC, using my device's
terminal):

/ # ping 192.168.11.2 -i 1
PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
64 bytes from 192.168.11.2: icmp_seq=1 ttl=64 time=5070 ms
64 bytes from 192.168.11.2: icmp_seq=2 ttl=64 time=4003 ms
64 bytes from 192.168.11.2: icmp_seq=3 ttl=64 time=2963 ms
64 bytes from 192.168.11.2: icmp_seq=4 ttl=64 time=1923 ms
64 bytes from 192.168.11.2: icmp_seq=5 ttl=64 time=883 ms
^C
--- 192.168.11.2 ping statistics ---
26 packets transmitted, 5 received, 80.7692% packet loss, time 25950ms
rtt min/avg/max/mdev = 882.886/2968.250/5069.878/1478.425 ms, pipe 5

Ping is really slow and lost almost all packets. Next test, I decreased
to 0.1s:

/ # ping 192.168.11.2 -i 0.1
PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
...
--- 192.168.11.2 ping statistics ---
129 packets transmitted, 122 received, 5.42636% packet loss, time 13971ms
rtt min/avg/max/mdev = 1.752/999.274/2751.767/799.248 ms, pipe 26

While ping is running in parallel, VNC has a better performance, I can
see my mouse running and click on some buttons. As soon as ping stops,
VNC is slow/frozen again. Also we have less packet loss.

Next test, decreased to 0.01:

/ # ping 192.168.11.2 -i 0.01
PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
...
--- 192.168.11.2 ping statistics ---
584 packets transmitted, 572 received, 2.05479% packet loss, time 10106ms
rtt min/avg/max/mdev = 1.565/171.031/559.872/165.475 ms, pipe 28

And finally, the flood:

/ # ping 192.168.11.2 -f
PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
...
--- 192.168.11.2 ping statistics ---
1314 packets transmitted, 1311 received, 0.22831% packet loss, time 16299ms
rtt min/avg/max/mdev = 0.729/131.710/548.971/163.098 ms, pipe 28, ipg/ewma
12.413/7.802 ms

While the flood is happening in parallel, the VNC runs very smoothly,
and, again, as soon as it stops, it's back to slow/frozen.

I believe here the ping command is helping to fullfil the buffer, that's
why running it on parallel makes the VNC work...

> Also which specific versions of the kernel are involved on both sides
> of the connection.

Device iMX6 is running Linux kernel v6.1.65, while my host PC is running
Linux kernel v6.5.0.

> There was a pretty recent fix related to packet aggregation recently
> that could be either the fix or the cause.
>   "usb: gadget: ncm: Handle decoding of multiple NTB's in unwrap call"
> Though I doubt it - I believe that was specific to how windows packs things.
> 
> Also Krishna Kurapati has a (afaik still not merged) patch "usb:
> gadget: ncm: Add support to configure wMaxSegmentSize"
> that could be of use - though again, doubt it.

I could also try to apply these patches and check how it goes. Thanks
for the information.

> Another thing that comes to mind, is that perhaps the device in
> question does not have sufficiently high res timers?
> There might be something in the kernel boot log / dmesg about hrtimer
> resolution...
> Maybe this just needs to be configurable...  Or pick a smaller value
> with broken hrtimer (if that's the issue),
> or just disable aggregation if lowres hrtimers... etc...
> 
> #define TX_TIMEOUT_NSECS 300000
> 300 us is too small to be noticeable by VNC imho, so I think something
> *must* be misbehaving.
> Perhaps timer resolution is bad and this 300us ends up being much larger...???

This is what I got from dmesg inside the iMX6 device:

/ # dmesg | grep timer
[    0.000000] Switching to timer-based delay loop, resolution 333ns
[    0.000019] clocksource: mxc_timer1: mask: 0xffffffff max_cycles:
0xffffffff, max_idle_ns: 637086815595 ns
[    0.001545] Calibrating delay loop (skipped), value calculated using
timer frequency.. 6.00 BogoMIPS (lpj=30000)
[    0.203469] clocksource: Switched to clocksource mxc_timer1

> I wonder if the hrtimer_init() call should be with CLOCK_BOOTTIME
> instead of CLOCK_MONOTONIC.
> There could potentially be an issue with suspend, though I really doubt it.

Also tested this, but it didn't change anything, VNC is still
slow/frozen.

> Another idea would be to add a gadget setting to disable tx
> aggregation entirely...
> (note that reducing from 8000 to 2000 doesn't actually turn off aggregation...)
> 
> Have you tried reducing from 8000 to 4000 or 3500 or 3000?
> Maybe there's some funkiness with page sizes??

Sorry, do you mean reducing the both NTB_DEFAULT_IN_SIZE and NTB_OUT_SIZE?

If so, I tried to reduce from 16384 to 8192, 4096 and 2048. Everytime I
reduced the value, it got better and better, VNC running smoothly.

I will keep testing to see if I get something.

Thanks and regards,
Hiago.