Re: Question regarding CDC NCM and VNC performance issue

Hiago De Franco <hiagofranco@xxxxxxxxx> · Mon, 11 Dec 2023 17:28:47 -0300

On Thu, Dec 07, 2023 at 08:37:09PM +0100, Maciej Żenczykowski wrote:
> On Thu, Dec 7, 2023 at 7:57 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote:
> >
> > Hello Oliver and Maciej,
> >
> > On Thu, Dec 07, 2023 at 10:41:51AM +0100, Oliver Neukum wrote:
> > > Can you tell us which side in your test case produces many small packets?
> > >
> > > Furthermore, for testing purposes, could you decrease TX_TIMEOUT_NSECS in
> > > f_ncm.c by an order of magnitude.
> >
> > The device side, Apalis iMX6, is the one producing the small packages
> > and sending to my host PC. The VNC server is running on iMX6 device and,
> > the client, on my host PC.
> >
> > I've decreased TX_TIMEOUT_NSECS from 300000 to 30000, but nothing
> > changed, the behaviour is the same.
> >
> > On Thu, Dec 07, 2023 at 12:07:25PM +0100, Oliver Neukum wrote:
> > > That suggests, but does not prove that the issue is on the host side.
> > > Could you post the result of "ethtool -S" after a test run? We should
> > > get statistics on the reasons for transmissions that way.
> >
> > Sure, for some reason I couldn't run ethtool on the iMX6 device:
> >
> > / # ethtool -S usb0
> > no stats available
> >
> > So I ran everything on my Debian host PC. First, without any changes on
> > the device's kernel, this is the result (where the VNC is really
> > slow/frozen):
> >
> > $ sudo ethtool -S enx3e5dcdead75e
> > NIC statistics:
> >      tx_reason_ntb_full: 0
> >      tx_reason_ndp_full: 0
> >      tx_reason_timeout: 222
> >      tx_reason_max_datagram: 0
> >      tx_overhead: 42387
> >      tx_ntbs: 222
> >      rx_overhead: 38375
> >      rx_ntbs: 529
> >
> > Next, I decreased NTB_DEFAULT_IN_SIZE and NTB_OUT_SIZE from 16384 to
> > 8192. The performance improved a bit, and this is the result:
> >
> > $ sudo ethtool -S enx42ff68c1000a
> > NIC statistics:
> >      tx_reason_ntb_full: 0
> >      tx_reason_ndp_full: 0
> >      tx_reason_timeout: 321
> >      tx_reason_max_datagram: 0
> >      tx_overhead: 61617
> >      tx_ntbs: 321
> >      rx_overhead: 59341
> >      rx_ntbs: 759
> >
> > Finally, I changed from 8192 to 4096, and the perfomance was
> > better:
> >
> > $ sudo ethtool -S enx3a601e306de1
> > NIC statistics:
> >      tx_reason_ntb_full: 0
> >      tx_reason_ndp_full: 0
> >      tx_reason_timeout: 56067
> >      tx_reason_max_datagram: 0
> >      tx_overhead: 83630876
> >      tx_ntbs: 56064
> >      rx_overhead: 25437595
> >      rx_ntbs: 847920
> >
> > At 4096 I can use the VNC with my app, click on buttons and see the mouse
> > moving smoothly. Please note the device name changes because we're using
> > random MAC addresses. 'ethtool' was running on my Debian host PC. I tested
> > for 1min30s and then got the statics with ethtool for all 3 tests.
> >
> > On Thu, Dec 07, 2023 at 12:38:18PM +0100, Maciej Żenczykowski wrote:
> > >
> > > An every 1s (the default) ping is too rare to be of help I'd assume...
> > > Try ping with various intervals (-i).  All the way down to a flood ping (-f).
> > > Most likely -i 0.01 would be enough to make things work better...
> >
> > This one is interesting, backing to the default value of 16384, I
> > launched the VNC client which now is back to be slow/frozen. In
> > parallel, I started the ping command. The first one with 1 second
> > results the following (I'm pingging my host PC, using my device's
> > terminal):
> >
> > / # ping 192.168.11.2 -i 1
> > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
> > 64 bytes from 192.168.11.2: icmp_seq=1 ttl=64 time=5070 ms
> > 64 bytes from 192.168.11.2: icmp_seq=2 ttl=64 time=4003 ms
> > 64 bytes from 192.168.11.2: icmp_seq=3 ttl=64 time=2963 ms
> > 64 bytes from 192.168.11.2: icmp_seq=4 ttl=64 time=1923 ms
> > 64 bytes from 192.168.11.2: icmp_seq=5 ttl=64 time=883 ms
> > ^C
> > --- 192.168.11.2 ping statistics ---
> > 26 packets transmitted, 5 received, 80.7692% packet loss, time 25950ms
> > rtt min/avg/max/mdev = 882.886/2968.250/5069.878/1478.425 ms, pipe 5
> >
> > Ping is really slow and lost almost all packets.
> 
> It is *slow*.  But I don't think you're actually losing packets.
> They simply haven't arrived *yet*.
> Notice that the packets are sent every second, but they arrive after
> ~5/~4/~3/~2/~1 s.
> This suggests they all got delivered in one burst.
> 
> You are missing 21 packets, but it's all the *final* 21 packets.
> You'd need to run for a long long time (maybe 5 minutes or even more)
> and see what happens then...
> to see if the packets are indeed lost, or just not yet delivered...
>

Great, thanks for the explanation. Indeed, I ran the tests again, but
now I waited 6 minutes for each test, and these are the results:

-i 1
--- 192.168.11.2 ping statistics ---
349 packets transmitted, 349 received, 0% packet loss, time 361724ms
rtt min/avg/max/mdev = 1.597/10751.276/27960.848/12974.107 ms, pipe 27

-i 0.1
--- 192.168.11.2 ping statistics ---
3301 packets transmitted, 3294 received, 0.212057% packet loss, time 361342ms
rtt min/avg/max/mdev = 0.959/1131.714/3081.504/1385.261 ms, pipe 29

-i 0.01
--- 192.168.11.2 ping statistics ---
20718 packets transmitted, 20718 received, 0% packet loss, time 366465ms
rtt min/avg/max/mdev = 0.568/178.764/587.956/234.702 ms, pipe 29

-flood
--- 192.168.11.2 ping statistics ---
29173 packets transmitted, 29149 received, 0.0822679% packet loss, time 362065ms
rtt min/avg/max/mdev = 0.518/127.472/583.971/204.277 ms, pipe 29, ipg/ewma 12.411/21.570 ms

So yes, we can assume 0% of packet loss. Furthermore, I can see the
bursts happening on my terminal, the ping command doesn't do anything
and suddenly all the pings arrive at the same time, over and over again
in a loop.

> > Next test, I decreased
> > to 0.1s:
> >
> > / # ping 192.168.11.2 -i 0.1
> > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
> > ...
> > --- 192.168.11.2 ping statistics ---
> > 129 packets transmitted, 122 received, 5.42636% packet loss, time 13971ms
> > rtt min/avg/max/mdev = 1.752/999.274/2751.767/799.248 ms, pipe 26
> >
> > While ping is running in parallel, VNC has a better performance, I can
> > see my mouse running and click on some buttons. As soon as ping stops,
> > VNC is slow/frozen again. Also we have less packet loss.
> 
> Again - this packet loss may be just at the tail.
> You'd need to run it for much much longer to see if packet loss is high or not.
> It may just be that you always lose the last ~7 packets or so...
> 
> >
> > Next test, decreased to 0.01:
> >
> > / # ping 192.168.11.2 -i 0.01
> > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
> > ...
> > --- 192.168.11.2 ping statistics ---
> > 584 packets transmitted, 572 received, 2.05479% packet loss, time 10106ms
> > rtt min/avg/max/mdev = 1.565/171.031/559.872/165.475 ms, pipe 28
> >
> > And finally, the flood:
> >
> > / # ping 192.168.11.2 -f
> > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data.
> > ...
> > --- 192.168.11.2 ping statistics ---
> > 1314 packets transmitted, 1311 received, 0.22831% packet loss, time 16299ms
> > rtt min/avg/max/mdev = 0.729/131.710/548.971/163.098 ms, pipe 28, ipg/ewma
> > 12.413/7.802 ms
> >
> > While the flood is happening in parallel, the VNC runs very smoothly,
> > and, again, as soon as it stops, it's back to slow/frozen.
> >
> > I believe here the ping command is helping to fullfil the buffer, that's
> > why running it on parallel makes the VNC work...
> >
> > > Also which specific versions of the kernel are involved on both sides
> > > of the connection.
> >
> > Device iMX6 is running Linux kernel v6.1.65, while my host PC is running
> > Linux kernel v6.5.0.
> >
> > > There was a pretty recent fix related to packet aggregation recently
> > > that could be either the fix or the cause.
> > >   "usb: gadget: ncm: Handle decoding of multiple NTB's in unwrap call"
> > > Though I doubt it - I believe that was specific to how windows packs things.
> > >
> > > Also Krishna Kurapati has a (afaik still not merged) patch "usb:
> > > gadget: ncm: Add support to configure wMaxSegmentSize"
> > > that could be of use - though again, doubt it.
> >
> > I could also try to apply these patches and check how it goes. Thanks
> > for the information.
> >
> > > Another thing that comes to mind, is that perhaps the device in
> > > question does not have sufficiently high res timers?
> > > There might be something in the kernel boot log / dmesg about hrtimer
> > > resolution...
> > > Maybe this just needs to be configurable...  Or pick a smaller value
> > > with broken hrtimer (if that's the issue),
> > > or just disable aggregation if lowres hrtimers... etc...
> > >
> > > #define TX_TIMEOUT_NSECS 300000
> > > 300 us is too small to be noticeable by VNC imho, so I think something
> > > *must* be misbehaving.
> > > Perhaps timer resolution is bad and this 300us ends up being much larger...???
> >
> > This is what I got from dmesg inside the iMX6 device:
> >
> > / # dmesg | grep timer
> > [    0.000000] Switching to timer-based delay loop, resolution 333ns
> > [    0.000019] clocksource: mxc_timer1: mask: 0xffffffff max_cycles:
> > 0xffffffff, max_idle_ns: 637086815595 ns
> > [    0.001545] Calibrating delay loop (skipped), value calculated using
> > timer frequency.. 6.00 BogoMIPS (lpj=30000)
> > [    0.203469] clocksource: Switched to clocksource mxc_timer1
> >
> > > I wonder if the hrtimer_init() call should be with CLOCK_BOOTTIME
> > > instead of CLOCK_MONOTONIC.
> > > There could potentially be an issue with suspend, though I really doubt it.
> >
> > Also tested this, but it didn't change anything, VNC is still
> > slow/frozen.
> >
> > > Another idea would be to add a gadget setting to disable tx
> > > aggregation entirely...
> > > (note that reducing from 8000 to 2000 doesn't actually turn off aggregation...)
> > >
> > > Have you tried reducing from 8000 to 4000 or 3500 or 3000?
> > > Maybe there's some funkiness with page sizes??
> >
> > Sorry, do you mean reducing the both NTB_DEFAULT_IN_SIZE and NTB_OUT_SIZE?
> 
> I'm not sure ;-)
> 
> > If so, I tried to reduce from 16384 to 8192, 4096 and 2048. Everytime I
> > reduced the value, it got better and better, VNC running smoothly.
> >
> > I will keep testing to see if I get something.
> >
> > Thanks and regards,
> > Hiago.
> 
> I think we need to figure out whether the problem is host->device or
> device->host.
> Or perhaps you do already know this??
> Maybe this was mentioned already earlier in the thread and I missed it.
> 

Actually I still don't know yet.

> Try reducing only in or only out... and changing only one side of the
> connection.

I did this test and it was very interesting, thanks for the
reccomendation. 
It looks like only NTB_DEFAULT_IN_SIZE is relevant for
this issue. I changed NTB_DEFAULT_IN_SIZE from 16384 to 2048 on the
_device_ side (Apalis iMX6) and left the NTB_OUT_SIZE with the default
value of 16384, and now I can use my VNC client on the host side (move the
mouse, click some buttons...). Very fast and responsive I would say.

Does it mean the issue is on the HOST side?

> Maybe change TX_MAX_NUM_DPE to 1 (or other values)
> 
> Maybe change
> 
> if (ncm->skb_tx_data
> && (ncm->ndp_dgram_count >= TX_MAX_NUM_DPE
> || (ncm->skb_tx_data->len +
> div + rem + skb->len +
> ncm->skb_tx_ndp->len + ndp_align + (2 * dgram_idx_len))
> > max_size)) {
> 
> to just if (ncm->skb_tx_data)

Changing the 'if' statement to only 'if (ncm->skb_tx_data)' also made it
work! VNC running smoothly on my VNC client. Again, I applied this patch
to the device's kernel, Apalis iMX6.

I will put some print debugging here to see if I can figure out
something else.

Thanks a lot,
Hiago.