Hi Maciej and Oliver, On Mon, Dec 11, 2023 at 12:44:42PM -0800, Maciej Żenczykowski wrote: > On Mon, Dec 11, 2023 at 12:29 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote: > > > > On Thu, Dec 07, 2023 at 08:37:09PM +0100, Maciej Żenczykowski wrote: > > > On Thu, Dec 7, 2023 at 7:57 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote: > > > > > > > > Hello Oliver and Maciej, > > > > > > > > On Thu, Dec 07, 2023 at 10:41:51AM +0100, Oliver Neukum wrote: > > > > > Can you tell us which side in your test case produces many small packets? > > > > > > > > > > Furthermore, for testing purposes, could you decrease TX_TIMEOUT_NSECS in > > > > > f_ncm.c by an order of magnitude. > > > > > > > > The device side, Apalis iMX6, is the one producing the small packages > > > > and sending to my host PC. The VNC server is running on iMX6 device and, > > > > the client, on my host PC. > > > > > > > > I've decreased TX_TIMEOUT_NSECS from 300000 to 30000, but nothing > > > > changed, the behaviour is the same. > > > > > > > > On Thu, Dec 07, 2023 at 12:07:25PM +0100, Oliver Neukum wrote: > > > > > That suggests, but does not prove that the issue is on the host side. > > > > > Could you post the result of "ethtool -S" after a test run? We should > > > > > get statistics on the reasons for transmissions that way. > > > > > > > > Sure, for some reason I couldn't run ethtool on the iMX6 device: > > > > > > > > / # ethtool -S usb0 > > > > no stats available > > > > > > > > So I ran everything on my Debian host PC. First, without any changes on > > > > the device's kernel, this is the result (where the VNC is really > > > > slow/frozen): > > > > > > > > $ sudo ethtool -S enx3e5dcdead75e > > > > NIC statistics: > > > > tx_reason_ntb_full: 0 > > > > tx_reason_ndp_full: 0 > > > > tx_reason_timeout: 222 > > > > tx_reason_max_datagram: 0 > > > > tx_overhead: 42387 > > > > tx_ntbs: 222 > > > > rx_overhead: 38375 > > > > rx_ntbs: 529 > > > > > > > > Next, I decreased NTB_DEFAULT_IN_SIZE and NTB_OUT_SIZE from 16384 to > > > > 8192. The performance improved a bit, and this is the result: > > > > > > > > $ sudo ethtool -S enx42ff68c1000a > > > > NIC statistics: > > > > tx_reason_ntb_full: 0 > > > > tx_reason_ndp_full: 0 > > > > tx_reason_timeout: 321 > > > > tx_reason_max_datagram: 0 > > > > tx_overhead: 61617 > > > > tx_ntbs: 321 > > > > rx_overhead: 59341 > > > > rx_ntbs: 759 > > > > > > > > Finally, I changed from 8192 to 4096, and the perfomance was > > > > better: > > > > > > > > $ sudo ethtool -S enx3a601e306de1 > > > > NIC statistics: > > > > tx_reason_ntb_full: 0 > > > > tx_reason_ndp_full: 0 > > > > tx_reason_timeout: 56067 > > > > tx_reason_max_datagram: 0 > > > > tx_overhead: 83630876 > > > > tx_ntbs: 56064 > > > > rx_overhead: 25437595 > > > > rx_ntbs: 847920 > > > > > > > > At 4096 I can use the VNC with my app, click on buttons and see the mouse > > > > moving smoothly. Please note the device name changes because we're using > > > > random MAC addresses. 'ethtool' was running on my Debian host PC. I tested > > > > for 1min30s and then got the statics with ethtool for all 3 tests. > > > > > > > > On Thu, Dec 07, 2023 at 12:38:18PM +0100, Maciej Żenczykowski wrote: > > > > > > > > > > An every 1s (the default) ping is too rare to be of help I'd assume... > > > > > Try ping with various intervals (-i). All the way down to a flood ping (-f). > > > > > Most likely -i 0.01 would be enough to make things work better... > > > > > > > > This one is interesting, backing to the default value of 16384, I > > > > launched the VNC client which now is back to be slow/frozen. In > > > > parallel, I started the ping command. The first one with 1 second > > > > results the following (I'm pingging my host PC, using my device's > > > > terminal): > > > > > > > > / # ping 192.168.11.2 -i 1 > > > > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data. > > > > 64 bytes from 192.168.11.2: icmp_seq=1 ttl=64 time=5070 ms > > > > 64 bytes from 192.168.11.2: icmp_seq=2 ttl=64 time=4003 ms > > > > 64 bytes from 192.168.11.2: icmp_seq=3 ttl=64 time=2963 ms > > > > 64 bytes from 192.168.11.2: icmp_seq=4 ttl=64 time=1923 ms > > > > 64 bytes from 192.168.11.2: icmp_seq=5 ttl=64 time=883 ms > > > > ^C > > > > --- 192.168.11.2 ping statistics --- > > > > 26 packets transmitted, 5 received, 80.7692% packet loss, time 25950ms > > > > rtt min/avg/max/mdev = 882.886/2968.250/5069.878/1478.425 ms, pipe 5 > > > > > > > > Ping is really slow and lost almost all packets. > > > > > > It is *slow*. But I don't think you're actually losing packets. > > > They simply haven't arrived *yet*. > > > Notice that the packets are sent every second, but they arrive after > > > ~5/~4/~3/~2/~1 s. > > > This suggests they all got delivered in one burst. > > > > > > You are missing 21 packets, but it's all the *final* 21 packets. > > > You'd need to run for a long long time (maybe 5 minutes or even more) > > > and see what happens then... > > > to see if the packets are indeed lost, or just not yet delivered... > > > > > > > Great, thanks for the explanation. Indeed, I ran the tests again, but > > now I waited 6 minutes for each test, and these are the results: > > > > -i 1 > > --- 192.168.11.2 ping statistics --- > > 349 packets transmitted, 349 received, 0% packet loss, time 361724ms > > rtt min/avg/max/mdev = 1.597/10751.276/27960.848/12974.107 ms, pipe 27 > > > > -i 0.1 > > --- 192.168.11.2 ping statistics --- > > 3301 packets transmitted, 3294 received, 0.212057% packet loss, time 361342ms > > rtt min/avg/max/mdev = 0.959/1131.714/3081.504/1385.261 ms, pipe 29 > > > > -i 0.01 > > --- 192.168.11.2 ping statistics --- > > 20718 packets transmitted, 20718 received, 0% packet loss, time 366465ms > > rtt min/avg/max/mdev = 0.568/178.764/587.956/234.702 ms, pipe 29 > > > > -flood > > --- 192.168.11.2 ping statistics --- > > 29173 packets transmitted, 29149 received, 0.0822679% packet loss, time 362065ms > > rtt min/avg/max/mdev = 0.518/127.472/583.971/204.277 ms, pipe 29, ipg/ewma 12.411/21.570 ms > > > > So yes, we can assume 0% of packet loss. Furthermore, I can see the > > bursts happening on my terminal, the ping command doesn't do anything > > and suddenly all the pings arrive at the same time, over and over again > > in a loop. > > Ok, this is great to know. No loss, just very very high delay. > > > > > Next test, I decreased > > > > to 0.1s: > > > > > > > > / # ping 192.168.11.2 -i 0.1 > > > > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data. > > > > ... > > > > --- 192.168.11.2 ping statistics --- > > > > 129 packets transmitted, 122 received, 5.42636% packet loss, time 13971ms > > > > rtt min/avg/max/mdev = 1.752/999.274/2751.767/799.248 ms, pipe 26 > > > > > > > > While ping is running in parallel, VNC has a better performance, I can > > > > see my mouse running and click on some buttons. As soon as ping stops, > > > > VNC is slow/frozen again. Also we have less packet loss. > > > > > > Again - this packet loss may be just at the tail. > > > You'd need to run it for much much longer to see if packet loss is high or not. > > > It may just be that you always lose the last ~7 packets or so... > > > > > > > > > > > Next test, decreased to 0.01: > > > > > > > > / # ping 192.168.11.2 -i 0.01 > > > > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data. > > > > ... > > > > --- 192.168.11.2 ping statistics --- > > > > 584 packets transmitted, 572 received, 2.05479% packet loss, time 10106ms > > > > rtt min/avg/max/mdev = 1.565/171.031/559.872/165.475 ms, pipe 28 > > > > > > > > And finally, the flood: > > > > > > > > / # ping 192.168.11.2 -f > > > > PING 192.168.11.2 (192.168.11.2) 56(84) bytes of data. > > > > ... > > > > --- 192.168.11.2 ping statistics --- > > > > 1314 packets transmitted, 1311 received, 0.22831% packet loss, time 16299ms > > > > rtt min/avg/max/mdev = 0.729/131.710/548.971/163.098 ms, pipe 28, ipg/ewma > > > > 12.413/7.802 ms > > > > > > > > While the flood is happening in parallel, the VNC runs very smoothly, > > > > and, again, as soon as it stops, it's back to slow/frozen. > > > > > > > > I believe here the ping command is helping to fullfil the buffer, that's > > > > why running it on parallel makes the VNC work... > > > > > > > > > Also which specific versions of the kernel are involved on both sides > > > > > of the connection. > > > > > > > > Device iMX6 is running Linux kernel v6.1.65, while my host PC is running > > > > Linux kernel v6.5.0. > > > > > > > > > There was a pretty recent fix related to packet aggregation recently > > > > > that could be either the fix or the cause. > > > > > "usb: gadget: ncm: Handle decoding of multiple NTB's in unwrap call" > > > > > Though I doubt it - I believe that was specific to how windows packs things. > > > > > > > > > > Also Krishna Kurapati has a (afaik still not merged) patch "usb: > > > > > gadget: ncm: Add support to configure wMaxSegmentSize" > > > > > that could be of use - though again, doubt it. > > > > > > > > I could also try to apply these patches and check how it goes. Thanks > > > > for the information. > > > > > > > > > Another thing that comes to mind, is that perhaps the device in > > > > > question does not have sufficiently high res timers? > > > > > There might be something in the kernel boot log / dmesg about hrtimer > > > > > resolution... > > > > > Maybe this just needs to be configurable... Or pick a smaller value > > > > > with broken hrtimer (if that's the issue), > > > > > or just disable aggregation if lowres hrtimers... etc... > > > > > > > > > > #define TX_TIMEOUT_NSECS 300000 > > > > > 300 us is too small to be noticeable by VNC imho, so I think something > > > > > *must* be misbehaving. > > > > > Perhaps timer resolution is bad and this 300us ends up being much larger...??? > > > > > > > > This is what I got from dmesg inside the iMX6 device: > > > > > > > > / # dmesg | grep timer > > > > [ 0.000000] Switching to timer-based delay loop, resolution 333ns > > > > [ 0.000019] clocksource: mxc_timer1: mask: 0xffffffff max_cycles: > > > > 0xffffffff, max_idle_ns: 637086815595 ns > > > > [ 0.001545] Calibrating delay loop (skipped), value calculated using > > > > timer frequency.. 6.00 BogoMIPS (lpj=30000) > > > > [ 0.203469] clocksource: Switched to clocksource mxc_timer1 > > > > > > > > > I wonder if the hrtimer_init() call should be with CLOCK_BOOTTIME > > > > > instead of CLOCK_MONOTONIC. > > > > > There could potentially be an issue with suspend, though I really doubt it. > > > > > > > > Also tested this, but it didn't change anything, VNC is still > > > > slow/frozen. > > > > > > > > > Another idea would be to add a gadget setting to disable tx > > > > > aggregation entirely... > > > > > (note that reducing from 8000 to 2000 doesn't actually turn off aggregation...) > > > > > > > > > > Have you tried reducing from 8000 to 4000 or 3500 or 3000? > > > > > Maybe there's some funkiness with page sizes?? > > > > > > > > Sorry, do you mean reducing the both NTB_DEFAULT_IN_SIZE and NTB_OUT_SIZE? > > > > > > I'm not sure ;-) > > > > > > > If so, I tried to reduce from 16384 to 8192, 4096 and 2048. Everytime I > > > > reduced the value, it got better and better, VNC running smoothly. > > > > > > > > I will keep testing to see if I get something. > > > > > > > > Thanks and regards, > > > > Hiago. > > > > > > I think we need to figure out whether the problem is host->device or > > > device->host. > > > Or perhaps you do already know this?? > > > Maybe this was mentioned already earlier in the thread and I missed it. > > > > > > > Actually I still don't know yet. > > Heh. > > > > > > Try reducing only in or only out... and changing only one side of the > > > connection. > > > > I did this test and it was very interesting, thanks for the > > reccomendation. > > It looks like only NTB_DEFAULT_IN_SIZE is relevant for > > this issue. I changed NTB_DEFAULT_IN_SIZE from 16384 to 2048 on the > > _device_ side (Apalis iMX6) and left the NTB_OUT_SIZE with the default > > value of 16384, and now I can use my VNC client on the host side (move the > > mouse, click some buttons...). Very fast and responsive I would say. > > > > Does it mean the issue is on the HOST side? > > Well... that's what I would assume based on that, but... > > I think what we've really found is a bug: > unsigned max_size = ncm->port.fixed_in_len; > is I think using fixed_in_len instead of using out_len. > > *OR* > > This is the gadget/device side and IN/OUT are always from the point of > view of the host. > ie. IN to the host, so transmitted by the gadget/device. > > It's bound to be one or the other. And while initially I thought the former, > I think it's likely the latter. > I vaguelly recall having run into something with in/out being > host-relative somewhere before. > > > > Maybe change TX_MAX_NUM_DPE to 1 (or other values) > > > > > > Maybe change > > > > > > if (ncm->skb_tx_data > > > && (ncm->ndp_dgram_count >= TX_MAX_NUM_DPE > > > || (ncm->skb_tx_data->len + > > > div + rem + skb->len + > > > ncm->skb_tx_ndp->len + ndp_align + (2 * dgram_idx_len)) > > > > max_size)) { > > > > > > to just if (ncm->skb_tx_data) > > > > Changing the 'if' statement to only 'if (ncm->skb_tx_data)' also made it > > work! VNC running smoothly on my VNC client. Again, I applied this patch > > to the device's kernel, Apalis iMX6. > > Here you modified only the device's tx - basically turning off > buffering/aggregation. > (and likely poorly at that since - I think - this approach still ends > up buffering one packet, > because it seems to send the previous packet when it gets the next one) > > ...and in general it seems more likely something is busted on the > device, then on the host... > > > I will put some print debugging here to see if I can figure out > > something else. > > Everything is pointing towards the timer not firing... > Perhaps try instrumenting the timer callback to see if it ever runs. > ncm_tx_timeout <-- add a printk in here > See if the timer registration functions aren't returning non-caught errors. > hrtimer_try_to_cancel - returns an int - it's not checked. - maybe > this always claims there's no timer to cancel? > but afaict: > hrtimer_start - does not return anything > hrtimer_init - also does not appear to return anything... I was thinking about the same, the timer is not firing at all. I put the printk's as suggested, and the ncm_tx_is being triggered. Basically, when I first turn on my device, the try_to_cancel function is always returning -1, which means "when the timer is currently executing the callback function and cannot be stopped". When I launch my VNC client, a see the following pattern: [ 94.926136] hfranco: hrtimer_start called. [ 94.926259] hfranco: try_to_cancel 1 [ 94.926268] hfranco: hrtimer_start called. [ 94.926391] hfranco: try_to_cancel 1 The hrtimer is called and then try_to_cancel returns 1, which means it was canceled "when the timer was active" (following the function comment). So I guess the hrtimer is somehow working as expected (at least being called). I also calculated the time between the consecutive messages of timer started and trying to cancel the timer, and this is what I got: 94.926259 - 94.926136 = 0.000123 seconds 94.926391 - 94.926268 = 0.000123 seconds 94.927548 - 94.926401 = 0.001147 seconds 94.927818 - 94.927585 = 0.000233 seconds 94.928338 - 94.927836 = 0.000502 seconds 94.928590 - 94.928360 = 0.000230 seconds 94.929330 - 94.928608 = 0.000722 seconds 94.929609 - 94.929363 = 0.000246 seconds I'm not sure how accurate this is, but we can clearly sometimes see it gets more than 300us. On Tue, Dec 12, 2023 at 10:48:01AM +0100, Oliver Neukum wrote: > On 11.12.23 21:44, Maciej Żenczykowski wrote: > > On Mon, Dec 11, 2023 at 12:29 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote: > > > > > > On Thu, Dec 07, 2023 at 08:37:09PM +0100, Maciej Żenczykowski wrote: > > > > On Thu, Dec 7, 2023 at 7:57 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote: > > Hi, > > > > > > On Thu, Dec 07, 2023 at 12:07:25PM +0100, Oliver Neukum wrote: > > > > > > That suggests, but does not prove that the issue is on the host side. > > > > > > Could you post the result of "ethtool -S" after a test run? We should > > > > > > get statistics on the reasons for transmissions that way. > > > > > > Finally, I changed from 8192 to 4096, and the perfomance was > > > > > better: > > > > > > > > > > $ sudo ethtool -S enx3a601e306de1 > > > > > NIC statistics: > > > > > tx_reason_ntb_full: 0 > > > > > tx_reason_ndp_full: 0 > > > > > tx_reason_timeout: 56067 > > This has grown two orders of magnitude. > > > > > > tx_reason_max_datagram: 0 > > > > > tx_overhead: 83630876 > > > > > tx_ntbs: 56064 > > > > > rx_overhead: 25437595 > > > > > rx_ntbs: 847920 > > > > > > > > > > At 4096 I can use the VNC with my app, click on buttons and see the mouse > > > > > moving smoothly. Please note the device name changes because we're using > > > > > random MAC addresses. 'ethtool' was running on my Debian host PC. I tested > > > > > for 1min30s and then got the statics with ethtool for all 3 tests. > > As you are testing for a constant time, the increase in transmissions > due to timeouts also decreases latency by two orders of magnitude. > Though this does not ultimately tell us which side is responsible. > > > > > While the flood is happening in parallel, the VNC runs very smoothly, > > > > > and, again, as soon as it stops, it's back to slow/frozen. > > > > > > > > > > I believe here the ping command is helping to fullfil the buffer, that's > > > > > why running it on parallel makes the VNC work... > > Indeed. You can confirm this by running "ethool -S" before and after the ping. > During the ping tx_reason_timeout should stagnate and probably tx_reason_ndp_full > will go up. > > > > > > > #define TX_TIMEOUT_NSECS 300000 > > > > > > 300 us is too small to be noticeable by VNC imho, so I think something > > > > > > *must* be misbehaving. > > > > > > Perhaps timer resolution is bad and this 300us ends up being much larger...??? > > Now that you mention it and have taken a closer look I suspect this piece of code: > > } else if ((n < ctx->tx_max_datagrams) && (ready2send == 0) && (ctx->timer_interval > 0)) { > /* wait for more frames */ > /* push variables */ > ctx->tx_curr_skb = skb_out; > /* set the pending count */ > if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT) > ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT; > goto exit_no_skb; > > Hiago, could you try lowering CDC_NCM_TIMER_PENDING_CNT, if need be all the way to 1? > It is defined in include/linux/usb/cdc_ncm.h as 3 currently > This applies to the host side. On my side CDC_NCM_TIMER_PENDING_CNT is set to 2 by default, did you mean CDC_NCM_RESTART_TIMER_DATAGRAM_CNT? Despite of that, I tried to lower both CDC_NCM_TIMER_PENDING_CNT and CDC_NCM_RESTART_TIMER_DATAGRAM_CNT all the way down to 1, first the CDC_NCM_TIMER_PENDING_CNT, then CDC_NCM_RESTART_TIMER_DATAGRAM_CNT and finally both at the same time, but it didn't help. I've also put some printks as following: ctx->tx_curr_frame_num = n; printk("hfranco: tx_curr_frame_num = %d", n); if (n == 0) { printk("hfranco: n == 0"); /* wait for more frames */ /* push variables */ ctx->tx_curr_skb = skb_out; goto exit_no_skb; } else if ((n < ctx->tx_max_datagrams) && (ready2send == 0) && (ctx->timer_interval > 0)) { printk("hfranco: tx_max_datagrams = %d", ctx->tx_max_datagrams); printk("hfranco: timer_interval = %d", ctx->timer_interval); printk("hfranco: n inside else if = %d", n); /* wait for more frames */ /* push variables */ ctx->tx_curr_skb = skb_out; /* set the pending count */ if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT) ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT; goto exit_no_skb; } else { printk("hfranco: n inside else = %d", n); if (n == ctx->tx_max_datagrams) ctx->tx_reason_max_datagram++; /* count reason for transmitting */ I ran it on my host PC, compiled it as module for my Debian dekstop, and this is the dmesg: [ 9663.478807] hfranco: tx_curr_frame_num = 1 [ 9663.478816] hfranco: tx_max_datagrams = 40 [ 9663.478818] hfranco: timer_interval = 400000 [ 9663.478820] hfranco: n inside else if = 1 [ 9663.478822] hfranco: timer started [ 9663.479645] hfranco: tx_curr_frame_num = 1 [ 9663.479652] hfranco: n inside else = 1 And then it basically repeats. Looks like 'n' never passes the 1 value. By tweaking the flags mentioned before, 'n' got a value of 4, but that was the maximum value. I was wondering, why do you think this code looks suspicious? I basically just inserted some printks on the tx side, I will see if I can get something from the rx as well. > > Regards > Oliver Thanks a lot and regards, Hiago.