Hi Oliver and Maciej,

Sorry for the delayed reply. I took some time to set up some debugging
tools, hoping to find something more interesting. I couldn't find too much,
I guess, but I will describe the results here.

First, I set up ftrace to track the 'package_for_tx' function. My idea was
that if the delay between two consecutive calls of this function were more
than 300us, it would indicate the code was not respecting the hrtimer. The
trace log looks something like this:

# tracer: function
#
# entries-in-buffer/entries-written: 468/468   #P:2
#
#                                _-----=> irqs-off/BH-disabled
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| / _-=> migrate-disable
#                              |||| /     delay
#           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
#              | |         |   |||||     |         |
            tezi-237     [001] dns..   101.030667: package_for_tx <-ncm_wrap_ntb
          <idle>-0       [000] d.s..   101.032771: package_for_tx <-ncm_wrap_ntb
          weston-228     [001] d.s..   101.039430: package_for_tx <-ncm_wrap_ntb
            tezi-237     [000] d.s..   101.041634: package_for_tx <-ncm_wrap_ntb
          <idle>-0       [001] d.s..   101.042326: package_for_tx <-ncm_wrap_ntb
          <idle>-0       [001] d.s..   101.055665: package_for_tx <-ncm_wrap_ntb
          weston-231     [001] d.s..   101.057779: package_for_tx <-ncm_wrap_ntb
          <idle>-0       [001] d.s..   101.068630: package_for_tx <-ncm_wrap_ntb
          weston-231     [001] d.s..   101.085738: package_for_tx <-ncm_wrap_ntb
            tezi-237     [000] D.s..   101.088611: package_for_tx <-ncm_wrap_ntb

So I created a simple script that calculates the delta between two
consecutive calls and reports how many of those deltas are above 300us.
This is the result:

Total timestamps above 300 microseconds: 440 of 742
Percentage of timestamps above 300 microseconds: 59.30%

Then I decided to lower TX_MAX_NUM_DPE from 32 down to 0. Both 0 and 4 made
the VNC run smoothly. I also tested 8 and 16, but those two didn't help.
Again, with the value set to 0, this is the result:

Total timestamps above 300 microseconds: 3149 of 65420
Percentage of timestamps above 300 microseconds: 4.81%

Of course we also have more packets now, as the VNC is working and no longer
frozen. It's interesting how much faster this is, but as discussed with a
colleague, we don't know if this tells us much. There are more latencies
involved, for example from the scheduler, so the result is probably not
showing anything valuable.
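For reference, this is roughly what the script does (a minimal Python
sketch, not the exact script I ran; it assumes the function-tracer output
format shown above and takes a saved copy of the trace file as argument):

#!/usr/bin/env python3
# Sketch: count gaps between consecutive package_for_tx calls above 300us.
import re
import sys

THRESHOLD_US = 300
STAMP_RE = re.compile(r"\s(\d+\.\d+): package_for_tx")

stamps = []
with open(sys.argv[1]) as trace:               # saved /sys/kernel/tracing/trace
    for line in trace:
        m = STAMP_RE.search(line)
        if m:
            stamps.append(float(m.group(1)))   # timestamp in seconds

deltas = [(b - a) * 1e6 for a, b in zip(stamps, stamps[1:])]   # in microseconds
above = sum(1 for d in deltas if d > THRESHOLD_US)

if deltas:
    print("Total timestamps above %d microseconds: %d of %d"
          % (THRESHOLD_US, above, len(deltas)))
    print("Percentage of timestamps above %d microseconds: %.2f%%"
          % (THRESHOLD_US, 100.0 * above / len(deltas)))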
My second test was trying to figure out how TX_MAX_NUM_DPE was impacting my
issue, so I added two printks to check the value ndp_dgram_count reaches and
when it gets zeroed:

/* Delay for the transmit to wait before sending an unfilled NTB frame. */
#define TX_TIMEOUT_NSECS	300000

@@ -964,6 +964,7 @@ static struct sk_buff *package_for_tx(struct f_ncm *ncm)
 	/* Set the final NDP wLength */
 	new_len = opts->ndp_size + (ncm->ndp_dgram_count * dgram_idx_len);
+	printk("hfranco: package_for_tx called, setting ndp_dram_count to 0");
 	ncm->ndp_dgram_count = 0;
 	/* Increment from start to wLength */
 	ntb_iter = (void *) ncm->skb_tx_ndp->data;

@@ -1027,6 +1028,8 @@ static struct sk_buff *ncm_wrap_ntb(struct gether *port,
 	 * ready for new data.
 	 * NOTE: Assume maximum align for speed of calculation.
 	 */
+
+	printk("hfranco: ncm->ndp_dgram_count = %d", ncm->ndp_dgram_count);
 	if (ncm->skb_tx_data
 	    && (ncm->ndp_dgram_count >= TX_MAX_NUM_DPE
 		|| (ncm->skb_tx_data->len +

This is the result:

[ 6613.034304] hfranco: package_for_tx called, setting ndp_dram_count to 0
[ 6613.034783] hfranco: ncm->ndp_dgram_count = 0
[ 6613.034812] hfranco: ncm->ndp_dgram_count = 2
[ 6613.034823] hfranco: ncm->ndp_dgram_count = 3
[ 6613.034835] hfranco: ncm->ndp_dgram_count = 4
[ 6613.034844] hfranco: ncm->ndp_dgram_count = 5
[ 6613.034855] hfranco: ncm->ndp_dgram_count = 6
[ 6613.034865] hfranco: ncm->ndp_dgram_count = 7
[ 6613.034874] hfranco: ncm->ndp_dgram_count = 8
[ 6613.034884] hfranco: ncm->ndp_dgram_count = 9
[ 6613.034895] hfranco: ncm->ndp_dgram_count = 10
[ 6613.034905] hfranco: ncm->ndp_dgram_count = 11
[ 6613.034914] hfranco: package_for_tx called, setting ndp_dram_count to 0
[ 6613.035266] hfranco: ncm->ndp_dgram_count = 2
[ 6613.035281] hfranco: ncm->ndp_dgram_count = 3
[ 6613.035292] hfranco: ncm->ndp_dgram_count = 4
[ 6613.035301] hfranco: ncm->ndp_dgram_count = 5
[ 6613.035310] hfranco: ncm->ndp_dgram_count = 6
[ 6613.035320] hfranco: ncm->ndp_dgram_count = 7
[ 6613.035329] hfranco: ncm->ndp_dgram_count = 8
[ 6613.035339] hfranco: ncm->ndp_dgram_count = 9
[ 6613.035350] hfranco: ncm->ndp_dgram_count = 10
[ 6613.035360] hfranco: ncm->ndp_dgram_count = 11
[ 6613.035366] hfranco: package_for_tx called, setting ndp_dram_count to 0

So, in general, the value never goes above 10 or 11 in any of the tests I
ran with the VNC. As I said earlier, setting TX_MAX_NUM_DPE to 4 made the
VNC work:

[   17.333014] hfranco: package_for_tx called, setting ndp_dram_count to 0
[   17.333065] hfranco: ncm->ndp_dgram_count = 2
[   17.333079] hfranco: ncm->ndp_dgram_count = 3
[   17.333105] hfranco: ncm->ndp_dgram_count = 4
[   17.333114] hfranco: package_for_tx called, setting ndp_dram_count to 0
[   17.333161] hfranco: ncm->ndp_dgram_count = 2
[   17.333173] hfranco: ncm->ndp_dgram_count = 3
[   17.333184] hfranco: ncm->ndp_dgram_count = 4
[   17.333191] hfranco: package_for_tx called, setting ndp_dram_count to 0
[   17.333235] hfranco: ncm->ndp_dgram_count = 2
[   17.333259] hfranco: ncm->ndp_dgram_count = 3
[   17.333271] hfranco: ncm->ndp_dgram_count = 4
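I read these numbers off the dmesg output by eye, but to summarize them more
systematically one could run a small helper over the log, something like
this throwaway sketch (it only assumes the printk format shown above):

#!/usr/bin/env python3
# Sketch: histogram of ndp_dgram_count values seen right before each flush.
import re
import sys
from collections import Counter

COUNT_RE = re.compile(r"ncm->ndp_dgram_count = (\d+)")

last = 0
at_flush = Counter()
with open(sys.argv[1]) as log:        # dmesg output saved to a file
    for line in log:
        m = COUNT_RE.search(line)
        if m:
            last = int(m.group(1))
        elif "package_for_tx called" in line:
            at_flush[last] += 1       # count at the moment the NTB is closed

print(at_flush.most_common())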
On Tue, Dec 12, 2023 at 10:21:13PM +0100, Oliver Neukum wrote:
> On 12.12.23 21:32, Hiago De Franco wrote:
> > Hi,
> >
> > On Mon, Dec 11, 2023 at 12:44:42PM -0800, Maciej Żenczykowski wrote:
> > > On Mon, Dec 11, 2023 at 12:29 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote:
> > >
> > > Hiago, could you try lowering CDC_NCM_TIMER_PENDING_CNT, if need be all the way to 1?
> > > It is defined in include/linux/usb/cdc_ncm.h as 3 currently
> > > This applies to the host side.
> >
> > On my side CDC_NCM_TIMER_PENDING_CNT is set to 2 by default, did you
> > mean CDC_NCM_RESTART_TIMER_DATAGRAM_CNT?
>
> Yes, I meant that. Sorry.
>
> > Despite of that, I tried to lower both CDC_NCM_TIMER_PENDING_CNT and
> > CDC_NCM_RESTART_TIMER_DATAGRAM_CNT all the way down to 1, first the
> > CDC_NCM_TIMER_PENDING_CNT, then CDC_NCM_RESTART_TIMER_DATAGRAM_CNT and
> > finally both at the same time, but it didn't help.
> >
> > I've also put some printks as following:
> >
> > ctx->tx_curr_frame_num = n;
> > printk("hfranco: tx_curr_frame_num = %d", n);
> >
> > if (n == 0) {
> > 	printk("hfranco: n == 0");
> > 	/* wait for more frames */
> > 	/* push variables */
> > 	ctx->tx_curr_skb = skb_out;
> > 	goto exit_no_skb;
> >
> > } else if ((n < ctx->tx_max_datagrams) && (ready2send == 0) && (ctx->timer_interval > 0)) {
> > 	printk("hfranco: tx_max_datagrams = %d", ctx->tx_max_datagrams);
> > 	printk("hfranco: timer_interval = %d", ctx->timer_interval);
> > 	printk("hfranco: n inside else if = %d", n);
> > 	/* wait for more frames */
> > 	/* push variables */
> > 	ctx->tx_curr_skb = skb_out;
> > 	/* set the pending count */
> > 	if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT)
> > 		ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT;
> > 	goto exit_no_skb;
> >
> > } else {
> > 	printk("hfranco: n inside else = %d", n);
> > 	if (n == ctx->tx_max_datagrams)
> > 		ctx->tx_reason_max_datagram++;	/* count reason for transmitting */
> >
> > I ran it on my host PC, compiled it as module for my Debian dekstop, and
> > this is the dmesg:
> >
> > [ 9663.478807] hfranco: tx_curr_frame_num = 1
> > [ 9663.478816] hfranco: tx_max_datagrams = 40
> > [ 9663.478818] hfranco: timer_interval = 400000
> > [ 9663.478820] hfranco: n inside else if = 1
> > [ 9663.478822] hfranco: timer started
> > [ 9663.479645] hfranco: tx_curr_frame_num = 1
> > [ 9663.479652] hfranco: n inside else = 1
> >
> > And then it basically repeats. Looks like 'n' never passes the 1 value.
> > By tweaking the flags mentioned before, 'n' got a value of 4, but that
> > was the maximum value. I was wondering, why do you think this code looks
> > suspicious? I basically just inserted some printks on the tx side, I
> > will see if I can get something from the rx as well.
>
> If we look at the statistics you initially gathered, we can see that all transmissions
> on the host side happen because the timeout elapses. That, however, does
> _not_ tell us that the host is to blame. We could look at two possible scenarios
>
> A - the gadget is bundling up the packets with too much delay and the host
> just answers to the megatransmissions with one packet and the delay on the host
> is inconsequential
>
> B - the timer on the host runs for too long or sometimes not at all. If that were
> the case that code I pointed out would be most likely to blame
>
> Could I suggest we try to localize the issue? Can you ping the host from the device?

Lastly, about Oliver's question: I tried sending UDP packets with netcat,
first from host to device and then from device to host. I was sending a
single small packet in a loop, first with a 1 second interval, then with
0.5 second and 0.1 second intervals. My hope was that this test would show
which side is failing, host or device.
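The sending side was just netcat in a shell loop; in Python terms it
amounted to roughly the following (a sketch only, using the same destination
address and port as the nc command quoted below, with the interval as a
parameter):

#!/usr/bin/env python3
# Sketch: send one small UDP datagram per interval, like the nc loop I used.
import socket
import sys
import time

dest = ("192.168.11.2", 5201)     # peer address/port on the NCM link
interval = float(sys.argv[1])     # 1, 0.5 or 0.1 seconds

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
while True:
    sock.sendto(b"hi\n", dest)
    time.sleep(interval)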
With the 1 second interval it pretty much works in both directions. As an
example, this is the output on my host PC while the device was sending
packets with "echo 'hi' | nc -u 192.168.11.2 5201":

$ sudo tcpdump -i enx9627798c3fa3
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on enx9627798c3fa3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:47:51.865967 IP apalis-imx6-10692086.local.54882 > hdebian.5201: UDP, length 3
17:47:51.866051 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39
17:47:52.145987 IP hdebian.mdns > mdns.mcast.net.mdns: 0 [b2&3=0x200] [2a] PTR (QM)? 1.11.168.192.in-addr.arpa. (134)
17:47:52.146020 IP hdebian.mdns > mdns.mcast.net.mdns: 0 [0q] (12)
17:47:52.198689 IP apalis-imx6-10692086.local.mdns > mdns.mcast.net.mdns: 0*- [0q] 1/0/0 (Cache flush) PTR apalis-imx6-10692086.local. (77)
17:47:52.882532 IP apalis-imx6-10692086.local.37745 > hdebian.5201: UDP, length 3
17:47:52.882634 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39
17:47:53.903508 IP apalis-imx6-10692086.local.38147 > hdebian.5201: UDP, length 3
17:47:53.903585 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39
17:47:54.920174 IP apalis-imx6-10692086.local.60718 > hdebian.5201: UDP, length 3
17:47:54.920265 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39
17:47:55.940687 IP apalis-imx6-10692086.local.38729 > hdebian.5201: UDP, length 3
17:47:55.940743 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39
17:47:56.913135 ARP, Request who-has apalis-imx6-10692086.local tell hdebian, length 28
17:47:56.915052 ARP, Reply apalis-imx6-10692086.local is-at ba:6d:d0:0a:95:6b (oui Unknown), length 28
17:47:56.956927 IP apalis-imx6-10692086.local.54649 > hdebian.5201: UDP, length 3
17:47:56.956966 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39
17:47:57.980879 IP apalis-imx6-10692086.local.53366 > hdebian.5201: UDP, length 3
17:47:57.980984 IP hdebian > apalis-imx6-10692086.local: ICMP hdebian udp port 5201 unreachable, length 39

When I lower the interval between two packets from 1 second to 0.5 or 0.1
seconds, it suddenly freezes. That happens in both directions, host to
device and device to host. As an example, this is the host sending UDP
packets to the device with a 0.1 second interval, captured on the device:

/ # tcpdump -i usb0
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on usb0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:39:53.662594 IP 192.168.11.2.54802 > 192.168.11.1.5201: UDP, length 3
18:39:53.662716 IP 192.168.11.1 > 192.168.11.2: ICMP 192.168.11.1 udp port 5201 unreachable, length 39
18:39:53.768678 IP 192.168.11.2.49080 > 192.168.11.1.5201: UDP, length 3
18:39:53.768767 IP 192.168.11.1 > 192.168.11.2: ICMP 192.168.11.1 udp port 5201 unreachable, length 39
18:39:53.875305 IP 192.168.11.2.56938 > 192.168.11.1.5201: UDP, length 3
18:39:53.875376 IP 192.168.11.1 > 192.168.11.2: ICMP 192.168.11.1 udp port 5201 unreachable, length 39
18:39:53.982238 IP 192.168.11.2.40177 > 192.168.11.1.5201: UDP, length 3
18:39:53.982354 IP 192.168.11.1 > 192.168.11.2: ICMP 192.168.11.1 udp port 5201 unreachable, length 39
18:39:54.089279 IP 192.168.11.2.43284 > 192.168.11.1.5201: UDP, length 3
18:39:54.089378 IP 192.168.11.1 > 192.168.11.2: ICMP 192.168.11.1 udp port 5201 unreachable, length 39
18:39:54.196027 IP 192.168.11.2.41990 > 192.168.11.1.5201: UDP, length 3
18:39:54.196105 IP 192.168.11.1 > 192.168.11.2: ICMP 192.168.11.1 udp port 5201 unreachable, length 39
18:39:54.303793 IP 192.168.11.2.42362 > 192.168.11.1.5201: UDP, length 3
18:39:58.664608 ARP, Request who-has 192.168.11.2 tell 192.168.11.1, length 28
18:39:58.666781 ARP, Reply 192.168.11.2 is-at ea:2a:bc:40:41:30 (oui Unknown), length 28
18:39:58.669531 ARP, Request who-has 192.168.11.1 tell 192.168.11.2, length 28
18:39:58.669582 ARP, Reply 192.168.11.1 is-at 1e:be:d4:00:51:0e (oui Unknown), length 28

It stops after these ARP requests. Sometimes it sends packets again, but
after a few seconds it freezes.
So, at the end, I couldn't figure out where the issue is...

>
> Regards
> 	Oliver

On Tue, Dec 12, 2023 at 05:11:49PM -0800, Maciej Żenczykowski wrote:
> On Tue, Dec 12, 2023 at 1:21 PM Oliver Neukum <oneukum@xxxxxxxx> wrote:
> >
> > On 12.12.23 21:32, Hiago De Franco wrote:
> >
> > > Hi,
> > >
> > > On Mon, Dec 11, 2023 at 12:44:42PM -0800, Maciej Żenczykowski wrote:
> > >> On Mon, Dec 11, 2023 at 12:29 PM Hiago De Franco <hiagofranco@xxxxxxxxx> wrote:
> > >
> > >> Hiago, could you try lowering CDC_NCM_TIMER_PENDING_CNT, if need be all the way to 1?
> > >> It is defined in include/linux/usb/cdc_ncm.h as 3 currently
> > >> This applies to the host side.
> > >
> > > On my side CDC_NCM_TIMER_PENDING_CNT is set to 2 by default, did you
> > > mean CDC_NCM_RESTART_TIMER_DATAGRAM_CNT?
> >
> > Yes, I meant that. Sorry.
> >
> > > Despite of that, I tried to lower both CDC_NCM_TIMER_PENDING_CNT and
> > > CDC_NCM_RESTART_TIMER_DATAGRAM_CNT all the way down to 1, first the
> > > CDC_NCM_TIMER_PENDING_CNT, then CDC_NCM_RESTART_TIMER_DATAGRAM_CNT and
> > > finally both at the same time, but it didn't help.
> > >
> > > I've also put some printks as following:
> > >
> > > ctx->tx_curr_frame_num = n;
> > > printk("hfranco: tx_curr_frame_num = %d", n);
> > >
> > > if (n == 0) {
> > > 	printk("hfranco: n == 0");
> > > 	/* wait for more frames */
> > > 	/* push variables */
> > > 	ctx->tx_curr_skb = skb_out;
> > > 	goto exit_no_skb;
> > >
> > > } else if ((n < ctx->tx_max_datagrams) && (ready2send == 0) && (ctx->timer_interval > 0)) {
> > > 	printk("hfranco: tx_max_datagrams = %d", ctx->tx_max_datagrams);
> > > 	printk("hfranco: timer_interval = %d", ctx->timer_interval);
> > > 	printk("hfranco: n inside else if = %d", n);
> > > 	/* wait for more frames */
> > > 	/* push variables */
> > > 	ctx->tx_curr_skb = skb_out;
> > > 	/* set the pending count */
> > > 	if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT)
> > > 		ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT;
> > > 	goto exit_no_skb;
> > >
> > > } else {
> > > 	printk("hfranco: n inside else = %d", n);
> > > 	if (n == ctx->tx_max_datagrams)
> > > 		ctx->tx_reason_max_datagram++;	/* count reason for transmitting */
> > >
> > > I ran it on my host PC, compiled it as module for my Debian dekstop, and
>
> Shouldn't you be doing this on the gadget side?
> I thought we were thinking it was the gadget transmit timer having issues.

Because I still don't know where the issue is, I was testing both sides.

> > > this is the dmesg:
> > >
> > > [ 9663.478807] hfranco: tx_curr_frame_num = 1
> > > [ 9663.478816] hfranco: tx_max_datagrams = 40
> > > [ 9663.478818] hfranco: timer_interval = 400000
> > > [ 9663.478820] hfranco: n inside else if = 1
> > > [ 9663.478822] hfranco: timer started
> > > [ 9663.479645] hfranco: tx_curr_frame_num = 1
> > > [ 9663.479652] hfranco: n inside else = 1
> > >
> > > And then it basically repeats. Looks like 'n' never passes the 1 value.
> > > By tweaking the flags mentioned before, 'n' got a value of 4, but that
> > > was the maximum value. I was wondering, why do you think this code looks
> > > suspicious? I basically just inserted some printks on the tx side, I
> > > will see if I can get something from the rx as well.
> >
> > If we look at the statistics you initially gathered, we can see that all transmissions
> > on the host side happen because the timeout elapses. That, however, does
> > _not_ tell us that the host is to blame. We could look at two possible scenarios
> >
> > A - the gadget is bundling up the packets with too much delay and the host
> > just answers to the megatransmissions with one packet and the delay on the host
> > is inconsequential
> >
> > B - the timer on the host runs for too long or sometimes not at all. If that were
> > the case that code I pointed out would be most likely to blame
> >
> > Could I suggest we try to localize the issue? Can you ping the host from the device?
> >
> > Regards
> > 	Oliver
>
> --
> Maciej Żenczykowski, Kernel Networking Developer @ Google

Sorry if I couldn't provide enough information; I'm kind of learning through
the process. The TX_MAX_NUM_DPE result is something I'm still not fully
convinced about, but again, I will keep debugging.

Best regards,
Hiago.