RE: Excessive network latency when using Realtek R8168/R8111 et al NIC

Stephane ANCELOT <sancelot@xxxxxxxxxxxxxxx> · Wed, 24 May 2023 09:37:06 +0000

Hi,

Hi,

I worked on realtime realtek chips drivers in the past.
I remember there was a delay register to trigg the interrupt  in the realtek component.

This may be a similar problem with this chip.

Refs https://patchwork.ozlabs.org/project/netdev/patch/20121127224605.GA10228@xxxxxxxxxxxxxxxxxxxxxxxxxx/

Stéphane ANCELOT

> -----Message d'origine-----
> De : Marcelo Tosatti <mtosatti@xxxxxxxxxx>
> Envoyé : mercredi 24 mai 2023 01:04
> À : Rod Webster <rod@xxxxxxxxxx>
> Cc : Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>; linux-rt-
> users@xxxxxxxxxxxxxxx
> Objet : Re: Excessive network latency when using Realtek R8168/R8111 et al
> NIC
> 
>         External Sender: Use caution with links/attachments.
> Expediteur externe : Soyez prudent avec les liens/pieces jointes.
> 
> On Tue, May 23, 2023 at 06:02:13AM +1000, Rod Webster wrote:
> > This stuff is hard! I just realised that rtapi_app is a red herring!
> > rtapi_app is Linuxcnc and there is nothing wrong with it. Its thread
> > is on a 1000us cycle so it seems it gets all its jobs done in 200us
> > and then sleeps for 800us which makes perfect sense!
> >
> > The issue we have is deeper than that. I think we should be looking at
> > the NIC interrupt (but don't trust the novice!).
> > The network communication is consuming more than the 800us slack from
> > time to time. When that happens, our hardware sees the timing overrun
> > and increments an internal packet error count. If too many of these
> > happen in succession, the hardware decides the RT environment can't be
> > relied on, disables further communication and returns an "error
> > finishing read" to Linuxcnc to say it's given up.
> >
> > Marcelo, we didn't resort to C. We were able to use a bash script and
> > use a linuxcnc tool called halcmd to query the hardware as shown here.
> > #!/usr/bin/bash
> > stat=0
> > while (($stat < 1))
> > do
> > stat=`halcmd getp hm2_7i96s.0.packet-error-total` done trace-cmd stop
> 
> You can also write a value to the trace before stopping it:
> 
> echo "pin high, stopping tracing!" > /sys/kernel/debug/tracing/trace_marker
> 
> Just to make sure you are seeing the trace up to the correct event.
> 
> > I think we need to increase the stat threshold  so we get more samples
> > in our trace before stopping it. The current trace will only have one
> > instance.
> > Thanks for letting me see the issue more clearly.
> >
> >
> > Rod Webster
> >
> >
> >
> > Rod Webster
> >
> > VMN®
> >
> > www.vmn.com.au
> >
> > Ph: 1300 896 832
> >
> > Mob: +61 435 765 611
> >
> >
> >
> >
> > On Tue, 23 May 2023 at 03:13, Marcelo Tosatti <mtosatti@xxxxxxxxxx>
> wrote:
> > >
> > > On Mon, May 22, 2023 at 08:06:39PM +1000, Rod Webster wrote:
> > > > Yes, I won't be using a DKMS driver and will test with the in-tree driver.
> > > > I rolled a PC back to the Debian 6.1 RT kernel and the default
> > > > in-tree
> > > > R8169 driver yesterday with isolcpus=2,3.
> > > >
> > > > I had a shot at it last night with the napi thread but was
> > > > confused by the trace results so will try again.
> > > > I could not see our PacketErrorTest script appearing in the trace
> > > > but it stopped the trace when it should.
> > >
> > > Rod,
> > >
> > > Can you describe the PacketErrorTest script and how do you know,
> > > from it, that latency has crossed the threshold? What was confusing
> > > about the traces?
> > >
> > > Earlier you wrote:
> > > I've had some feedback from a hardware supplier.
> > > The Linuxcnc RT ethernet driver has a pin that reports ethernet
> > > packet error count.
> > > This increments when the ethernet response exceeds our real time
> > > servo thread period (typically 1 ms) We should be able to monitor
> > > this pin and stop the trace if it is >  0 without too much effort.
> > >
> > > So you are seeing that pin increase its value?
> > >
> > > When you do, you should write to the "tracing_on" file (a zero):
> > >
> > >         /*
> > >          * if we're not tracing and the tracing_on fd is not open,
> > >          * open the tracing_on file so that we can stop the trace
> > >          * if we hit a breaktrace threshold
> > >          */
> > >         if (trace_fd < 0) {
> > >                 sprintf(path, "%s/%s", fileprefix, "tracing_on");
> > >                 if ((trace_fd = open(path, O_WRONLY)) < 0)
> > >                         warn("unable to open tracing_on file: %s\n", path);
> > >         }
> > >
> > >         /* now stop any trace */
> > >         write(trace_fd, "0\n", 2);
> > >
> > > The trace file (/sys/kernel/debug/tracing/trace) should now contain
> > > the trace up to the write to tracing_on file.
> > >
> > > Sebastian: didnt see anything weird in the traces as well:
> > >
> > > "Follows the delta between a list of timestamped events, where each
> > > event is: "TIMESTAMP: rtapi is scheduled IN CPU-3"
> > > (there is at one time only a single process scheduled in cpu-3).
> > >
> > > [191.050245, 191.050278]: 0.000033
> > > [191.050278, 191.050293]: 0.000015
> > > [191.050293, 191.050307]: 0.000014
> > > [191.050307, 191.050322]: 0.000015
> > > [191.050322, 191.050336]: 0.000014
> > >
> > > rtapi_app goes to sleep for approximately 15us, and is awakened by a
> > > timer.
> > >
> > > [191.050336, 191.050350]: 0.000014
> > > [191.050350, 191.050365]: 0.000015
> > > [191.050365, 191.050379]: 0.000014
> > > [191.050379, 191.050393]: 0.000014
> > > [191.050393, 191.050408]: 0.000015
> > > [191.050408, 191.051243]: 0.000835
> > > [191.051243, 191.051272]: 0.000029
> > >
> > > It eventually sleeps for 800us (and this pattern repeats)".
> > >
> > > Perhaps tracing the system call which sends out the packet would be
> > > useful as well.
> > >
> > > > Also our RT process did not appear to be on an isolated thread
> > > > which seemed odd. I did not get as far as setting the affinity for napi.
> > > >
> > > > I've also installed  Bookworm RC3 on another PC with the default
> > > > RT kernel on a Gigabyte Brix N3160 with an R8111/R8168 NIC and the
> > > > in-tree R8169 driver.
> > > > This PC historically had max latency of about 130000 ns  back with
> > > > kernel 4.x and now its reporting 23000 ns with the 6.1 kernel over
> > > > 24 hours testing with the linuxcnc tools. Quite amazing what you
> > > > guys have achieved! It will be a few days before I can lug my
> > > > hardware to work to test on it as well for another data point.
> > > >
> > > > I really appreciate your willingness to help. This remains a
> > > > frustrating issue for many Linuxcnc users who are using my
> > > > compiled
> > > > 6.3 kernel to get out of jail.
> > > > Bear with me as I bumble my way through!
> > > >
> > > > Rod Webster
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Rod Webster
> > > >
> > > > VMN®
> > > >
> > > > www.vmn.com.au
> > > >
> > > > Ph: 1300 896 832
> > > >
> > > > Mob: +61 435 765 611
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, 22 May 2023 at 19:32, Sebastian Andrzej Siewior
> > > > <bigeasy@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On 2023-05-19 21:41:30 [+1000], Rod Webster wrote:
> > > > > > I will do some testing tomorrow now I know what's going on but
> > > > > > I wonder how relevant it will be given this turnaround?
> > > > >
> > > > > I'm not going to look at the dkms/OOT driver. If there is an
> > > > > actual problem or something odd with mainline and RT and I will look
> into it.
> > > > > I managed to find an older box which is using the r8169 and I'm
> > > > > going to boot and see if there the driver is doing something odd
> > > > > what I didn't see earlier…
> > > > >
> > > > > Please make sure that you are using the in-tree driver. If you
> > > > > experience high latencies then I'm willing to help investigate.
> > > > >
> > > > > > Rod Webster
> > > > >
> > > > > Sebastian
> > > >
> > > >
> > >
> >
> >