Hi Benjamin Bigler,

Thank you for your testing and feedback. It is really helpful in bringing
the driver into good shape. We really appreciate your efforts on this.

On 24/03/24 5:25 pm, Benjamin Bigler wrote:
> Hi Parthiban
>
> I hope I send this in the right context as it is not related to just one patch or
> some specific code.
>
> I conducted UDP load testing using three i.MX8MM boards in conjunction with the
> LAN8651. The setup involved one board functioning as a server, which is just
> echoing back received data, while the remaining two boards acted as clients,
> sending UDP packets of different sizes in various bursts to the server.
> Due to hardware constraints, the SPI bus speed was limited to 15 MHz, which might
> have influenced the results.
>
> During the tests I experienced some issues:
>
> - The boards just start receiving after first sending something (ping another board).
>   Some measurements showed that the irq stays asserted after init. This makes sense
>   as far as I understand the chapter 7.7 of the specification, the irq is deasserted
>   on reception of the first data header following CSn being asserted. As a workaround
>   I trigger the thread at the end of oa_tc6_init.

It looks like the IRQ is asserted on reset completion and the MAC-PHY
expects a data chunk from the host before it deasserts the IRQ. I test the
driver on an RPI 4 using iperf3 and, for some reason, have never faced this
issue. Maybe some packet is transmitted while the network device is being
registered, which delivers a data chunk so that the IRQ is deasserted.

Thanks for the workaround. I think that is the right way to solve this
issue.
Adding the lines below at the end of oa_tc6_init() will trigger
oa_tc6_spi_thread_handler() to perform an empty data chunk transfer, which
deasserts the IRQ before the actual data transfer starts.

	/* The oa_tc6_sw_reset_macphy() function resets the MAC-PHY and clears
	 * its reset complete status. The IRQ is also asserted on reset
	 * completion and remains asserted until the MAC-PHY receives a data
	 * chunk, so performing an empty data chunk transmission will deassert
	 * the IRQ. Refer to sections 7.7 and 9.2.8.8 of the OPEN Alliance
	 * specification for more details.
	 */
	tc6->int_flag = true;
	wake_up_interruptible(&tc6->spi_wq);

>
> - If there is a lot of traffic, the receive buffer overflow error spams the log.
>
> - If there is a lot of traffic, I got various kernel panics in oa_tc6_update_rx_skb.
>   Mostly because more data to rx_skb is added than allocated and sometimes because
>   rx_skb is null in oa_tc6_update_rx_skb or oa_tc6_prcs_rx_frame_end. Some debugging
>   with a logic analyzer showed that the chip does not behave correctly. There are more
>   bytes between start_valid and end_valid than there should be. Also there
>   seem to be 2 end_valid without a start_valid between. What is common is that the incorrect
>   frame starts in a chunk where end_valid and start_valid is set.
>   In my opinion it's a problem in the chip (maybe related to the errata in the next point)
>   but the driver should be resilient and just drop the packet and not cause a kernel panic.

I usually run into this "receive buffer overflow" issue when I run the
RPI 4 with the default CPU governor setting, which is "ondemand". In that
case, even if I set the SPI clock speed to 15 MHz, the RPI 4 core clock is
scaled down when idle, which results in about half of the configured SPI
clock speed, around 5.9 MHz. So systems like the RPI 4 need performance
mode enabled to get the proper SPI clock speed. Refer to the link below for
more details.
https://github.com/raspberrypi/linux/issues/3381#issuecomment-1144723750

I enable performance mode using the command below.

echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor > /dev/null

So please verify the SPI clock speed with a logic analyzer to get the
maximum throughput without receive buffer overflow.

Of course, I agree that the driver should not crash in case of a receive
buffer overflow. From your investigation, I understand that the buffers in
the MAC-PHY are being overwritten again and again because the host is too
slow to read the data from the MAC-PHY buffers through SPI, which alters
the descriptors. There might be two reasons why we run into this situation.

1. The host is busy doing something else and delays initiating the SPI
   transfer even though the SPI clock speed is 15 MHz.
2. The SPI clock speed is less than 15 MHz.

I use the iperf3 setup below for my testing and have never faced the driver
crash, even though I did see the "receive buffer overflow" error when
running the RPI 4 in the default "ondemand" mode.

Node 0 - Raspberry Pi 4 with LAN8650 MAC-PHY
$ iperf3 -s

Node 1 - Raspberry Pi 4 with EVB-LAN8670-USB USB Stick
$ iperf3 -c 192.168.5.100 -u -b 10M -i 1 -t 0

and vice versa.

I never faced the "receive buffer overflow" error when running the RPI 4
with "performance" mode enabled, even when all the cores were stressed
using the command below.

$ yes >/dev/null & yes >/dev/null & yes >/dev/null & yes >/dev/null &

Can you share more details about your test setup and the applications you
use, so that I can try to reproduce the issue in my setup and debug the
driver?

>
> - Sometimes the chip stops working. It always asserts the irq but there is no data (rca=0)
>   and also exst is not active. I found out that there is an errata (DS80001075) point s3
>   that explains this. I set the ZARFE bit in CONFIG0. This also fixes the point above.
>   The driver now works since about 2.5 weeks with various load with just one loss of frame
>   error where I had to reboot the system after about 4 days.

It is good to hear that the driver works fine with the above changes. As
mentioned in the errata, this continuous interrupt issue is a known issue
with LAN8651 Rev.B0. Switching to LAN8651 Rev.B1 will solve it without any
workaround.

Setting the ZARFE bit in CONFIG0 will solve the continuous interrupt issue,
but I don't know how it also solves the "receive buffer overflow" issue
above. I think it is a good idea to test once with LAN8651 Rev.B1 without
setting the ZARFE bit. It would be interesting to see the result.

I always use LAN8651 Rev.B1 for my testing. I should be able to reproduce
the "receive buffer overflow" issue and the consequent kernel crash in my
setup with LAN8651 Rev.B1 so that I can investigate the issue further. As I
am not able to reproduce it on my RPI 4, I need your help with the tests
and applications you used in your setup.

>
> Is there a reason why you removed the netdev watchdog which was active in v2?

When the timeout occurs, there is no further action except incrementing
tx_errors. I don't see this being needed except for USB-to-Ethernet
devices, which can be removed unexpectedly. But this is an SPI interface,
which will not be removed unexpectedly as it is a platform device. That's
why we removed it.

Best regards,
Parthiban V
>
> Thanks,
> Benjamin Bigler
>
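
For reference, the "drop the frame instead of crashing" behaviour requested
above could look roughly like the userspace C sketch below. It is only a
model under stated assumptions and not the driver's actual code: the names
rx_chunk, rx_state and rx_process_chunk() are hypothetical, the offsets are
plain byte offsets within the chunk payload (a simplification of the real
footer fields), and RX_BUF_SIZE is an assumed buffer size. The idea is that
an overlong frame, an end_valid without a preceding start_valid, or a
start_valid while a frame is still open is counted as an error and
discarded, so the buffer is never overrun and never used while empty.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RX_BUF_SIZE 1536u /* assumed receive buffer size for this sketch */

/* Hypothetical, already-decoded view of one received data chunk. */
struct rx_chunk {
	const uint8_t *data; /* chunk payload */
	size_t len;          /* payload bytes in this chunk */
	bool start_valid;    /* a frame starts in this chunk */
	bool end_valid;      /* a frame ends in this chunk */
	size_t start_off;    /* offset of the first frame byte, if start_valid */
	size_t end_off;      /* offset of the last frame byte, if end_valid */
};

struct rx_state {
	uint8_t buf[RX_BUF_SIZE];
	size_t used;             /* bytes collected for the open frame */
	bool in_frame;           /* a start was seen and no end yet */
	unsigned long delivered; /* complete frames handed up */
	unsigned long errors;    /* anomalies seen, frame data discarded */
};

/* Discard whatever has been collected and wait for the next start. */
static void rx_abort(struct rx_state *s)
{
	s->in_frame = false;
	s->used = 0;
	s->errors++;
}

/* Bounds-checked append: an overlong frame is dropped, never overrun. */
static bool rx_append(struct rx_state *s, const uint8_t *p, size_t n)
{
	if (n > RX_BUF_SIZE - s->used) {
		rx_abort(s);
		return false;
	}
	memcpy(s->buf + s->used, p, n);
	s->used += n;
	return true;
}

static void rx_deliver(struct rx_state *s)
{
	s->delivered++; /* a real driver would pass the frame up here */
	s->in_frame = false;
	s->used = 0;
}

void rx_process_chunk(struct rx_state *s, const struct rx_chunk *c)
{
	bool end_first;

	/* Offsets pointing outside the payload cannot be trusted. */
	if ((c->start_valid && c->start_off >= c->len) ||
	    (c->end_valid && c->end_off >= c->len)) {
		rx_abort(s);
		return;
	}

	/* The end belongs to the previous frame if it precedes the start. */
	end_first = c->end_valid &&
		    (!c->start_valid || c->end_off < c->start_off);
	if (end_first) {
		if (!s->in_frame)
			s->errors++; /* end_valid without start_valid */
		else if (rx_append(s, c->data, c->end_off + 1))
			rx_deliver(s);
	}

	if (c->start_valid) {
		bool ends_here = c->end_valid && !end_first;
		size_t stop = ends_here ? c->end_off + 1 : c->len;

		if (s->in_frame)
			rx_abort(s); /* new start while a frame is open */
		s->in_frame = true;
		if (rx_append(s, c->data + c->start_off, stop - c->start_off) &&
		    ends_here)
			rx_deliver(s);
	} else if (!c->end_valid && s->in_frame) {
		rx_append(s, c->data, c->len); /* middle of a frame */
	}
}
```

A caller would zero-initialise one struct rx_state and feed it every
received chunk in order. Whether the driver should count such events,
rate-limit a log message, or both is a separate design choice.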