Hi Carlo,

On Sat, Dec 08, 2018 at 10:46:17AM +0000, Carlo Caione wrote:
> On Fri, 2018-12-07 at 19:51 +0100, Emiliano Ingrassia wrote:
> > Hi Carlo,
>
> Hi Emiliano,
>
> > tests[0] conducted on an Odroid-C1+ board equipped with a Meson8b
> > SoC have shown a high packet loss (90% and more) during a simple
> > ping test from a laptop to the board.
> > Testing the two patches separately clearly showed that this
> > depends on the removal of the "eee-broken-1000t" flag from the
> > board PHY description in the corresponding device tree.
> >
> > About the first patch (MAC IRQ type), no tests have shown evidence
> > that it is needed. I suggest you conduct some tests on real
> > hardware, as I did, to confirm or disprove my results.
>
> Let's try to step back a bit and see what we can do to clarify this
> situation.

Ok, I'll be glad to help you :)

> First of all, for arm64 we are pretty sure that both patches are
> needed because we ran extensive and lengthy tests, especially
> regarding the change in the IRQ trigger type. For arm things are not
> so clear, so for now we decided to merge the arm64 patch and just
> wait on the arm one.
>
> We can first focus on the patch regarding the change in the IRQ
> type.
>
> The problem with the IRQ type is triggered on the arm64 boards we
> tested using the script in [0]. If we run this stress test on the
> arm64 boards without the trigger-changing patch, after a few hours
> (variable from 2h to 6h, sometimes more) we can see the connection
> dropping from ~1Gbps to <30Mbps. Jerome gave a nice explanation of
> why, but after changing the IRQ trigger type we couldn't see the
> issue anymore. This was confirmed not just by BayLibre but also by
> other sources, so we are pretty confident in this solution.
>
> So my first two points for you to answer are:
>
> 1) Can you reproduce this problem on your board without the patches
> when running this script?
>
> 2) If yes, does only the first patch solve the problem?

I ran two tests executing the script you provided on an Odroid-C1+
board (REV 0.4 20150930) for 6 hours, using my laptop as the server.
The kernel I used was compiled from the "v4.21/dt64-testing" branch
provided by Kevin Hilman (thank you Kevin!).
The results are available in [0].

The first test (no-patch-iperf-20181211000039.log) was run with none
of your patches applied. The second test
(irq-patch-iperf-20181211130953.log) was run with only the IRQ type
patch applied.

As you can see, I did not experience exactly the problem you had, but
I do see more stable behavior with the IRQ type patch applied.

> This brings us to the second issue, the one regarding the
> 'eee-broken-1000t' quirk. Since the two issues are strictly related
> we are confident that the change in the IRQ type solves this problem
> as well (and this was confirmed by Jerome as well on the arm64
> boards).

The problem here is that, without the "eee-broken-1000t" flag, simple
ping tests from a host to the board showed a high packet loss (~90%),
even with the IRQ type patch applied.

> For this case I cannot provide a real reproducer so we need only to
> stress test the network with iperf3 trying to reproduce the issue.
> This is also because we think that your approach of using UDP and
> your packet generator is probably not the best way to test the patch
> given that (1) using UDP is not reliable according to our tests, (2)
> there is an asymmetry in TX/RX, (3) the packet loss could be due to
> saturation of the bandwidth, etc...

The tests I ran with the kernel packet generator showed interesting
information. The board dropped all incoming traffic when transmitting
at full rate (~940 Mbps). Although there is an asymmetry in the
transmission FIFO sizes (the Rx FIFO is twice the size of the Tx
FIFO), I would expect, after a while, a result more similar to the
one I had in step 2 of TEST 0 [1].
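As a side note, the loss percentages I quote come from simple
arithmetic on the packet counters (a minimal sketch; the counters
below are placeholders chosen to match the ~90% figure, not values
taken from the actual logs):

```shell
#!/bin/sh
# Minimal sketch: loss percentage from transmitted/received counters,
# the same arithmetic ping prints in its summary line.
loss_pct() {
    # $1 = packets transmitted, $2 = packets received
    echo $(( ($1 - $2) * 100 / $1 ))
}

loss_pct 100 10     # placeholder counters for the ~90% ping loss case
loss_pct 1000 1000  # a healthy link: 0% loss
```

The same counter-based view applies to the pktgen run above, where
essentially all incoming packets were dropped.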
However, this behavior could be due to the driver, so it may not be
so interesting in this discussion ;)

> So AFAIK the best way to test this problem is using iperf3, the
> same way it is done in the script in [0]. I was not involved with
> this issue a year and a half ago, but AFAIK this is the way it was
> reproduced.
>
> This brings me to more questions for you to answer:
>
> 3) Running iperf3 tests in TX / RX / TX+RX without the
> 'eee-broken-1000t' quirk applied, are you able to reproduce the EEE
> problem?
>
> 4) Any change when the 'eee-broken-1000t' quirk is applied?
>
> When testing (3) and (4) please also check the status of the EEE
> using ethtool.
>
> Hopefully this will bring a bit of clarity to the whole situation :)
>
> Cheers,
>
> [0] https://paste.fedoraproject.org/paste/GBFxjAQ0JULsYQlyYO2KOw
>
> --
> Carlo Caione

Best regards,

Emiliano

[0] https://drive.google.com/drive/folders/1BMe8vkm16KdgijlhFfZH_xph5eDNdkqO?usp=sharing
[1] http://lists.infradead.org/pipermail/linux-amlogic/2018-December/009397.html
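P.S.: in case it helps with (3) and (4), here is a sketch of how the
runs could be driven. The board address is a placeholder, the
commands are only echoed (not executed) so the sequence can be
reviewed first, and note that iperf3's --bidir option needs
iperf3 >= 3.7 (with older versions, run two clients in parallel):

```shell
#!/bin/sh
# Dry-run sketch of the iperf3 TX / RX / TX+RX runs and the EEE check.
# BOARD is a placeholder; replace it with the Odroid-C1+ address, and
# start "iperf3 -s" on the other end first.
BOARD=odroid.local

# Print instead of execute, so the sequence can be reviewed first.
run() { echo "would run: $*"; }

run iperf3 -c "$BOARD" -t 600           # one direction
run iperf3 -c "$BOARD" -t 600 -R        # -R reverses the direction
run iperf3 -c "$BOARD" -t 600 --bidir   # both directions (iperf3 >= 3.7)
run ethtool --show-eee eth0             # EEE status on the board side
```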