Hello,
I would like to request that the following patch from v4.10-rc1 be
cherry-picked into "stable/linux-4.9.y":
commit f114dca2533ca770aebebffb5ed56e5e7d1fb3fb
Author: Alexander Duyck <alexander.h.duyck@xxxxxxxxx>
Date: Tue Oct 25 16:08:46 2016 -0700
i40e: Be much more verbose about what we can and cannot offload
This change makes it so that we are much more robust about defining what we
can and cannot offload. Previously we were just checking for the L4 tunnel
header length, however there are other fields we should be verifying as
there are multiple scenarios in which we cannot perform hardware offloads.
In addition the device only supports GSO as long as the MSS is 64 or
greater. We were not checking this so an MSS less than that was resulting
in Tx hangs.
Change-ID: I5e2fd5f3075c73601b4b36327b771c64fcb6c31b
Signed-off-by: Alexander Duyck <alexander.h.duyck@xxxxxxxxx>
Tested-by: Andrew Bowers <andrewx.bowers@xxxxxxxxx>
Debian has an old bug report
<https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=892105>, originally
filed against 4.9.82. The problem still exists in the current kernel
4.9.258 of Debian's old-stable 9 "Stretch", and also in the latest stable
4.9.273.
Our environment
===============
- KVM server
- dual port i40e
- classic bridge with enp96s0f0
- VM attached to bridge via veth
- no VLANs
- no MacVLan
# ethtool -i enp96s0f0
driver: i40e
version: 1.6.16-k
firmware-version: 3.33 0x80000e48 1.1876.0
expansion-rom-version:
bus-info: 0000:60:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
# lspci -s 0000:60:00.0
60:00.0 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GBASE-T (rev 09)
Analysis
========
As soon as we start one of our "Ubuntu" images the bridge stops
receiving unicast packets for *all* VMs on that bridge.
- we still see outgoing traffic leaving the host, e.g. ARP requests
- "tcpdump -i enp96s0f0" shows no incoming unicast traffic, e.g. no ARP
response
- broadcast traffic passes the bridge
- VMs on the same bridge can communicate with each other
Most often I see the following error message after doing `dmesg -n 8`:
[ +9,376367] i40e 0000:60:00.0: cleared PE_CRITERR
[ +0,000252] i40e 0000:60:00.0: TX driver issue detected, PF reset issued
[ +0,859912] i40e 0000:60:00.0: Error I40E_AQ_RC_EINVAL adding RX filters on PF, promiscuous mode forced on
In one case I also saw the following (I don't know whether it is relevant):
[ 218.921466] i40e 0000:60:00.0 enp96s0f0: VSI_seid 390, Hung TX queue 43, tx_pending_hw: 1, NTC:0xa6, HWB: 0xa6, NTU: 0xa7, TAIL: 0xa7
[ 218.921470] i40e 0000:60:00.0 enp96s0f0: VSI_seid 390, Issuing force_wb for TX queue 43, Interrupt Reg: 0x0
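To check another host for the same symptoms, a filter along these lines should do. This is only a sketch: the function name is made up, and the patterns are simply the message fragments quoted above, so other driver versions may word them differently.

```shell
# Hypothetical helper: keep only the i40e kernel log lines that match the
# symptoms quoted in this mail. Reads the log from stdin.
i40e_tx_symptoms() {
    grep -E 'i40e .*(cleared PE_CRITERR|TX driver issue detected|Hung TX queue)'
}

# Example (needs permission to read the kernel log):
#   dmesg | i40e_tx_symptoms
```

An empty result does not prove the host is unaffected; it only means none of these three messages has been logged so far.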
After that error the only way to recover from this broken state is to
reboot the host. I've been unable to tear down the bridge and/or remove
the `i40e` driver: trying to do so most often crashes the Linux kernel
(some other bug on `ip link set enp96s0f0 nomaster`).
If you need more data I have a PCAP file, but I still don't know which
packet exactly triggers the bug.
The bug seems to be fixed in 4.10.0, and I bisected the fix down to:
git bisect start '--' 'drivers/net/ethernet/intel/i40e'
# new: [c470abd4fde40ea6a0846a2beab642a578c0b8cd] Linux 4.10
git bisect new c470abd4fde40ea6a0846a2beab642a578c0b8cd
# old: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9
git bisect old 69973b830859bc6529a7a0468ba0d80ee5117826
# old: [13fd3f9cc3def8b276c7913ae4acbfa2653cb198] i40e: clear mac filter count on reset
git bisect old 13fd3f9cc3def8b276c7913ae4acbfa2653cb198
# new: [7ec9ba11b046b4b7fd768c366870ada60d409295] i40e: Driver prints log message on link speed change
git bisect new 7ec9ba11b046b4b7fd768c366870ada60d409295
# new: [0b7c8b5d5436317a5f4509e2a150c6cec017f348] i40e: fix trivial typo in naming of i40e_sync_filters_subtask
git bisect new 0b7c8b5d5436317a5f4509e2a150c6cec017f348
# new: [f114dca2533ca770aebebffb5ed56e5e7d1fb3fb] i40e: Be much more verbose about what we can and cannot offload
git bisect new f114dca2533ca770aebebffb5ed56e5e7d1fb3fb
# old: [81fa7c97bebd6e1a52c4e059eeffe18df5b3f11f] i40e: Implementation of ERROR state for NVM update state machine
git bisect old 81fa7c97bebd6e1a52c4e059eeffe18df5b3f11f
# old: [3aa7b74dbeedfb32406fec70cfd76d797209e8c9] i40e: removed unreachable code
git bisect old 3aa7b74dbeedfb32406fec70cfd76d797209e8c9
# first new commit: [f114dca2533ca770aebebffb5ed56e5e7d1fb3fb] i40e: Be much more verbose about what we can and cannot offload
I used v4.10 as the base and bisected only the contents of
drivers/net/ethernet/intel/i40e/, because vanilla v4.9 and several other
versions between it and v4.10 crashed my host. Each bisect step was
basically:
git checkout v4.10
git checkout $hash -- drivers/net/ethernet/intel/i40e/
make all modules_install install
git checkout v4.10 -- drivers/net/ethernet/intel/i40e/
git bisect (old|new) $hash
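For anyone who wants to repeat this, the manual steps above could be wrapped roughly as follows. This is a sketch, not the script I actually ran: the function names and the BASE, I40E_DIR and BUILD_CMD variables are my own shorthand. It is split in two because the verdict is only known after booting and testing the freshly installed kernel.

```shell
# Assumed defaults; override them in the environment if your setup differs.
BASE=${BASE:-v4.10}
I40E_DIR=${I40E_DIR:-drivers/net/ethernet/intel/i40e}
BUILD_CMD=${BUILD_CMD:-make all modules_install install}

# Step 1: take the $BASE tree, overlay only the driver directory as it was
# at commit $1, then build and install that hybrid kernel.
overlay_and_build() {
    git checkout -q "$BASE" &&
    git checkout "$1" -- "$I40E_DIR" &&
    $BUILD_CMD
}

# Step 2 (after reboot and test): put the driver directory back to $BASE
# and tell git bisect whether commit $1 was "old" or "new" ($2).
record_verdict() {
    git checkout "$BASE" -- "$I40E_DIR" &&
    git bisect "$2" "$1"
}
```

Note that `git checkout $hash -- <dir>` only writes the files present at $hash; files that exist only in the base tree stay in place, which was good enough for this bisect.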
I verified that cherry-picking f114dca2533ca770aebebffb5ed56e5e7d1fb3fb
on top of v4.9.273 fixes the problem, and that reverting it again makes
the problem reappear.
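The check described above amounts to something like the following sketch. The function and branch names are arbitrary choices of mine, and the build and boot steps are left out.

```shell
# Hypothetical helper: create a throwaway branch at tag/commit $1 and
# cherry-pick commit $2 onto it. -x records the original commit id in
# the new commit message.
backport_check() {
    base=$1
    commit=$2
    git checkout -q -b i40e-backport-test "$base" &&
    git cherry-pick -x "$commit"
}

# Example, run inside a linux-stable checkout:
#   backport_check v4.9.273 f114dca2533ca770aebebffb5ed56e5e7d1fb3fb
```

To get back to the failing state for comparison, undo the pick again, e.g. with `git revert HEAD`, and rebuild.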
Philipp
--
Philipp Hahn
Open Source Software Engineer
Univention GmbH
be open.
Mary-Somerville-Str. 1
D-28359 Bremen
📞 +49-421-22232-57
🖶 +49-421-22232-99
✉️ hahn@xxxxxxxxxxxxx
🌐 https://www.univention.de/
Geschäftsführer: Peter H. Ganten
HRB 20755 Amtsgericht Bremen
Steuer-Nr.: 71-597-02876