> From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
> Sent: Friday, December 18, 2020 9:31 PM
>
> On Thu, Dec 17, 2020 at 9:20 PM Parav Pandit <parav@xxxxxxxxxx> wrote:
> >
> >
> > > From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
> > > Sent: Friday, December 18, 2020 8:41 AM
> > >
> > > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@xxxxxxxxx> wrote:
> > > >
> > > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > > The problem is PCIe DMA wasn't designed to function as a network
> > > switch fabric and when we start talking about a 400Gb NIC trying to
> > > handle over 256 subfunctions it will quickly reduce the
> > > receive/transmit throughput to gigabit or less speeds when
> > > encountering hardware multicast/broadcast replication.
> > > With 256 subfunctions a simple 60B ARP could consume more than 19KB
> > > of PCIe bandwidth due to the packet having to be duplicated so many
> > > times. In my mind it should be simpler to simply clone a single skb
> > > 256 times, forward that to the switchdev ports, and have them
> > > perform a bypass (if available) to deliver it to the subfunctions.
> > > That's why I was thinking it might be a good time to look at
> > > addressing it.
> > Linux tc framework is rich to address this and already used by
> > openvswich for years now.
> > Today arp broadcasts are not offloaded. They go through software path
> > and replicated in the L2 domain.
> > It is a solved problem for many years now.
>
> When you say they are replicated in the L2 domain I assume you are talking
> about the software switch connected to the switchdev ports.

Yes.

> My question is
> what are you doing with them after you have replicated them? I'm assuming
> they are being sent to the other switchdev ports which will require a DMA to
> transmit them, and another to receive them on the VF/SF, or are you saying
> something else is going on here?
>

Yes, that is correct.
> My argument is that this cuts into both the transmit and receive DMA
> bandwidth of the NIC, and could easily be avoided in the case where SF
> exists in the same kernel as the switchdev port by identifying the multicast
> bit being set and simply bypassing the device.

It can probably be avoided, but it's probably not worth it for occasional ARP
packets on a neighbor cache miss.
If I am not mistaken, even some recent HW can forward such ARP packets to
multiple switchdev ports with commit 7ee3f6d2486e, without following the DMA
path described above.
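
For a rough sense of scale, the 19KB figure quoted above can be sanity-checked with a little arithmetic. This is only an illustrative sketch: the helper name `replicated_bytes` and the 1 KB = 1024 bytes convention are assumptions, and the thread does not say what the per-copy overhead actually consists of.

```python
# Back-of-the-envelope check of the replication cost quoted in the thread.
ARP_FRAME_BYTES = 60             # from the thread: a "simple 60B ARP"
NUM_SUBFUNCTIONS = 256           # from the thread
QUOTED_TOTAL_BYTES = 19 * 1024   # "more than 19KB"; 1 KB = 1024 B is an assumption


def replicated_bytes(frame_bytes, copies, per_copy_overhead=0):
    """Total bytes moved when one frame is DMA'd once per copy (hypothetical helper)."""
    return (frame_bytes + per_copy_overhead) * copies


# Frame data alone, ignoring descriptors and any PCIe framing.
payload_only = replicated_bytes(ARP_FRAME_BYTES, NUM_SUBFUNCTIONS)

# Per-copy overhead that the quoted total would imply on top of the frame itself.
implied_overhead = QUOTED_TOTAL_BYTES / NUM_SUBFUNCTIONS - ARP_FRAME_BYTES

print(payload_only)       # bytes of frame data across all copies
print(implied_overhead)   # bytes of overhead per copy implied by 19 KB
```

Under these assumptions the frame data alone accounts for 15360 bytes, so the quoted "more than 19KB" corresponds to a little over 16 bytes of additional overhead per replicated copy.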