> From: Parav Pandit <parav@xxxxxxxxxx>
> Sent: Friday, December 18, 2020 10:51 AM
>
> > From: Alexander Duyck <alexander.duyck@xxxxxxxxx>
> > Sent: Friday, December 18, 2020 8:41 AM
> >
> > On Thu, Dec 17, 2020 at 5:30 PM David Ahern <dsahern@xxxxxxxxx> wrote:
> > >
> > > On 12/16/20 3:53 PM, Alexander Duyck wrote:
> > The problem is PCIe DMA wasn't designed to function as a network
> > switch fabric and when we start talking about a 400Gb NIC trying to
> > handle over 256 subfunctions it will quickly reduce the
> > receive/transmit throughput to gigabit or less speeds when encountering
> > hardware multicast/broadcast replication.
> > With 256 subfunctions a simple 60B ARP could consume more than 19KB of
> > PCIe bandwidth due to the packet having to be duplicated so many
> > times. In my mind it should be simpler to simply clone a single skb
> > 256 times, forward that to the switchdev ports, and have them perform
> > a bypass (if available) to deliver it to the subfunctions. That's why
> > I was thinking it might be a good time to look at addressing it.
> Linux tc framework is rich to address this and already used by openvswich for
> years now.
> Today arp broadcasts are not offloaded. They go through software patch and
s/patch/path
> replicated in the L2 domain.
> It is a solved problem for many years now.
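
For reference, the replication cost quoted above can be reproduced with simple arithmetic. The sketch below is illustrative only. The thread does not say how the "more than 19KB" figure was derived, so the 16-byte per-copy overhead is an assumption picked to show how 256 copies of a 60B frame could reach roughly 19KB. Real PCIe TLP and descriptor overhead varies by device.

```python
# Rough estimate of the PCIe bytes consumed when hardware replicates one
# broadcast frame to every subfunction.
# ASSUMPTION: the per-copy overhead value is hypothetical, not from the thread.

def replication_bytes(frame_len, copies, per_copy_overhead=0):
    """Total bytes moved over PCIe for `copies` duplicates of one frame."""
    return copies * (frame_len + per_copy_overhead)

# 60B ARP frame duplicated to 256 subfunctions, payload only.
payload_only = replication_bytes(60, 256)

# Same frame with an assumed 16 bytes of per-copy overhead.
with_overhead = replication_bytes(60, 256, per_copy_overhead=16)

print(payload_only, with_overhead)
```

Payload alone accounts for 15360 bytes. With the assumed overhead the total is 19456 bytes, all of it spent delivering a single 60B broadcast.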