On Fri, Apr 08, 2022 at 11:17:56AM -0700, Jakub Kicinski wrote:
> On Fri, 8 Apr 2022 15:48:44 +0300 Maxim Mikityanskiy wrote:
> > >> 4. A slow or malicious AF_XDP application may easily cause an overflow of
> > >> the hardware receive ring. Your feature introduces a mechanism to pause the
> > >> driver while the congestion is on the application side, but no symmetric
> > >> mechanism to pause the application when the driver is close to an overflow.
> > >> I don't know the behavior of Intel NICs on overflow, but in our NICs it's
> > >> considered a critical error, that is followed by a recovery procedure, so
> > >> it's not something that should happen under normal workloads.
> > >
> > > I'm not sure I follow on this one. Feature is about overflowing the XSK
> > > receive ring, not the HW one, right?
> >
> > Right. So we have this pipeline of buffers:
> >
> > NIC--> [HW RX ring] --NAPI--> [XSK RX ring] --app--> consumes packets
> >
> > Currently, when the NIC puts stuff in HW RX ring, NAPI always runs and
> > drains it either to XSK RX ring or to /dev/null if XSK RX ring is full.
> > The driver fulfills its responsibility to prevent overflows of HW RX
> > ring. If the application doesn't consume quick enough, the frames will
> > be leaked, but it's only the application's issue, the driver stays
> > consistent.
> >
> > After the feature, it's possible to pause NAPI from the userspace
> > application, effectively disrupting the driver's consistency. I don't
> > think an XSK application should have this power.
>
> +1
> cover letter refers to busy poll, but did that test enable prefer busy
> poll w/ the timeout configured right? It seems like similar goal can
> be achieved with just that.

AF_XDP busy poll, where the app and driver run on the same core, does not
bring much value without configuring gro_flush_timeout and
napi_defer_hard_irqs, so all of the busy poll tests were done with:

echo 2 | sudo tee /sys/class/net/ens4f1/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/ens4f1/gro_flush_timeout

That said, in the case I'm trying to improve, performance can still suffer
and packets would not make it up to user space even with the timeout
configured.
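
For reference, a minimal sketch of the socket-side knobs that go along with
those two sysfs writes, i.e. how an XSK socket opts into preferred busy
polling. The xsk_fd name and the timeout/budget values below are only
placeholders, not the exact settings used in the tests:

#include <sys/socket.h>

/* Not all libc headers carry these yet; values from the kernel uapi. */
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL	69
#define SO_BUSY_POLL_BUDGET	70
#endif

static int xsk_enable_prefer_busy_poll(int xsk_fd)
{
	int opt;

	/* Prefer busy polling over interrupt-driven NAPI for this socket. */
	opt = 1;
	if (setsockopt(xsk_fd, SOL_SOCKET, SO_PREFER_BUSY_POLL,
		       &opt, sizeof(opt)))
		return -1;

	/* Busy-poll timeout in microseconds (placeholder value). */
	opt = 20;
	if (setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL,
		       &opt, sizeof(opt)))
		return -1;

	/* Max packets processed per busy-poll call (placeholder value). */
	opt = 64;
	if (setsockopt(xsk_fd, SOL_SOCKET, SO_BUSY_POLL_BUDGET,
		       &opt, sizeof(opt)))
		return -1;

	return 0;
}

With these set, the application is expected to drive NAPI itself via
poll()/recvfrom() on the socket rather than relying on the interrupt path.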