Re: [PATCH] Add provision to busyloop for events in ep_poll.

Joe Damato <jdamato@xxxxxxxxxx> · Wed, 4 Sep 2024 17:58:26 +0200

On Tue, Sep 03, 2024 at 07:18:14PM -0700, Naman Gulati wrote:
> Thanks all for the comments and apologies for the delay in replying.
> Stanislav and Joe I’ve addressed some of the common concerns below.
> 
> On Thu, Aug 29, 2024 at 3:40 AM Joe Damato <jdamato@xxxxxxxxxx> wrote:
> >
> > On Wed, Aug 28, 2024 at 06:10:11PM +0000, Naman Gulati wrote:
> > > NAPI busypolling in ep_busy_loop loops on napi_poll and checks for new
> > > epoll events after every napi poll. Checking just for epoll events in a
> > > tight loop in the kernel context delivers latency gains to applications
> > > that are not interested in napi busypolling with epoll.
> > >
> > > This patch adds an option to loop just for new events inside
> > > ep_busy_loop, guarded by the EPIOCSPARAMS ioctl that controls epoll napi
> > > busypolling.
> >
> > This makes an API change, so I think that linux-api@xxxxxxxxxxxxxxx
> > needs to be CC'd ?
> >
> > > A comparison with neper tcp_rr shows that busylooping for events in
> > > epoll_wait boosted throughput by ~3-7% and reduced median latency by
> > > ~10%.
> > >
> > > To demonstrate the latency and throughput improvements, a comparison was
> > > made of neper tcp_rr running with:
> > >     1. (baseline) No busylooping
> >
> > Is there NAPI-based steering to threads via SO_INCOMING_NAPI_ID in
> > this case? More details, please, on locality. If there is no
> > NAPI-based flow steering in this case, perhaps the improvements you
> > are seeing are a result of both syscall overhead avoidance and data
> > locality?
> 
> The benchmarks were run with no NAPI steering.
> 
> Regarding syscall overhead, I reproduced the above experiment with
> mitigations=off
> and found similar results as above. Pointing to the fact that the above
> gains are
> materialized from more than just avoiding syscall overhead.

I think the cover letter needs more benchmark cases, data points,
and documentation about precisely how the benchmark is being run and
configured. The data feels quite sparse, the explanation of the
experiment leaves too much out, and - with respect - I am not
convinced by the results posted.

> By locality are you referring to Numa locality?

CPU cache locality.

> >
> > >     2. (epoll busylooping) enabling the epoll busy looping on all epoll
> > >     fd's
> >
> > This is the case you've added, event_poll_only ? It seems like in
> > this case you aren't busy looping exactly, you are essentially
> > allowing IRQ/softIRQ to drive processing and checking on wakeup that
> > events are available.
> >
> > IMHO, I'm not sure if "epoll busylooping" is an appropriate
> > description.
> 
> I see your point, perhaps "spinning" or just “looping” could be a closer
> word?

I don't know how to describe the patch, to be totally honest. I
think the basis of the mechanism needs to be more thoroughly
understood. See below.

> >
> > >     3. (userspace busylooping) looping on epoll_wait in userspace
> > >     with timeout=0
> >
> > Same question as Stanislav; timeout=0 should get ep_loop to transfer
> > events immediately (if there are any) and return without actually
> > invoking busy poll. So, it would seem that your ioctl change
> > shouldn't be necessary since the equivalent behavior is already
> > possible with timeout=0.
> >
> > I'd probably investigate both syscall overhead and data locality
> > before approving this patch because it seems a bit suspicious to me.
> >
> > >
> > > Stats for two machines with 100Gbps NICs running tcp_rr with 5 threads
> > > and varying flows:
> > >
> > > Type                Flows   Throughput             Latency (μs)
> > >                              (B/s)      P50   P90    P99   P99.9
>  P99.99
> > > baseline            15            272145      57.2  71.9   91.4  100.6
>  111.6
> > > baseline            30            464952      66.8  78.8   98.1  113.4
>  122.4
> > > baseline            60            695920      80.9  118.5  143.4 161.8
>  174.6
> > > epoll busyloop      15            301751      44.7  70.6   84.3  95.4
>   106.5
> > > epoll busyloop      30            508392      58.9  76.9   96.2  109.3
>  118.5
> > > epoll busyloop      60            745731      77.4  106.2  127.5 143.1
>  155.9
> > > userspace busyloop  15            279202      55.4  73.1   85.2  98.3
>   109.6
> > > userspace busyloop  30            472440      63.7  78.2   96.5  112.2
>  120.1
> > > userspace busyloop  60            720779      77.9  113.5  134.9 152.6
>  165.7
> > >
> > > Per the above data epoll busyloop outperforms baseline and userspace
> > > busylooping in both throughput and latency. As the density of flows per
> > > thread increased, the median latency of all three epoll mechanisms
> > > converges. However epoll busylooping is better at capturing the tail
> > > latencies at high flow counts.
> > >
> > > Signed-off-by: Naman Gulati <namangulati@xxxxxxxxxx>
> > > ---
> > >  fs/eventpoll.c                 | 53 ++++++++++++++++++++++++++--------
> > >  include/uapi/linux/eventpoll.h |  3 +-
> > >  2 files changed, 43 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > > index f53ca4f7fcedd..6cba79261817a 100644
> > > --- a/fs/eventpoll.c
> > > +++ b/fs/eventpoll.c
> > > @@ -232,7 +232,10 @@ struct eventpoll {
> > >       u32 busy_poll_usecs;
> > >       /* busy poll packet budget */
> > >       u16 busy_poll_budget;
> > > -     bool prefer_busy_poll;
> > > +     /* prefer to busypoll in napi poll */
> > > +     bool napi_prefer_busy_poll;
> >
> > Adding napi seems slightly redundant to me but I could be convinced either
> > way, I suppose.
> 
> With the two different polling booleans in the struct, I felt it's better
> to be explicit
> about the scope of each.
> 
> >
> > > +     /* avoid napi poll when busy looping and poll only for events */
> > > +     bool event_poll_only;
> >
> > I'm not sure about this overall; this isn't exactly what I think of
> > when I think about the word "polling" but maybe I'm being too
> > nit-picky.
> 
> I'm not sure how else to categorize the operation, is there some other
> phrasing
> you'd recommend?
> 
> >
> > >  #endif
> > >
> > >  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> > > @@ -430,6 +433,24 @@ static bool ep_busy_loop_end(void *p, unsigned
> long start_time)
> > >       return ep_events_available(ep) ||
> busy_loop_ep_timeout(start_time, ep);
> > >  }
> > >
> > > +/**
> > > + * ep_event_busy_loop - loop until events available or busy poll
> > > + * times out.
> > > + *
> > > + * @ep: Pointer to the eventpoll context.
> > > + *
> > > + * Return: true if events available, false otherwise.
> > > + */
> > > +static bool ep_event_busy_loop(struct eventpoll *ep)
> > > +{
> > > +     unsigned long start_time = busy_loop_current_time();
> > > +
> > > +     while (!ep_busy_loop_end(ep, start_time))
> > > +             cond_resched();
> > > +
> > > +     return ep_events_available(ep);
> > > +}
> > > +
> > >  /*
> > >   * Busy poll if globally on and supporting sockets found && no events,
> > >   * busy loop will return if need_resched or ep_events_available.
> > > @@ -440,23 +461,29 @@ static bool ep_busy_loop(struct eventpoll *ep,
> int nonblock)
> > >  {
> > >       unsigned int napi_id = READ_ONCE(ep->napi_id);
> > >       u16 budget = READ_ONCE(ep->busy_poll_budget);
> > > -     bool prefer_busy_poll = READ_ONCE(ep->prefer_busy_poll);
> > > +     bool event_poll_only = READ_ONCE(ep->event_poll_only);
> > >
> > >       if (!budget)
> > >               budget = BUSY_POLL_BUDGET;
> > >
> > > -     if (napi_id >= MIN_NAPI_ID && ep_busy_loop_on(ep)) {
> > > +     if (!ep_busy_loop_on(ep))
> > > +             return false;
> > > +
> > > +     if (event_poll_only) {
> > > +             return ep_event_busy_loop(ep);
> > > +     } else if (napi_id >= MIN_NAPI_ID) {
> > > +             bool napi_prefer_busy_poll =
> READ_ONCE(ep->napi_prefer_busy_poll);
> > > +
> > >               napi_busy_loop(napi_id, nonblock ? NULL :
> ep_busy_loop_end,
> > > -                            ep, prefer_busy_poll, budget);
> > > +                             ep, napi_prefer_busy_poll, budget);
> > >               if (ep_events_available(ep))
> > >                       return true;
> > >               /*
> > > -              * Busy poll timed out.  Drop NAPI ID for now, we can add
> > > -              * it back in when we have moved a socket with a valid
> NAPI
> > > -              * ID onto the ready list.
> > > -              */
> > > +             * Busy poll timed out.  Drop NAPI ID for now, we can add
> > > +             * it back in when we have moved a socket with a valid NAPI
> > > +             * ID onto the ready list.
> > > +             */
> > >               ep->napi_id = 0;
> > > -             return false;
> > >       }
> > >       return false;
> > >  }
> > > @@ -523,13 +550,15 @@ static long ep_eventpoll_bp_ioctl(struct file
> *file, unsigned int cmd,
> > >
> > >               WRITE_ONCE(ep->busy_poll_usecs,
> epoll_params.busy_poll_usecs);
> > >               WRITE_ONCE(ep->busy_poll_budget,
> epoll_params.busy_poll_budget);
> > > -             WRITE_ONCE(ep->prefer_busy_poll,
> epoll_params.prefer_busy_poll);
> > > +             WRITE_ONCE(ep->napi_prefer_busy_poll,
> epoll_params.prefer_busy_poll);
> > > +             WRITE_ONCE(ep->event_poll_only,
> epoll_params.event_poll_only);
> > >               return 0;
> > >       case EPIOCGPARAMS:
> > >               memset(&epoll_params, 0, sizeof(epoll_params));
> > >               epoll_params.busy_poll_usecs =
> READ_ONCE(ep->busy_poll_usecs);
> > >               epoll_params.busy_poll_budget =
> READ_ONCE(ep->busy_poll_budget);
> > > -             epoll_params.prefer_busy_poll =
> READ_ONCE(ep->prefer_busy_poll);
> > > +             epoll_params.prefer_busy_poll =
> READ_ONCE(ep->napi_prefer_busy_poll);
> > > +             epoll_params.event_poll_only =
> READ_ONCE(ep->event_poll_only);
> > >               if (copy_to_user(uarg, &epoll_params,
> sizeof(epoll_params)))
> > >                       return -EFAULT;
> > >               return 0;
> > > @@ -2203,7 +2232,7 @@ static int do_epoll_create(int flags)
> > >  #ifdef CONFIG_NET_RX_BUSY_POLL
> > >       ep->busy_poll_usecs = 0;
> > >       ep->busy_poll_budget = 0;
> > > -     ep->prefer_busy_poll = false;
> > > +     ep->napi_prefer_busy_poll = false;
> > >  #endif
> >
> > Just FYI: This is going to conflict with a patch I've sent to VFS
> > that hasn't quite made its way back to net-next just yet.
> >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.misc&id=4eb76c5d9a8851735fd3ec5833ecf412e8921655
> >
> Acknowledged.
> 
> > >       ep->file = file;
> > >       fd_install(fd, file);
> > > diff --git a/include/uapi/linux/eventpoll.h
> b/include/uapi/linux/eventpoll.h
> > > index 4f4b948ef3811..3bc0f4eed976c 100644
> > > --- a/include/uapi/linux/eventpoll.h
> > > +++ b/include/uapi/linux/eventpoll.h
> > > @@ -89,9 +89,10 @@ struct epoll_params {
> > >       __u32 busy_poll_usecs;
> > >       __u16 busy_poll_budget;
> > >       __u8 prefer_busy_poll;
> > > +     __u8 event_poll_only:1;
> > >
> > >       /* pad the struct to a multiple of 64bits */
> > > -     __u8 __pad;
> > > +     __u8 __pad:7;
> > >  };
> >
> > If the above is accepted then a similar change should make its way
> > into glibc, uclibc-ng, and musl. It might be easier to add an
> > entirely new ioctl.
> 
> Adding a new ioctl seems preferable, I can look into reworking the code
> accordingly.

My advice is prior to reworking the code: figure out why the results
show an improvement because I can't really see a clear explanation.

If I understand the patch correctly (and perhaps I've gotten it
wrong) it is proposing to add a new mechanism for packet processing
that is quite strange:

  Let softIRQ do the work (so there's no 'polling'), but avoid
  returning to userland and call cond_resched() to defer execution
  to other things until there's events to be returned.

Couldn't this be approximated with a smaller batch size (maxevents)
and a timeout of 0?

Perhaps consider comparing performance with a more real world
example (i.e. not tcp_rr) and use NAPI-steering in the base case to
eliminate cache locality as a variable? Maybe even compare against
the suspend mechanism Martin and I proposed?

- Joe