Re: [PATCH net-next v2] tun: support NAPI for packets received from batched XDP buffs

Jason Wang <jasowang@xxxxxxxxxx> · Mon, 28 Feb 2022 12:20:06 +0800

On Mon, Feb 28, 2022 at 12:06 PM Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
>
>
> On 2/25/22 01:02, Harold Huang wrote:
> > In tun, NAPI is supported and we can also use NAPI in the path of
> > batched XDP buffs to accelerate packet processing. What is more, after
> > we use NAPI, GRO is also supported. The iperf shows that the throughput of
> > single stream could be improved from 4.5Gbps to 9.2Gbps. Additionally, 9.2
> > Gbps nearly reachs the line speed of the phy nic and there is still about
> > 15% idle cpu core remaining on the vhost thread.
> >
> > Test topology:
> >
> > [iperf server]<--->tap<--->dpdk testpmd<--->phy nic<--->[iperf client]
> >
> > Iperf stream:
> >
> > Before:
> > ...
> > [  5]   5.00-6.00   sec   558 MBytes  4.68 Gbits/sec    0   1.50 MBytes
> > [  5]   6.00-7.00   sec   556 MBytes  4.67 Gbits/sec    1   1.35 MBytes
> > [  5]   7.00-8.00   sec   556 MBytes  4.67 Gbits/sec    2   1.18 MBytes
> > [  5]   8.00-9.00   sec   559 MBytes  4.69 Gbits/sec    0   1.48 MBytes
> > [  5]   9.00-10.00  sec   556 MBytes  4.67 Gbits/sec    1   1.33 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  5.39 GBytes  4.63 Gbits/sec   72          sender
> > [  5]   0.00-10.04  sec  5.39 GBytes  4.61 Gbits/sec               receiver
> >
> > After:
> > ...
> > [  5]   5.00-6.00   sec  1.07 GBytes  9.19 Gbits/sec    0   1.55 MBytes
> > [  5]   6.00-7.00   sec  1.08 GBytes  9.30 Gbits/sec    0   1.63 MBytes
> > [  5]   7.00-8.00   sec  1.08 GBytes  9.25 Gbits/sec    0   1.72 MBytes
> > [  5]   8.00-9.00   sec  1.08 GBytes  9.25 Gbits/sec   77   1.31 MBytes
> > [  5]   9.00-10.00  sec  1.08 GBytes  9.24 Gbits/sec    0   1.48 MBytes
> > - - - - - - - - - - - - - - - - - - - - - - - - -
> > [ ID] Interval           Transfer     Bitrate         Retr
> > [  5]   0.00-10.00  sec  10.8 GBytes  9.28 Gbits/sec  166          sender
> > [  5]   0.00-10.04  sec  10.8 GBytes  9.24 Gbits/sec               receiver
> > ....
> >
> > Reported-at: https://lore.kernel.org/all/CACGkMEvTLG0Ayg+TtbN4q4pPW-ycgCCs3sC3-TF8cuRTf7Pp1A@xxxxxxxxxxxxxx
> > Signed-off-by: Harold Huang <baymaxhuang@xxxxxxxxx>
> > ---
> > v1 -> v2
> >   - fix commit messages
> >   - add queued flag to avoid void unnecessary napi suggested by Jason
> >
> >   drivers/net/tun.c | 20 ++++++++++++++++----
> >   1 file changed, 16 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index fed85447701a..c7d8b7c821d8 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -2379,7 +2379,7 @@ static void tun_put_page(struct tun_page *tpage)
> >   }
> >
> >   static int tun_xdp_one(struct tun_struct *tun,
> > -                    struct tun_file *tfile,
> > +                    struct tun_file *tfile, int *queued,
> >                      struct xdp_buff *xdp, int *flush,
> >                      struct tun_page *tpage)
> >   {
> > @@ -2388,6 +2388,7 @@ static int tun_xdp_one(struct tun_struct *tun,
> >       struct virtio_net_hdr *gso = &hdr->gso;
> >       struct bpf_prog *xdp_prog;
> >       struct sk_buff *skb = NULL;
> > +     struct sk_buff_head *queue;
> >       u32 rxhash = 0, act;
> >       int buflen = hdr->buflen;
> >       int err = 0;
> > @@ -2464,7 +2465,15 @@ static int tun_xdp_one(struct tun_struct *tun,
> >           !tfile->detached)
> >               rxhash = __skb_get_hash_symmetric(skb);
> >
> > -     netif_receive_skb(skb);
> > +     if (tfile->napi_enabled) {
> > +             queue = &tfile->sk.sk_write_queue;
> > +             spin_lock(&queue->lock);
> > +             __skb_queue_tail(queue, skb);
> > +             spin_unlock(&queue->lock);
> > +             (*queued)++;
> > +     } else {
> > +             netif_receive_skb(skb);
> > +     }
> >
> >       /* No need to disable preemption here since this function is
> >        * always called with bh disabled
> > @@ -2492,7 +2501,7 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> >       if (ctl && (ctl->type == TUN_MSG_PTR)) {
> >               struct tun_page tpage;
> >               int n = ctl->num;
> > -             int flush = 0;
> > +             int flush = 0, queued = 0;
> >
> >               memset(&tpage, 0, sizeof(tpage));
> >
> > @@ -2501,12 +2510,15 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
> >
> >               for (i = 0; i < n; i++) {
> >                       xdp = &((struct xdp_buff *)ctl->ptr)[i];
> > -                     tun_xdp_one(tun, tfile, xdp, &flush, &tpage);
> > +                     tun_xdp_one(tun, tfile, &queued, xdp, &flush, &tpage);
>
>
> How big n can be ?
>
> BTW I could not find where m->msg_controllen was checked in tun_sendmsg().
>
> struct tun_msg_ctl *ctl = m->msg_control;
>
> if (ctl && (ctl->type == TUN_MSG_PTR)) {
>
>      int n = ctl->num;  // can be set to values in [0..65535]
>
>      for (i = 0; i < n; i++) {
>
>          xdp = &((struct xdp_buff *)ctl->ptr)[i];
>
>
> I really do not understand how we prevent malicious user space from
> crashing the kernel.

It looks to me the only user for this is vhost-net which limits it to
64, userspace can't use sendmsg() directly on tap.

Thanks

>
>
>
> >               }
> >
> >               if (flush)
> >                       xdp_do_flush();
> >
> > +             if (tfile->napi_enabled && queued > 0)
> > +                     napi_schedule(&tfile->napi);
> > +
> >               rcu_read_unlock();
> >               local_bh_enable();
> >
>