Re: [RFC][ PATCH 1/3] vhost-net: support multiple buffer heads in receiver

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Mon, 8 Mar 2010 09:45:33 +0200

On Sun, Mar 07, 2010 at 04:34:32PM -0800, David Stevens wrote:
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 03/07/2010 07:31:30 AM:
> 
> > On Tue, Mar 02, 2010 at 05:20:15PM -0700, David Stevens wrote:
> > > This patch generalizes buffer handling functions to
>                                       NULL, NULL);
> > > +               head.iov_base = (void *)vhost_get_vq_desc(&net->dev, 
> vq,
> > > +                       vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, 
> 
> > > NULL);
> > 
> > Should type for head be changed so that we do not need to cast?
> > 
> > I also prefer aligning descendants on the opening (.
> 
>         Yes, that's probably better; the indentation with the cast would
> still wrap a lot, but I'll see what I can do here.
> 
> 
> > >                 err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > >                 if (unlikely(err < 0)) {
> > > -                       vhost_discard_vq_desc(vq);
> > > +                       vhost_discard(vq, 1);
> > 
> > Isn't the original name a bit more descriptive?
> 
>         During development, I had both and I generally like
> shorter names, but I can change it back.
> 
> > > +static int skb_head_len(struct sk_buff_head *skq)
> > > +{
> > > +       struct sk_buff *head;
> > > +
> > > +       head = skb_peek(skq);
> > > +       if (head)
> > > +               return head->len;
> > > +       return 0;
> > > +}
> > > +
> > 
> > This is done without locking, which I think can crash
> > if skb is consumed after we peek it but before we read the
> > length.
> 
>         This thread is the only legitimate consumer, right? But
> qemu has the file descriptor and I guess we shouldn't trust
> that it won't give it to someone else; it'd break vhost, but
> a crash would be inappropriate, of course. I'd like to avoid
> the lock, but I have another idea here, so will investigate.
> 
> > 
> > 
> > >  /* Expects to be always run from workqueue - which acts as
> > >   * read-size critical section for our kind of RCU. */
> > >  static void handle_rx(struct vhost_net *net)
> > >  {
> > >         struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX];
> > > -       unsigned head, out, in, log, s;
> > > +       unsigned in, log, s;
> > >         struct vhost_log *vq_log;
> > >         struct msghdr msg = {
> > >                 .msg_name = NULL,
> > > @@ -204,10 +213,11 @@
> > >         };
> > > 
> > >         size_t len, total_len = 0;
> > > -       int err;
> > > +       int err, headcount, datalen;
> > >         size_t hdr_size;
> > >         struct socket *sock = rcu_dereference(vq->private_data);
> > > -       if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue))
> > > +
> > > +       if (!sock || !skb_head_len(&sock->sk->sk_receive_queue))
> > >                 return;
> > > 
> > 
> > Isn't this equivalent? Do you expect zero len skbs in socket?
> > If yes, maybe we should discard these, not stop processing ...
> 
>         A zero return means "no skb's". They are equivalent; I
> changed this call just to make it identical to the loop check,
> but I don't have a strong attachment to this.
> 
> 
> > >         vq_log = unlikely(vhost_has_feature(&net->dev, 
> VHOST_F_LOG_ALL)) ?
> > >                 vq->log : NULL;
> > > 
> > > -       for (;;) {
> > > -               head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > > -                                        ARRAY_SIZE(vq->iov),
> > > -                                        &out, &in,
> > > -                                        vq_log, &log);
> > > +       while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) 
> {
> > 
> > This peeks in the queue to figure out how much data we have.
> > It's a neat trick, but it does introduce new failure modes
> > where an skb is consumed after we did the peek:
> > - new skb could be shorter, we should return the unconsumed part
> > - new skb could be longer, this will drop a packet.
> >   maybe this last is not an issue as the race should be rare in practice
> 
>         As before, this loop is the only legitimate consumer of skb's; if
> anyone else is reading the socket, it's broken. But since the file
> descriptor is available to qemu, it's probably trusting qemu too much.
> I don't believe there is a race at all on a working system; the head
> can't change until we read it after this test, but I'll see if I can
> come up with something better here. Closing the fd for qemu so vhost
> has exclusive access might do it, but otherwise we could just get a
> max-sized packet worth of buffers and then return them after the read.
> That'd force us to wait with small packets when the ring is nearly
> full, where we don't have to now, but I expect locking to read the head
> length will hurt performance in all cases. Will try these ideas out.k
> 
> > 
> > > +               headcount = vhost_get_heads(vq, datalen, &in, vq_log, 
> > > &log);
> > >                 /* OK, now we need to know about added descriptors. */
> > > -               if (head == vq->num) {
> > > +               if (!headcount) {
> > >                         if (unlikely(vhost_enable_notify(vq))) {
> > >                                 /* They have slipped one in as we were
> > >                                  * doing that: check again. */
> > > @@ -235,13 +242,6 @@
> > >                          * they refilled. */
> > >                         break;
> > >                 }
> > > -               /* We don't need to be notified again. */
> > 
> > you find this comment unhelpful?
> 
>         This code is changed in the later patches; the comment was
> removed with that, but I got it in the wrong patch on the split.
> I guess the comment is ok to stay, anyway, but notification may
> require multiple buffers to be available; I had fixed that here, but
> the final patch pushes that into the notify code, so yes, this can
> go back in.
> 
> > > -               if (out) {
> > > -                       vq_err(vq, "Unexpected descriptor format for 
> RX: "
> > > -                              "out %d, int %d\n",
> > > -                              out, in);
> > > -                       break;
> > > -               }
> > 
> > 
> > we still need this check, don't we?
> 
>         It's done in vhost_get_heads(); "out" isn't visible in handle_rx()
> anymore.
>  
> > > +unsigned vhost_get_heads(struct vhost_virtqueue *vq, int datalen, int 
> 
> > > *iovcount,
> > > +       struct vhost_log *log, unsigned int *log_num)
> > 
> > Could you please document this function's parameters? It's not obvious
> > what iovcount, log, log_num are.
> 
> Yes. To answer the question, they are the same as vhost_get_vq_desc(),
> except "iovcount" replaces "in" (and "out" is not needed in the caller).
> 
> > > +{
> > > +       struct iovec *heads = vq->heads;
> > > +       int out, in;
> > > +       int hc = 0;
> > > +
> > > +       while (datalen > 0) {
> > > +               if (hc >= VHOST_NET_MAX_SG) {
> > > +                       vhost_discard(vq, hc);
> > > +                       return 0;
> > > +               }
> > > +               heads[hc].iov_base = (void 
> *)vhost_get_vq_desc(vq->dev, 
> > > vq,
> > > +                       vq->iov, ARRAY_SIZE(vq->iov), &out, &in, log, 
> > > log_num);
> > > +               if (heads[hc].iov_base == (void *)vq->num) {
> > > +                       vhost_discard(vq, hc);
> > > +                       return 0;
> > > +               }
> > > +               if (out || in <= 0) {
> > > +                       vq_err(vq, "unexpected descriptor format for 
> RX: "
> > > +                               "out %d, in %d\n", out, in);
> > > +                       vhost_discard(vq, hc);
> > > +                       return 0;
> > > +               }
> > > +               heads[hc].iov_len = iov_length(vq->iov, in);
> > > +               hc++;
> > > +               datalen -= heads[hc].iov_len;
> > > +       }
> > > +       *iovcount = in;
> > 
> > 
> > Only the last value?
> 
>         In this part of the split, it only goes through once; this is
> changed to the accumulated value "seg" in a later patch. So, split
> artifact.
> 
>   {
> > >         struct vring_used_elem *used;
> > > +       int i;
> > > 
> > > -       /* The virtqueue contains a ring of used buffers.  Get a 
> pointer 
> > > to the
> > > -        * next entry in that used ring. */
> > > -       used = &vq->used->ring[vq->last_used_idx % vq->num];
> > > -       if (put_user(head, &used->id)) {
> > > -               vq_err(vq, "Failed to write used id");
> > > -               return -EFAULT;
> > > -       }
> > > -       if (put_user(len, &used->len)) {
> > > -               vq_err(vq, "Failed to write used len");
> > > -               return -EFAULT;
> > > +       for (i=0; i<count; i++, vq->last_used_idx++) {
> > 
> > whitespace damage: I prefer space around =, <.
> > I also use ++i, etc in this driver, better be consistent?
> > Also for clarity, I prefer to put vq->last_used_idx inside the loop.
> 
>         OK.
> > 
> > > +               used = &vq->used->ring[vq->last_used_idx % vq->num];
> > > +               if (put_user((unsigned)heads[i].iov_base, &used->id)) 
> {
> > > +                       vq_err(vq, "Failed to write used id");
> > > +                       return -EFAULT;
> > > +               }
> > > +               if (put_user(heads[i].iov_len, &used->len)) {
> > > +                       vq_err(vq, "Failed to write used len");
> > > +                       return -EFAULT;
> > > +               }
> > 
> > If this fails, last_used_idx will still be incremented, which I think is 
> wrong.
> 
>         True, but if these fail, aren't we dead? I don't think we can 
> recover
> from an EFAULT in any of these; I didn't test those paths from the 
> original,
> but I think we need to bail out entirely for these cases, right?
> 
>  +-DLS

Yes, we stop on error, but host can fix uo the vq and redo a kick.
That's why there's errorfd.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html