On Sun, Mar 07, 2010 at 04:34:32PM -0800, David Stevens wrote: > "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 03/07/2010 07:31:30 AM: > > > On Tue, Mar 02, 2010 at 05:20:15PM -0700, David Stevens wrote: > > > This patch generalizes buffer handling functions to > NULL, NULL); > > > + head.iov_base = (void *)vhost_get_vq_desc(&net->dev, > vq, > > > + vq->iov, ARRAY_SIZE(vq->iov), &out, &in, NULL, > > > > NULL); > > > > Should type for head be changed so that we do not need to cast? > > > > I also prefer aligning descendants on the opening (. > > Yes, that's probably better; the indentation with the cast would > still wrap a lot, but I'll see what I can do here. > > > > > err = sock->ops->sendmsg(NULL, sock, &msg, len); > > > if (unlikely(err < 0)) { > > > - vhost_discard_vq_desc(vq); > > > + vhost_discard(vq, 1); > > > > Isn't the original name a bit more descriptive? > > During development, I had both and I generally like > shorter names, but I can change it back. > > > > +static int skb_head_len(struct sk_buff_head *skq) > > > +{ > > > + struct sk_buff *head; > > > + > > > + head = skb_peek(skq); > > > + if (head) > > > + return head->len; > > > + return 0; > > > +} > > > + > > > > This is done without locking, which I think can crash > > if skb is consumed after we peek it but before we read the > > length. > > This thread is the only legitimate consumer, right? But > qemu has the file descriptor and I guess we shouldn't trust > that it won't give it to someone else; it'd break vhost, but > a crash would be inappropriate, of course. I'd like to avoid > the lock, but I have another idea here, so will investigate. > > > > > > > > /* Expects to be always run from workqueue - which acts as > > > * read-size critical section for our kind of RCU. */ > > > static void handle_rx(struct vhost_net *net) > > > { > > > struct vhost_virtqueue *vq = &net->dev.vqs[VHOST_NET_VQ_RX]; > > > - unsigned head, out, in, log, s; > > > + unsigned in, log, s; > > > struct vhost_log *vq_log; > > > struct msghdr msg = { > > > .msg_name = NULL, > > > @@ -204,10 +213,11 @@ > > > }; > > > > > > size_t len, total_len = 0; > > > - int err; > > > + int err, headcount, datalen; > > > size_t hdr_size; > > > struct socket *sock = rcu_dereference(vq->private_data); > > > - if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue)) > > > + > > > + if (!sock || !skb_head_len(&sock->sk->sk_receive_queue)) > > > return; > > > > > > > Isn't this equivalent? Do you expect zero len skbs in socket? > > If yes, maybe we should discard these, not stop processing ... > > A zero return means "no skb's". They are equivalent; I > changed this call just to make it identical to the loop check, > but I don't have a strong attachment to this. > > > > > vq_log = unlikely(vhost_has_feature(&net->dev, > VHOST_F_LOG_ALL)) ? > > > vq->log : NULL; > > > > > > - for (;;) { > > > - head = vhost_get_vq_desc(&net->dev, vq, vq->iov, > > > - ARRAY_SIZE(vq->iov), > > > - &out, &in, > > > - vq_log, &log); > > > + while ((datalen = skb_head_len(&sock->sk->sk_receive_queue))) > { > > > > This peeks in the queue to figure out how much data we have. > > It's a neat trick, but it does introduce new failure modes > > where an skb is consumed after we did the peek: > > - new skb could be shorter, we should return the unconsumed part > > - new skb could be longer, this will drop a packet. > > maybe this last is not an issue as the race should be rare in practice > > As before, this loop is the only legitimate consumer of skb's; if > anyone else is reading the socket, it's broken. But since the file > descriptor is available to qemu, it's probably trusting qemu too much. > I don't believe there is a race at all on a working system; the head > can't change until we read it after this test, but I'll see if I can > come up with something better here. Closing the fd for qemu so vhost > has exclusive access might do it, but otherwise we could just get a > max-sized packet worth of buffers and then return them after the read. > That'd force us to wait with small packets when the ring is nearly > full, where we don't have to now, but I expect locking to read the head > length will hurt performance in all cases. Will try these ideas out.k > > > > > > + headcount = vhost_get_heads(vq, datalen, &in, vq_log, > > > &log); > > > /* OK, now we need to know about added descriptors. */ > > > - if (head == vq->num) { > > > + if (!headcount) { > > > if (unlikely(vhost_enable_notify(vq))) { > > > /* They have slipped one in as we were > > > * doing that: check again. */ > > > @@ -235,13 +242,6 @@ > > > * they refilled. */ > > > break; > > > } > > > - /* We don't need to be notified again. */ > > > > you find this comment unhelpful? > > This code is changed in the later patches; the comment was > removed with that, but I got it in the wrong patch on the split. > I guess the comment is ok to stay, anyway, but notification may > require multiple buffers to be available; I had fixed that here, but > the final patch pushes that into the notify code, so yes, this can > go back in. > > > > - if (out) { > > > - vq_err(vq, "Unexpected descriptor format for > RX: " > > > - "out %d, int %d\n", > > > - out, in); > > > - break; > > > - } > > > > > > we still need this check, don't we? > > It's done in vhost_get_heads(); "out" isn't visible in handle_rx() > anymore. > > > > +unsigned vhost_get_heads(struct vhost_virtqueue *vq, int datalen, int > > > > *iovcount, > > > + struct vhost_log *log, unsigned int *log_num) > > > > Could you please document this function's parameters? It's not obvious > > what iovcount, log, log_num are. > > Yes. To answer the question, they are the same as vhost_get_vq_desc(), > except "iovcount" replaces "in" (and "out" is not needed in the caller). > > > > +{ > > > + struct iovec *heads = vq->heads; > > > + int out, in; > > > + int hc = 0; > > > + > > > + while (datalen > 0) { > > > + if (hc >= VHOST_NET_MAX_SG) { > > > + vhost_discard(vq, hc); > > > + return 0; > > > + } > > > + heads[hc].iov_base = (void > *)vhost_get_vq_desc(vq->dev, > > > vq, > > > + vq->iov, ARRAY_SIZE(vq->iov), &out, &in, log, > > > log_num); > > > + if (heads[hc].iov_base == (void *)vq->num) { > > > + vhost_discard(vq, hc); > > > + return 0; > > > + } > > > + if (out || in <= 0) { > > > + vq_err(vq, "unexpected descriptor format for > RX: " > > > + "out %d, in %d\n", out, in); > > > + vhost_discard(vq, hc); > > > + return 0; > > > + } > > > + heads[hc].iov_len = iov_length(vq->iov, in); > > > + hc++; > > > + datalen -= heads[hc].iov_len; > > > + } > > > + *iovcount = in; > > > > > > Only the last value? > > In this part of the split, it only goes through once; this is > changed to the accumulated value "seg" in a later patch. So, split > artifact. > > { > > > struct vring_used_elem *used; > > > + int i; > > > > > > - /* The virtqueue contains a ring of used buffers. Get a > pointer > > > to the > > > - * next entry in that used ring. */ > > > - used = &vq->used->ring[vq->last_used_idx % vq->num]; > > > - if (put_user(head, &used->id)) { > > > - vq_err(vq, "Failed to write used id"); > > > - return -EFAULT; > > > - } > > > - if (put_user(len, &used->len)) { > > > - vq_err(vq, "Failed to write used len"); > > > - return -EFAULT; > > > + for (i=0; i<count; i++, vq->last_used_idx++) { > > > > whitespace damage: I prefer space around =, <. > > I also use ++i, etc in this driver, better be consistent? > > Also for clarity, I prefer to put vq->last_used_idx inside the loop. > > OK. > > > > > + used = &vq->used->ring[vq->last_used_idx % vq->num]; > > > + if (put_user((unsigned)heads[i].iov_base, &used->id)) > { > > > + vq_err(vq, "Failed to write used id"); > > > + return -EFAULT; > > > + } > > > + if (put_user(heads[i].iov_len, &used->len)) { > > > + vq_err(vq, "Failed to write used len"); > > > + return -EFAULT; > > > + } > > > > If this fails, last_used_idx will still be incremented, which I think is > wrong. > > True, but if these fail, aren't we dead? I don't think we can > recover > from an EFAULT in any of these; I didn't test those paths from the > original, > but I think we need to bail out entirely for these cases, right? > > +-DLS Yes, we stop on error, but host can fix uo the vq and redo a kick. That's why there's errorfd. -- MST -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html