On Wed, Apr 07, 2010 at 10:37:17AM -0700, David Stevens wrote: > > > > Thanks! > > There's some whitespace damage, are you sending with your new > > sendmail setup? It seems to have worked for qemu patches ... > > Yes, I saw some line wraps in what I received, but I checked > the original draft to be sure and they weren't there. Possibly from > the relay; Sigh. > > > > > @@ -167,8 +166,15 @@ static void handle_tx(struct vhost_net * > > > /* TODO: Check specific error and bomb out unless ENOBUFS? */ > > > err = sock->ops->sendmsg(NULL, sock, &msg, len); > > > if (unlikely(err < 0)) { > > > - vhost_discard_vq_desc(vq); > > > - tx_poll_start(net, sock); > > > + if (err == -EAGAIN) { > > > + vhost_discard_desc(vq, 1); > > > + tx_poll_start(net, sock); > > > + } else { > > > + vq_err(vq, "sendmsg: errno %d\n", -err); > > > + /* drop packet; do not discard/resend */ > > > + vhost_add_used_and_signal(&net->dev, vq, head, > > > + 0); > > > > vhost does not currently has a consistent error handling strategy: > > if we drop packets, need to think which other errors should cause > > packet drops. I prefer to just call vq_err for now, > > and have us look at handling segfaults etc in a consistent way > > separately. > > I had to add this to avoid an infinite loop when I wrote a bad > packet on the socket. I agree error handling needs a better look, > but retrying a bad packet continuously while dumping in the log > is what it was doing when I hit an error before this code. Isn't > this better, at least until a second look? > Hmm, what do you mean 'continuously'? Don't we only try again on next kick? > > > +} > > > + > > > > I wonder whether it makes sense to check > > skb_queue_empty(&sk->sk_receive_queue) > > outside the lock, to reduce the cost of this call > > on an empty queue (we know that it happens at least once > > each time we exit the loop on rx)? > > I was looking at alternatives to adding the lock in the > first place, but I found I couldn't measure a difference in the > cost with and without the lock. > > > > > + int retries = 0; > > > size_t len, total_len = 0; > > > - int err; > > > + int err, headcount, datalen; > > > size_t hdr_size; > > > struct socket *sock = rcu_dereference(vq->private_data); > > > if (!sock || skb_queue_empty(&sock->sk->sk_receive_queue)) > > > @@ -222,31 +242,25 @@ static void handle_rx(struct vhost_net * > > > vq_log = unlikely(vhost_has_feature(&net->dev, VHOST_F_LOG_ALL)) ? > > > vq->log : NULL; > > > > > > - for (;;) { > > > - head = vhost_get_vq_desc(&net->dev, vq, vq->iov, > > > - ARRAY_SIZE(vq->iov), > > > - &out, &in, > > > - vq_log, &log); > > > + while ((datalen = vhost_head_len(sock->sk))) { > > > + headcount = vhost_get_desc_n(vq, vq->heads, datalen, &in, > > > + vq_log, &log); > > > > This looks like a bug, I think we need to pass > > datalen + header size to vhost_get_desc_n. > > Not sure how we know the header size that backend will use though. > > Maybe just look at our features. > > Yes; we have hdr_size, so I can add it here. It'll be 0 for > the cases where the backend and guest both have vnet header (either > the regular or larger mergeable buffers one), but should be added > in for future raw socket support. So hdr_size is the wrong thing to add then. We need to add non-zero value for tap now. > > > > > /* OK, now we need to know about added descriptors. */ > > > - if (head == vq->num) { > > > - if (unlikely(vhost_enable_notify(vq))) { > > > + if (!headcount) { > > > + if (retries == 0 && unlikely(vhost_enable_notify(vq))) { > > > /* They have slipped one in as we were > > > * doing that: check again. */ > > > vhost_disable_notify(vq); > > > + retries++; > > > continue; > > > } > > > > Hmm. The reason we have the code at all, as the comment says, is because > > guest could have added more buffers between the time we read last index > > and the time we enabled notification. So if we just break like this > > the race still exists. We could remember the > > last head value we observed, and have vhost_enable_notify check > > against this value? > > This is to prevent a spin loop in the case where we have some > buffers available, but not enough for the current packet (ie, this > is the replacement code for the "rxmaxheadcount" business). If they > actually added something new, retrying once should see it, but what > vhost_enable_notify() returns non-zero on is not "new buffers" but > rather "not empty". In the case mentioned, we aren't empty, so > vhost_enable_notify() returns nonzero every time, but the guest hasn't > given us enough buffers to proceed, so we continuously retry; this > code breaks the spin loop until we've really got new buffers from > the guest. What about the race I point out above? It seems to potentially cause a deadlock. > > > > Need to think about it. > > > > Another concern here is that on retries vhost_get_desc_n > > is doing extra work, rescanning the same descriptor > > again and again. Not sure how common this is, might be > > worthwhile to add a TODO to consider this at least. > > I had a printk in there to test the code and with the > retries counter, it happens when we fill the ring (once, > because of the retries checks), and then proceeds as > desired when the guest gives us more buffers. Without the > check, it spews until we hear from the guest. I don't understand whether what you write above means 'yes this happens and is worth optimizing for' or 'this happens very rarely and is not worth optimizing for'. > > > @@ -261,14 +275,33 @@ static void handle_rx(struct vhost_net * > > > len, MSG_DONTWAIT | MSG_TRUNC); > > > /* TODO: Check specific error and bomb out unless EAGAIN? */ > > > if (err < 0) { > > > > I think we need to compare err and datalen and drop packet on mismatch > as well. > > The check err > len won't be needed then. > > OK, but this is the original code, unchanged by my patch. :-) Right. original code does not know the datalen. > > > - vhost_discard_vq_desc(vq); > > > + vhost_discard_desc(vq, headcount); > > > break; > > > } > > > /* TODO: Should check and handle checksum. */ > > > + if (vhost_has_feature(&net->dev, VIRTIO_NET_F_MRG_RXBUF)) { > > > + struct virtio_net_hdr_mrg_rxbuf *vhdr = > > > + (struct virtio_net_hdr_mrg_rxbuf *) > > > + vq->iov[0].iov_base; > > > + /* add num_buffers */ > > > + if (vhost_has_feature(&net->dev, > > > + VHOST_NET_F_VIRTIO_NET_HDR)) > > > + hdr.num_buffers = headcount; > > > > Why is the above necessary? > > This adds mergeable buffer support for the raw case. So my suggestion is to have a copy of the header and then same code will fill in the field for both raw and tap. > > > > > + else if (vq->iov[0].iov_len < sizeof(*vhdr)) { > > > > I think length check is already done when we copy the header. No? > > This sets num_buffers for the tap case (where hdr_size is 0). Ah, right. So let's just check the total length is > sizeof(*vhdr). > We don't copy any headers at all for this case, but we need to > add num_buffers at offset 10 in the user buffer already read by > recvmsg(). > > > > > > + vq_err(vq, "tiny buffers < %d unsupported", > > > + vq->iov[0].iov_len); > > > + vhost_discard_desc(vq, headcount); > > > + break; > > > > Problem here is that recvmsg might modify iov. > > This is why I think we need to use vq->hdr here. > > rcvmsg() can never modify the iovec; Look at af_packet code for an example where it does. The pointer is non-const, it's OK to modify. tap used to modify iovec as well, the fact it doesn't now is a side-effect of reusing same code for recvmsg and aio read. > it's the standard socket call. It's an internal API that is used to implement the call. iovec it gets is a kernel side copy of what user passed in. > To use vq->hdr, we'd have to copy > it in from user space, modify it, and then copy it back > in to user space, on every packet. In the normal tap case, > hdr_size is 0 and we read it directly from the socket to > user space. It is already correct for mergeable buffers, > except for the num_buffers value added here. I don't understand what you write above, sorry. We have iov, all we need is store the first address+length in the hdr field. This should be close to free and does not involve any copies from/to userspace. All the branching and special-casing is possibly more expensive. > > > > > + } else if (put_user(headcount, &vhdr->num_buffers)) { > > > > The above put_user writes out a 32 bit value, right? > > This seems wrong. > > I'll recheck this -- I'll make the type of "headcount" be > the type of "num_buffers" in the MRXB vnet header, if it isn't > already. No, I was confused Sorry about the noise. > > > > How about using > > memcpy_toiovecend(vq->hdr, &headcount, > > offsetof(struct virtio_net_hdr_mrg_rxbuf, num_buffers), > > sizeof headcount); > > > > this way we also do not make any assumptions about layout. > > Because this overwrites the valid vnet header we got from > the tap socket with a local copy that isn't correct. It does? I think that this only writes out 2 bytes at an offset. > For this to > work, we first need to copy_from_user() the vnet header from the > socket into vq->hdr (on every packet). copy_from_user from vnet header to vq->hdr does not seem to make any sense and is not what I suggested. > It doesn't assume anything here-- it requires (and checks) > that the user didn't give us a <12 byte buffer, which I think is > reasonable. > That's about the size of the descriptor for the buffer, > and I'd call that a broken guest. I think it's better not to assume this. virtio spec does mentions layout assumptions as legacy limitations. Guest can post descriptor consisting of 2 buffers: 1. 10 bytes. 2. 2 bytes. and I don't see a reason not to support this unless this makes code simpler. In this case code is more complex. > The cost of supporting those > tiny buffers in the general case would be a copy_from_user() and > copy_to_user() of the vnet_hdr on every packet, which I think is > not worth it (and especially since I don't expect any guest is > ever going to give us a <12 byte buffer in the first place). You keep saying this, but I do not see any need for extra copy_to_user. Just use memcpy_toiovecend instead of put_user. > > > @@ -560,9 +593,14 @@ done: > > > > > > static int vhost_net_set_features(struct vhost_net *n, u64 features) > > > { > > > - size_t hdr_size = features & (1 << VHOST_NET_F_VIRTIO_NET_HDR) ? > > > - sizeof(struct virtio_net_hdr) : 0; > > > + size_t hdr_size = 0; > > > int i; > > > + > > > + if (features & (1 << VHOST_NET_F_VIRTIO_NET_HDR)) { > > > + hdr_size = sizeof(struct virtio_net_hdr); > > > + if (features & (1 << VIRTIO_NET_F_MRG_RXBUF)) > > > + hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf); > > > > My personal style for this would be: > > if (!(features & (1 << VHOST_NET_F_VIRTIO_NET_HDR))) > > hdr_size = 0 > > else if (!(features & (1 << VIRTIO_NET_F_MRG_RXBUF))) > > hdr_size = sizeof(virtio_net_hdr); > > else > > hdr_size = sizeof(struct virtio_net_hdr_mrg_rxbuf); > > > > which results in more symmetry and less nesting. > > OK. > > > > > > mutex_lock(&n->dev.mutex); > > > if ((features & (1 << VHOST_F_LOG_ALL)) && > > > !vhost_log_access_ok(&n->dev)) { > > > diff -ruNp net-next-p0/drivers/vhost/vhost.c > > > net-next-v3/drivers/vhost/vhost.c > > > --- net-next-p0/drivers/vhost/vhost.c 2010-03-22 12:04:38.000000000 > > > -0700 > > > +++ net-next-v3/drivers/vhost/vhost.c 2010-04-06 12:57:51.000000000 > > > -0700 > > > @@ -856,6 +856,47 @@ static unsigned get_indirect(struct vhos > > > return 0; > > > } > > > > > > +/* This is a multi-buffer version of vhost_get_vq_desc > > > + * @vq - the relevant virtqueue > > > + * datalen - data length we'll be reading > > > + * @iovcount - returned count of io vectors we fill > > > + * @log - vhost log > > > + * @log_num - log offset > > > + * returns number of buffer heads allocated, 0 on error > > > > This is unusual. Let's return a negative error code on error. > > This is like malloc -- "0" is never a valid return value, and > we don't have actual error values to return and handle; they generate > vq_err() messages within the code itself. In all cases, > it is the count of buffers we allocated (which is 0 when we fail to > allocate). So, I don't agree with this, but can do it if you insist. malloc returns NULL, not zero. standard error handling approaches are: - NULL on error - ERR_PTR on error - >= 0 for success, -errno for error - true for success false for error And of course - some custom strategy so you need to read documentation to figure out how to use a function which is where this one gets classified. So I am not saying 0 won't work, just that standard stuff is better. > > > > > + */ > > > +int vhost_get_desc_n(struct vhost_virtqueue *vq, struct > vring_used_elem > > > *heads, > > > + int datalen, int *iovcount, struct vhost_log *log, > > > + unsigned int *log_num) > > > +{ > > > + int out, in; > > > + int seg = 0; /* iov index */ > > > + int hc = 0; /* head count */ > > > + > > > + while (datalen > 0) { > > > + if (hc >= VHOST_NET_MAX_SG) > > > + goto err; > > > + heads[hc].id = vhost_get_desc(vq->dev, vq, vq->iov+seg, > > > + ARRAY_SIZE(vq->iov)-seg, &out, > > > + &in, log, log_num); > > > + if (heads[hc].id == vq->num) > > > + goto err; > > > + if (out || in <= 0) { > > > + vq_err(vq, "unexpected descriptor format for RX: " > > > + "out %d, in %d\n", out, in); > > > + goto err; > > > + } > > > + heads[hc].len = iov_length(vq->iov+seg, in); > > > + datalen -= heads[hc].len; > > > > This signed/unsigned mix makes me nervuous. > > Let's make datalen unsigned, add unsigned total_len, and > > while (datalen < total_len). > > > > > + hc++; > > > + seg += in; > > > + } > > > + *iovcount = seg; > > > + return hc; > > > +err: > > > + vhost_discard_desc(vq, hc); > > > + return 0; > > > +} > > > + > > > /* This looks in the virtqueue and for the first available buffer, > and > > > converts > > > * it to an iovec for convenient access. Since descriptors consist > of > > > some > > > * number of output then some number of input descriptors, it's > > > actually two > > > @@ -863,7 +904,7 @@ static unsigned get_indirect(struct vhos > > > * > > > * This function returns the descriptor number found, or vq->num > (which > > > * is never a valid descriptor number) if none was found. */ > > > -unsigned vhost_get_vq_desc(struct vhost_dev *dev, struct > > > vhost_virtqueue *vq, > > > +unsigned vhost_get_desc(struct vhost_dev *dev, struct vhost_virtqueue > > > *vq, > > > struct iovec iov[], unsigned int iov_size, > > > unsigned int *out_num, unsigned int *in_num, > > > struct vhost_log *log, unsigned int *log_num) > > > @@ -981,31 +1022,42 @@ unsigned vhost_get_vq_desc(struct vhost_ > > > } > > > > > > /* Reverse the effect of vhost_get_vq_desc. Useful for error > handling. > > > */ > > > -void vhost_discard_vq_desc(struct vhost_virtqueue *vq) > > > +void vhost_discard_desc(struct vhost_virtqueue *vq, int n) > > > { > > > - vq->last_avail_idx--; > > > + vq->last_avail_idx -= n; > > > } > > > > > > /* After we've used one of their buffers, we tell them about it. > We'll > > > then > > > * want to notify the guest, using eventfd. */ > > > -int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int > > > len) > > > +int vhost_add_used(struct vhost_virtqueue *vq, struct vring_used_elem > > > *heads, > > > + int count) > > > > I think we are better off with vhost_add_used and vhost_add_used_n: > > the version with _n has a lot of extra complexity, and tx always > > adds them 1 by one. > > The external code only calls vhost_add_used_and_signal(), so I > split that. I can add a single variant of vhost_add_used() too, but > I was trying to avoid adding duplicate code and external interface. > I don't actually agree with splitting these; isn't it better to have > them going through the same code path whether it's single or multiple? > A bug in one would show up as an intermittent failure, and a change > to one requires changing both. It's a small function so I am not really worried about maintainance. > I don't think the multiple should just > do the single repeatedly, either, since the multiple variant now can > do the work in one or two copy_to_user()'s, without a loop; Yes. > so splitting > these just makes a special case for single to carry in code maintenance > with no benefit, I think. I donnu, I have a feeling all these loops and memory accesses on data path won't be free, and for tx this just adds overhead. > > > > > { > > > struct vring_used_elem *used; > > > + int start, n; > > > + > > > + if (count <= 0) > > > + return -EINVAL; > > > > > > - /* The virtqueue contains a ring of used buffers. Get a pointer > to > > > the > > > - * next entry in that used ring. */ > > > - used = &vq->used->ring[vq->last_used_idx % vq->num]; > > > - if (put_user(head, &used->id)) { > > > - vq_err(vq, "Failed to write used id"); > > > + start = vq->last_used_idx % vq->num; > > > + if (vq->num - start < count) > > > + n = vq->num - start; > > > + else > > > + n = count; > > > > use min? > > Sure. > > > > > > + used = vq->used->ring + start; > > > + if (copy_to_user(used, heads, sizeof(heads[0])*n)) { > > > + vq_err(vq, "Failed to write used"); > > > return -EFAULT; > > > } > > > - if (put_user(len, &used->len)) { > > > - vq_err(vq, "Failed to write used len"); > > > - return -EFAULT; > > > + if (n < count) { /* wrapped the ring */ > > > + used = vq->used->ring; > > > + if (copy_to_user(used, heads+n, sizeof(heads[0])*(count-n))) { > > > + vq_err(vq, "Failed to write used"); > > > + return -EFAULT; > > > + } > > > } > > > /* Make sure buffer is written before we update index. */ > > > smp_wmb(); > > > - if (put_user(vq->last_used_idx + 1, &vq->used->idx)) { > > > + if (put_user(vq->last_used_idx+count, &vq->used->idx)) { > > > > I am a bit confused ... will this write a 32 or 16 bit value? > > count is 32 bit ... Maybe we are better off with > > u16 idx = vq->last_used_idx + count > > put_user(idx, &vq->used->idx) > > vq->last_used_idx = idx > > How about a cast?: > > if (put_user((u16)(vq->last_used_idx+count), &vq->used->idx)) { > > > +-DLS No, I was confused. Code is ok (just whitespace damaged). -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html