On Tue, Mar 01, 2011 at 09:32:56PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 02/28/2011 03:13:20 PM:
>
> Thank you once again for your feedback on both these patches.
> I will send the qemu patch tomorrow. I will also send the next
> version incorporating these suggestions once we finalize some
> minor points.
>
> > Overall looks good.
> > The numtxqs meaning the number of rx queues needs some cleanup.
> > init/cleanup routines need more symmetry.
> > Error handling on setup also seems slightly buggy or at least
> > asymmetrical.
> > Finally, this will use up a large number of MSI vectors,
> > while TX interrupts mostly stay unused.
> >
> > Some comments below.
> >
> > > +/* Maximum number of individual RX/TX queues supported */
> > > +#define VIRTIO_MAX_TXQS		16
> > > +
> >
> > This also does not seem to belong in the header.
>
> Both virtio-net and vhost need some check to make sure very
> high values are not passed by userspace. Is this not required?

Whatever we stick in the header is effectively part of the host/guest
interface. Are you sure we'll never want more than 16 VQs?
This value does not seem that high.

> > > +#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queue */
> >
> > VIRTIO_NET_F_MULTIQUEUE ?
>
> Yes, that's a better name.
>
> > > @@ -34,6 +38,8 @@ struct virtio_net_config {
> > >  	__u8 mac[6];
> > >  	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
> > >  	__u16 status;
> > > +	/* number of RX/TX queues */
> > > +	__u16 numtxqs;
> >
> > The interface here is a bit ugly:
> > - this is really both # of tx and rx queues but called numtxqs
> > - there's a hardcoded max value
> > - 0 is assumed to be same as 1
> > - assumptions above are undocumented.
> >
> > One way to address this could be num_queue_pairs, and something like
> > 	/* The actual number of TX and RX queues is num_queue_pairs + 1 each. */
> > 	__u16 num_queue_pairs;
> > (and tweak code to match).
> >
> > Alternatively, have separate registers for the number of tx and rx
> > queues.
>
> OK, so virtio_net_config has num_queue_pairs, and this gets converted to
> numtxqs in virtnet_info?

Or put num_queue_pairs in virtnet_info too.

> > > +struct virtnet_info {
> > > +	struct send_queue **sq;
> > > +	struct receive_queue **rq;
> > > +
> > > +	/* read-mostly variables */
> > > +	int numtxqs ____cacheline_aligned_in_smp;
> >
> > Why do you think this alignment is a win?
>
> Actually this code was from the earlier patchset (MQ TX only) where
> the layout was different. Now rq and sq are allocated as follows:
> 	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
> 	for (i = 0; i < numtxqs; i++) {
> 		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
> Since the two pointers become read-only during use, there is no cache
> line dirtying. I will remove this directive.
>
> > > +/*
> > > + * Note for 'qnum' below:
> > > + * first 'numtxqs' vqs are RX, next 'numtxqs' vqs are TX.
> > > + */
> >
> > Another option to consider is to have them RX,TX,RX,TX:
> > this way vq->queue_index / 2 gives you the
> > queue pair number, no need to read numtxqs. On the other hand, it makes
> > the #RX==#TX assumption even more entrenched.
>
> OK. I was following how many drivers allocate RX and TX's
> together - e.g. ixgbe_adapter has tx_ring and rx_ring arrays; bnx2
> has rx_buf_ring and tx_buf_ring arrays, etc.

That's fine. I am only talking about the VQ numbers.
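Just to spell out what I mean by the interleaved numbering - an untested
sketch, helper names made up here, and assuming the vq exposes its
queue_index as used above:

/* RX,TX,RX,TX layout: the pair number falls out of the vq index,
 * so nothing needs to read numtxqs to do the mapping.
 */
#define rxq2vq(qpair)	((qpair) * 2)		/* RX vq index for a pair */
#define txq2vq(qpair)	((qpair) * 2 + 1)	/* TX vq index for a pair */
#define vq2qpair(vq)	((vq)->queue_index / 2)	/* pair number for a vq */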
> Also, vhost has some
> code that processes tx first before rx (e.g.
> vhost_net_stop/flush),

No idea why I did it this way. I don't think it matters.

> so this approach seemed helpful.
> I am OK either way, what do you
> suggest?

We get less code generated but also less flexibility. I am not sure,
I'll play around with the code; for now let's keep it as is.

> > > +	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
> > > +					 (const char **)names);
> > > +	if (err)
> > > +		goto free_params;
> > > +
> >
> > This would use up quite a lot of vectors. However,
> > tx interrupt is, in fact, slow path. So, assuming we don't have
> > enough vectors to use per vq, I think it's a good idea to
> > support reducing MSI vector usage by mapping all TX VQs to the same
> > vector and separate vectors for RX.
> > The hypervisor actually allows this, but we don't have an API at the
> > virtio level to pass that info to virtio pci ATM.
> > Any idea what a good API to use would be?
>
> Yes, it is a waste to have these vectors for tx ints. I initially
> thought of adding a flag to virtio_device to pass to vp_find_vqs,
> but it won't work, so a new API is needed. I can work with you on
> this in the background if you like.

OK. For starters, how about we change find_vqs to get a structure? Then
we can easily add flags that tell us that some interrupts are rare.

diff --git a/include/linux/virtio_config.h b/include/linux/virtio_config.h
index 800617b..2b765bb 100644
--- a/include/linux/virtio_config.h
+++ b/include/linux/virtio_config.h
@@ -78,7 +78,14 @@
  * This gives the final feature bits for the device: it can change
  * the dev->feature bits if it wants.
  */
+
 typedef void vq_callback_t(struct virtqueue *);
+struct virtqueue_info {
+	struct virtqueue *vq;
+	vq_callback_t *callback;
+	const char *name;
+};
+
 struct virtio_config_ops {
 	void (*get)(struct virtio_device *vdev, unsigned offset,
 		    void *buf, unsigned len);
@@ -88,9 +95,7 @@ struct virtio_config_ops {
 	void (*set_status)(struct virtio_device *vdev, u8 status);
 	void (*reset)(struct virtio_device *vdev);
 	int (*find_vqs)(struct virtio_device *, unsigned nvqs,
-			struct virtqueue *vqs[],
-			vq_callback_t *callbacks[],
-			const char *names[]);
+			struct virtqueue_info vq_info[]);
 	void (*del_vqs)(struct virtio_device *);
 	u32 (*get_features)(struct virtio_device *vdev);
 	void (*finalize_features)(struct virtio_device *vdev);
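With that structure, virtio-net's setup could then look roughly like the
following - untested, the function name is made up, the callback names
are the ones from the existing single-queue driver, and the "rare
interrupt" flag is not shown since it does not exist yet:

static int virtnet_find_vqs(struct virtnet_info *vi, int numtxqs)
{
	struct virtqueue_info *info;
	int total = numtxqs * 2 + 1;	/* RX vqs + TX vqs + control vq */
	int i, err;

	info = kcalloc(total, sizeof(*info), GFP_KERNEL);
	if (!info)
		return -ENOMEM;

	for (i = 0; i < numtxqs; i++) {
		info[i].callback = skb_recv_done;		/* RX */
		info[i].name = "input";	/* per-queue names would be nicer */
		info[i + numtxqs].callback = skb_xmit_done;	/* TX */
		info[i + numtxqs].name = "output";
		/* a flags field added to virtqueue_info later could mark
		 * the TX entries as "interrupt is rare" so virtio_pci can
		 * share a single MSI-X vector among all TX vqs */
	}
	info[total - 1].callback = NULL;	/* control vq */
	info[total - 1].name = "control";

	err = vi->vdev->config->find_vqs(vi->vdev, total, info);
	if (!err) {
		for (i = 0; i < numtxqs; i++) {
			vi->rq[i]->rvq = info[i].vq;
			vi->sq[i]->svq = info[i + numtxqs].vq;
		}
		vi->cvq = info[total - 1].vq;
	}
	kfree(info);
	return err;
}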
> > > +	for (i = 0; i < numtxqs; i++) {
> > > +		vi->rq[i]->rvq = vqs[i];
> > > +		vi->sq[i]->svq = vqs[i + numtxqs];
> >
> > This logic is spread all over. We need some kind of macro to
> > get the queue number from the vq number and back.
>
> Will add this.
>
> > > +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> > > +		vi->cvq = vqs[i + numtxqs];
> > > +
> > > +		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> > > +			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
> >
> > This bit does not seem to belong in initialize_vqs.
>
> I will move it back to probe.
>
> > > +	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
> > > +				offsetof(struct virtio_net_config, numtxqs),
> > > +				&numtxqs);
> > > +
> > > +	/* We need atleast one txq */
> > > +	if (err || !numtxqs)
> > > +		numtxqs = 1;
> >
> > err is okay, but should we just fail on illegal values?
> > Or change the semantics:
> > 	n = 0;
> > 	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
> > 				offsetof(struct virtio_net_config, numtxqs),
> > 				&n);
> > 	numtxq = n + 1;
>
> Will this be better:
> 	int num_queue_pairs = 2;
> 	int numtxqs;
>
> 	err = virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
> 				offsetof(struct virtio_net_config,
> 					 num_queue_pairs), &num_queue_pairs);
> 	<ignore error, if any>
> 	numtxqs = num_queue_pairs / 2;
>
> > > +	if (numtxqs > VIRTIO_MAX_TXQS)
> > > +		return -EINVAL;
> >
> > Do we strictly need this?
> > I think we should just use whatever hardware has,
> > or alternatively somehow ignore the unused queues
> > (easy for tx, not sure about rx).
>
> vq's are matched between qemu, virtio-net and vhost. Isn't some check
> required that userspace has not passed a bad value?

For virtio, I'm not too concerned: qemu can already easily crash the
guest :)
For vhost yes, but I'm concerned that even with 16 VQs we are drinking
a lot of resources already. I would be happier if we had a file
descriptor per VQ pair in some way. Then the amount of memory userspace
can use up is limited by the # of file descriptors.

> > > +		if (vi->rq[i]->num == 0) {
> > > +			err = -ENOMEM;
> > > +			goto free_recv_bufs;
> > > +		}
> > > 	}
>
> > If this fails for vq > 0, you have to detach bufs.
>
> Right, will fix this.
>
> > >  free_vqs:
> > > +	for (i = 0; i < numtxqs; i++)
> > > +		cancel_delayed_work_sync(&vi->rq[i]->refill);
> > >  	vdev->config->del_vqs(vdev);
> > > -free:
> > > +	free_rq_sq(vi);
> >
> > If we have a wrapper to init all vqs, pls add a wrapper to clean up
> > all vqs as well.
>
> Will add that.
>
> > > +	for (i = 0; i < vi->numtxqs; i++) {
> > > +		struct virtqueue *rvq = vi->rq[i]->rvq;
> > > +
> > > +		while (1) {
> > > +			buf = virtqueue_detach_unused_buf(rvq);
> > > +			if (!buf)
> > > +				break;
> > > +			if (vi->mergeable_rx_bufs || vi->big_packets)
> > > +				give_pages(vi->rq[i], buf);
> > > +			else
> > > +				dev_kfree_skb(buf);
> > > +			--vi->rq[i]->num;
> > > +		}
> > > +		BUG_ON(vi->rq[i]->num != 0);
> > >  	}
> > > -	BUG_ON(vi->num != 0);
> > > +
> > > +	free_rq_sq(vi);
> >
> > This looks wrong here. This function should detach
> > and free all bufs, not internal malloc stuff.
>
> That is being done by free_receive_buf after free_unused_bufs()
> returns. I hope this addresses your point.
>
> > I think we should have free_unused_bufs that handles
> > a single queue, and call it in a loop.
>
> OK, so define free_unused_bufs() as:
>
> static void free_unused_bufs(struct virtnet_info *vi, struct virtqueue *svq,
> 			     struct virtqueue *rvq)
> {
> 	/* Use svq and rvq with the remaining code unchanged */
> }
>
> Thanks,
>
> - KK

Not sure I understand. I am just suggesting adding symmetrical
functions like init/cleanup and alloc/free instead of adding stuff to
random functions that just happen to be called at the right time.
There is a rough sketch of what I mean below, after my signature.

-- 
MST
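P.S. An untested sketch of the symmetric, per-queue cleanup I have in
mind - field names follow the patch in this thread, and the TX half is
my guess at what the multiqueue driver would need:

/* Untested sketch: drop unused buffers of a single RX queue. */
static void free_unused_rx_bufs(struct virtnet_info *vi,
				struct receive_queue *rq)
{
	void *buf;

	while ((buf = virtqueue_detach_unused_buf(rq->rvq)) != NULL) {
		if (vi->mergeable_rx_bufs || vi->big_packets)
			give_pages(rq, buf);
		else
			dev_kfree_skb(buf);
		--rq->num;
	}
	BUG_ON(rq->num != 0);
}

/* Untested sketch: same for a single TX queue - every buffer is an skb. */
static void free_unused_tx_bufs(struct send_queue *sq)
{
	void *buf;

	while ((buf = virtqueue_detach_unused_buf(sq->svq)) != NULL)
		dev_kfree_skb(buf);
}

static void free_unused_bufs(struct virtnet_info *vi)
{
	int i;

	for (i = 0; i < vi->numtxqs; i++) {
		free_unused_rx_bufs(vi, vi->rq[i]);
		free_unused_tx_bufs(vi->sq[i]);
	}
	/* freeing vi->rq/vi->sq themselves belongs in a separate,
	 * symmetric counterpart of the allocation function */
}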