Re: vhost questions.

"Michael S. Tsirkin" <mst@xxxxxxxxxx> writes:
> On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
>> OK, I've been trying to read the vhost and vhost net code, and I'm
>> struggling with basic questions.
>> 
>> 1) Why do we allow userspace to change an already-set-up vhost device?
>>    Why not have:
>>         open()
>>         ioctl(VHOST_GET_FEATURES)
>>         ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
>>         ioctl(VHOST_NET_SETUP)
>>         ...
>>         close()
>> 
>>    You're not allowed to call things twice: to reset, you close and
>>    reopen.  That would save a lot of code which I'm not sure is correct
>>    anyway.
>
> This won't work for the usecases we have in qemu.
> The way things are setup currently, qemu typically does not have
> the rights to open any char devices, instead it is passed an fd.

Ah, because creating a vhost fd is privileged?  Why?  Because it creates
a kernel thread?  Why?

Using a model where a userspace process/thread manages the ring (even if
it spends 99.99% of its time in the kernel) would avoid this.  This is
how KVM itself works; it's no longer privileged.
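
To make the flow I listed above concrete, the userspace side would be
something like this (a sketch only: error handling is omitted, and
VHOST_NET_SETUP above is a placeholder name - the existing one-shot
backend ioctl is VHOST_NET_SET_BACKEND):

        #include <fcntl.h>
        #include <sys/ioctl.h>
        #include <linux/vhost.h>

        static int vhost_net_start(int kick_fd, int call_fd, int tap_fd)
        {
                int fd = open("/dev/vhost-net", O_RDWR);
                __u64 features;
                struct vhost_vring_state num = { .index = 0, .num = 256 };
                struct vhost_vring_file kick = { .index = 0, .fd = kick_fd };
                struct vhost_vring_file call = { .index = 0, .fd = call_fd };
                struct vhost_vring_file back = { .index = 0, .fd = tap_fd };

                ioctl(fd, VHOST_SET_OWNER, NULL); /* required first today */
                ioctl(fd, VHOST_GET_FEATURES, &features);
                ioctl(fd, VHOST_SET_VRING_NUM, &num);
                ioctl(fd, VHOST_SET_VRING_KICK, &kick);
                ioctl(fd, VHOST_SET_VRING_CALL, &call);
                /* ... VHOST_SET_MEM_TABLE, VHOST_SET_VRING_ADDR ... */
                ioctl(fd, VHOST_NET_SET_BACKEND, &back);
                return fd;              /* to reset: close() and reopen */
        }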

>> 2) Why do we implement write logging?  Shouldn't qemu just switch to
>>    emulation if it wants to track writes?
>
> We could do it this way.  Originally vhost didn't do logging,
> but Anthony felt it's cleaner this way.  We argued this way and that; in
> the end I agreed it's better to always process rings in vhost and leave
> userspace alone.  For example, this allows a simpler userspace that does
> not implement tx/rx ring processing at all.

We don't support things which don't exist.

A simple implementation which supports live migration would need
enough understanding to save and restore the net state.  With a tun
device (which handles all the header stuff), supporting networking is
*trivial*: see lguest (though it needs MRG_RXBUF support).  It's about
100 lines.
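
For concreteness, the rx side of such an implementation is basically just
this (get_rx_bufs()/add_used()/trigger_irq() are stand-ins for the usual
vring helpers, and the tap fd is assumed to be opened with IFF_VNET_HDR
so the header is handled for us):

        #include <sys/uio.h>

        /* Stand-in declarations; the real names live in your vring code. */
        struct vring_state;
        unsigned get_rx_bufs(struct vring_state *vr, struct iovec *iov,
                             unsigned *head);
        void add_used(struct vring_state *vr, unsigned head, unsigned len);
        void trigger_irq(struct vring_state *vr);

        static void handle_rx(int tap_fd, struct vring_state *vr)
        {
                struct iovec iov[64];   /* plenty for one MRG_RXBUF chain */
                unsigned head, n;
                ssize_t len;

                /* Hand each free descriptor chain to the tap device to
                 * fill; stop when it has no more packets for us. */
                while ((n = get_rx_bufs(vr, iov, &head)) != 0) {
                        len = readv(tap_fd, iov, n);
                        if (len < 0) {
                                /* EAGAIN: a real version gives the chain
                                 * back to the ring and waits for a wakeup */
                                break;
                        }
                        add_used(vr, head, len);
                }
                trigger_irq(vr);
        }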

> We used to use a workqueue, but Tejun Heo required this switch.
> Basically we want everything to run from one thread so we
> can apply cgroup limits to that thread.  A workqueue does not
> guarantee this - one thread is an implementation detail.

Right.

>> vhost/net.c questions:
>> 
>> 1) Why do we wake the net tx worker every time the reference count of
>>    the number of outstanding DMAs is a multiple of 16?
>
> Waking every time turns out to be wasteful. Waking only when nothing is
> outstanding turns out to be slow. The comment says it all really:
>
> +        * We also trigger polling periodically after each 16 packets
> +        * (the value 16 here is more or less arbitrary, it's tuned to trigger
> +        * less than 10% of times).
>
> If you have any suggestions on how to improve this,

But this code does *not* wake every 16 packets.

Anyway, what happens when we wake it?  I'm assuming it's the net tx
worker, which will just return the used bufs to the guest?  Wouldn't it
be better to do that directly?  Or, if we can't do that here, only wake
the tx worker when it actually needs to wake the guest.
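
For reference, the check we're talking about is roughly this (quoting from
memory, so the details may be off):

        /* In vhost_zerocopy_callback(): cnt is the refcount (a kref IIRC)
         * of outstanding DMAs, so this fires when that count happens to be
         * a multiple of 16, not after every 16th packet. */
        int cnt = atomic_read(&ubufs->refcount);

        if (cnt <= 2 || !(cnt % 16))
                vhost_poll_queue(&vq->poll);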

>> 3) Why don't we simply use a slab allocator to allocate a structure for
>>    outstanding xmit dmas?  Wouldn't that be far simpler?  Would it be
>>    slower?
>
> Which structure for outstanding xmit dmas? ubufs?  We could use a cache
> maybe; plain slab is usually pretty slow.
>
> I think what you are really worried about is the upend/done index thing?
> It's not really there because of the array.
> It's because we try to order the completion of zero copy buffers
> in the same order they are submitted. This was because I was worried about
> triggering fragmentation of the descriptor ring (Avi kept raising this
> concern).

I've never seen any evidence that caching is slowing us down.  Many
theories, but mainly our slowdowns are crap in qemu.

And for xmit rings, it's *REALLY DUMB* because there can be no
fragmentation with indirect buffers!

If fragmentation becomes a problem, it's easy to use a bitmap-based
allocator and avoid fragmentation.

But most importantly, it's *not* the host's problem to fix!  Just keep
the damn latency low.
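
i.e. something along these lines instead of the upend/done ordering (names
invented here, not a patch; this would stay under the vq mutex like the
existing code):

        struct zcopy_pend {
                DECLARE_BITMAP(busy, VHOST_MAX_PEND);
                struct ubuf_info info[VHOST_MAX_PEND];
        };

        static struct ubuf_info *zcopy_get(struct zcopy_pend *p)
        {
                int i = find_first_zero_bit(p->busy, VHOST_MAX_PEND);

                if (i >= VHOST_MAX_PEND)
                        return NULL;    /* everything outstanding */
                __set_bit(i, p->busy);
                return &p->info[i];
        }

        static void zcopy_put(struct zcopy_pend *p, struct ubuf_info *ubuf)
        {
                /* completions can come back in any order */
                __clear_bit(ubuf - p->info, p->busy);
        }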

>> 4) Why are the following fields in struct vhost_virtqueue, not struct
>>    vhost_net?
>> 
>> 	/* hdr is used to store the virtio header.
>> 	 * Since each iovec has >= 1 byte length, we never need more than
>> 	 * header length entries to store the header. */
>> 	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
>> 	size_t vhost_hlen;
>> 	size_t sock_hlen;
>> 	struct vring_used_elem *heads;
>> 
>>         ...
>
> Because these are used for both tx and rx.  While we handle both from
> the same thread at the moment (so we could reuse the same fields), there
> are some patches by Anthony that process tx and rx in separate threads,
> and sometimes it's a win.  So I don't want to close the door on that.

But they're *net* specific: They don't belong in the vhost structure.

And from my reading, hdr[] is written to on the xmit side, but it's not
used.  sock_hlen is entirely unused on xmit. heads[] is abused, and
should be its own structure.

>> 	/* vhost zerocopy support fields below: */
>> 	/* last used idx for outstanding DMA zerocopy buffers */
>> 	int upend_idx;
>> 	/* first used idx for DMA done zerocopy buffers */
>> 	int done_idx;
>> 	/* an array of userspace buffers info */
>> 	struct ubuf_info *ubuf_info;
>> 	/* Reference counting for outstanding ubufs.
>> 	 * Protected by vq mutex. Writers must also take device mutex. */
>> 	struct vhost_ubuf_ref *ubufs;
>
> These are only used from tx.  I was trying to keep open the option of
> multiple tx rings in one vhost and the option of rx zero copy, but it's
> not a must.

Again, these are all very net specific.
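
Concretely, I'd pull all of the above out of struct vhost_virtqueue into
something private to net.c, along these lines (just a sketch of the shape;
container_of() gets you back from the vq pointer the core hands around):

        struct vhost_net_virtqueue {
                struct vhost_virtqueue vq;
                /* header handling, used by both rx and tx */
                size_t vhost_hlen;
                size_t sock_hlen;
                struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
                struct vring_used_elem *heads;
                /* tx-only zerocopy state */
                int upend_idx;
                int done_idx;
                struct ubuf_info *ubuf_info;
                struct vhost_ubuf_ref *ubufs;
        };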

>> 5) Why does vhost_net use the sendmsg interface for tun (and macvlan?)?
>>    It's not usable from userspace (it looks like if you can sendmsg to a
>>    tun device you can make it call an arbitrary pointer though).
>
> Yes, we could switch to some other interface, but only if we
> deprecate the packet socket backend and remove it at some point.
>
>>    It would be better for us to create the skb, and have a special
>>    interface to inject it directly.  This way we can handle the
>>    UIO_MAXIOV limit ourselves (eg. by not doing zero copy on such stupid
>>    cases).
>
> Basically because userspace already configures the backend with existing
> tun ioctls. This way we don't need to introduce new interfaces for this,
> just virtio stuff for vhost net.  It's also good that we share code with
> tun; often both vhost and virtio processing can gain from the same
> optimizations and new features.  This has happened in the past already.

Sorry, I wasn't clear here.  I mean instead of doing:

	err = sock->ops->recvmsg(NULL, sock, &msg,
				 sock_len, MSG_DONTWAIT | MSG_TRUNC);

We do something like:

        err = tun_inject_skb(sock, skb);

Because the zerocopy_sg_from_iovec() in tun.c already has a comment
about putting it in the network core... in fact, it's cut & pasted into
macvtap :(

The msg thing for the callback is a nasty hack, AFAICT...

> Basically not many people care about virt. The less virt specific
> code we have the better.  I lost count of the times I had this dialog
> when I needed some help:
> - we see this problem XYZ in host net stack
> - oh it's vhost I don't know what it does, sorry
> - actually it only builds up iovecs all networking is done by tun that
>   everyone uses
> - ok then let me take a look
>
>> 6) By calling tun_get_socket() and macvtap_get_socket() we are forcing
>>    those modules to load.  We should use symbol_get() to avoid this.
>
> Good idea.

I can create this patch.
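
Something like this at backend-setup time, roughly (a sketch; the
interesting question is where to drop the module reference again, since
the socket outlives this call):

        struct socket *(*get_sock)(struct file *);
        struct socket *sock = ERR_PTR(-ENOTSOCK);

        get_sock = symbol_get(tun_get_socket);
        if (get_sock) {
                sock = get_sock(file);
                symbol_put(tun_get_socket);
        }
        /* ... then try macvtap_get_socket() the same way ... */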

>> In short, this code is a mess.  Assuming we can't just break the API,
>> we've got a long process ahead of us to get it into shape :(
>> 
>> Thanks,
>> Rusty.
>
> The zerocopy path is the messiest, I think.  We should really consider
> removing the ordering of zero copy buffers.  The rest I don't consider
> so messy, and it should be easy to clean up zerocopy without
> breaking the API.

I've got some cleanup patches, but I'll keep going and send them to you
for review...

Thanks,
Rusty
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

