Re: vhost questions.

"Michael S. Tsirkin" <mst@xxxxxxxxxx> writes:
> On Thu, Mar 14, 2013 at 05:08:22PM +1030, Rusty Russell wrote:
>> "Michael S. Tsirkin" <mst@xxxxxxxxxx> writes:
>> > On Wed, Mar 13, 2013 at 05:19:51PM +1030, Rusty Russell wrote:
>> >> OK, I've been trying to read the vhost and vhost net code, and I'm
>> >> struggling with basic questions.
>> >> 
>> >> 1) Why do we allow userspace to change an already-set-up vhost device?
>> >>    Why not have:
>> >>         open()
>> >>         ioctl(VHOST_GET_FEATURES)
>> >>         ioctl(VHOST_SET_VRING) x n (sets num, addresses, kick/call/err fds)
>> >>         ioctl(VHOST_NET_SETUP)
>> >>         ...
>> >>         close()
>> >> 
>> >>    You're not allowed to call things twice: to reset, you close and
>> >>    reopen.  That would save a lot of code which I'm not sure is correct
>> >>    anyway.
>> >
>> > This won't work for the use cases we have in qemu.
>> > The way things are set up currently, qemu typically does not have
>> > the rights to open any char devices; instead it is passed an fd.
>> 
>> Ah, because creating a vhost fd is privileged?  Why?  Because it creates
>> a kernel thread?  Why?
>> 
>> Using a model where a userspace process/thread manages the ring (even if
>> it spends 99.99% of its time in the kernel) would avoid this.  This is
>> how KVM itself works; it's no longer privileged.
>
> KVM kind of has to have a thread per VCPU, no other options.

Not at all!  It could spin off a helper kernel thread to do the work and
talk to it via some IPC.  But there's a damn good reason that it didn't,
and neither did lguest.

That same reasoning applies here, too.  The fact that you create a
kernel thread and then give it the same context, cgroups, etc. should be
the clue!

> Here I don't think we know what the best threading model is.
> We used to have a per-CPU thread, switched to a thread per vq *pair*, and
> there are patches to go back to per-CPU threads; they seem to improve
> scalability, though there are still some open issues wrt cgroups.
> Using a kernel thread means we can control this without touching userspace.

But kernelspace is the wrong place to make these decisions.

Let's try to imagine what it'd be like if we didn't do it this way, say
a /dev/vhost-net2 device.  You use an ioctl to set up the memory
mapping, the (tun) fd, and the location of an xmit ring and a recv ring
(you specify one or both... we could have separate xmit and recv
devices, but that seems a bit overkill).

Now:
1) Simplest case: you want a dumb implementation with no extra threads
   in your code (sketched below).  Using select(), if the tun fd is
   readable and the vhost-net2 fd is writable, call ioctl(VHOST_NET_RECV),
   which processes as many packets as it can.  If the tun fd is writable
   and the vhost-net2 fd is readable, call ioctl(VHOST_NET_XMIT).

2) You want an I/O thread in your code?  That's easy, too.  You might
   skip the writable fd checks, and just let the ioctl() block.

3) Want separate I/O threads?  Separate /dev/vhost-net2 fds: one only
   sets the xmit ring, the other only sets the recv ring.  They just
   call ioctl() over and over.

4) Multiqueue comes along?  It only makes sense with separate threads
   per queue, so open /dev/vhost-net2 and tunfd once for each thread.
   Do your per-cpu binding here.

Now, there may be issues if we actually coded this, but it *seems* sane.
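
To make case 1 concrete, the dumb loop would look something like this
(nothing here exists, of course: /dev/vhost-net2 and the
VHOST_NET_RECV/VHOST_NET_XMIT ioctls are just the interface I'm
imagining above, so the ioctl numbers are made up):

#include <linux/ioctl.h>
#include <sys/ioctl.h>
#include <sys/select.h>

/* Made-up ioctls for the imaginary /dev/vhost-net2 interface. */
#define VHOST_NET_RECV	_IO(0xAF, 0x40)
#define VHOST_NET_XMIT	_IO(0xAF, 0x41)

static void serve(int tunfd, int vhostfd)
{
	int nfds = (tunfd > vhostfd ? tunfd : vhostfd) + 1;

	for (;;) {
		fd_set rfds, wfds;

		FD_ZERO(&rfds);
		FD_ZERO(&wfds);
		FD_SET(tunfd, &rfds);	/* packets arriving for the guest */
		FD_SET(vhostfd, &rfds);	/* guest queued xmit buffers */
		FD_SET(tunfd, &wfds);	/* tun can take more packets */
		FD_SET(vhostfd, &wfds);	/* recv ring has free buffers */

		if (select(nfds, &rfds, &wfds, NULL, NULL) < 0)
			break;

		/* tun readable + recv ring writable: pull packets in. */
		if (FD_ISSET(tunfd, &rfds) && FD_ISSET(vhostfd, &wfds))
			ioctl(vhostfd, VHOST_NET_RECV);

		/* xmit ring readable + tun writable: push packets out. */
		if (FD_ISSET(vhostfd, &rfds) && FD_ISSET(tunfd, &wfds))
			ioctl(vhostfd, VHOST_NET_XMIT);
	}
}

Case 2 is the same loop with the wfds checks dropped and the ioctls
blocking; cases 3 and 4 are just this loop run once per fd/thread.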

>> A simple implementation which supports live migration would have to have
>> enough understanding to save and restore the net state.  With a tun
>> device (which handles all the header stuff), supporting networking is
>> *trivial*: see lguest (though it needs MRG_RXBUF support).  It's about
>> 100 lines.
>
> You assume networking is stopped while we migrate but this doesn't work
> as migration might take a long time.  So you have to have the full
> device, processing rings and sending/receiving packets all in userspace.

No I don't: 100 lines of code *is* a full networking device.
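
For concreteness, the tx half of such a device is basically the loop
below (a sketch only: the ring is assumed to be already mapped via
vring_init(), the tap fd opened with IFF_VNET_HDR so the virtio_net_hdr
travels with the packet, and guest notification, barriers and error
handling are mostly hand-waved).  The rx side with MRG_RXBUF is the
fiddlier half, but it has the same shape.

#include <linux/virtio_ring.h>
#include <stdint.h>
#include <sys/uio.h>

struct txq {
	struct vring ring;	/* vring_init()ed over guest memory */
	uint16_t last_avail;	/* next avail entry we haven't consumed */
	char *guest_base;	/* guest physical 0 in our address space */
	int tapfd;
};

static void service_tx(struct txq *q)
{
	struct vring *vr = &q->ring;

	while (q->last_avail != vr->avail->idx) {
		uint16_t head = vr->avail->ring[q->last_avail % vr->num];
		struct iovec iov[16];
		unsigned int i = head, n = 0;

		/* Walk the descriptor chain into an iovec. */
		for (;;) {
			iov[n].iov_base = q->guest_base + vr->desc[i].addr;
			iov[n].iov_len = vr->desc[i].len;
			n++;
			if (!(vr->desc[i].flags & VRING_DESC_F_NEXT) || n == 16)
				break;
			i = vr->desc[i].next;
		}

		/* Header + packet go straight out the tap fd. */
		writev(q->tapfd, iov, n);

		/* Hand the chain back to the guest. */
		vr->used->ring[vr->used->idx % vr->num].id = head;
		vr->used->ring[vr->used->idx % vr->num].len = 0;
		__sync_synchronize();	/* fill the entry before bumping idx */
		vr->used->idx++;
		q->last_avail++;
	}
}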

>> > +        * We also trigger polling periodically after each 16 packets
>> > +        * (the value 16 here is more or less arbitrary, it's tuned to trigger
>> > +        * less than 10% of times).
>> >
>> > If you have any suggestions on how to improve this,
>> 
>> But this code does *not* wake every 16 packets.

It wakes when the reference count hits 16, 32, etc.  But that reference
count will go up and down as packets are processed: in theory, it might
go 0, 1, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3... ?

>> Anyway, what happens when we wake it?  I'm assuming it's the net tx
>> worker, which will just return the used bufs to the guest?  Wouldn't it
>> be better to do that directly?  Maybe wake the tx worker if it needs to
>> wake the guest, if we can't do that here.
>
> Sorry I don't get what you mean here.

Let's check I've got this straight:

Packet is fully xmitted:
       -> skb_release_data -> vhost_zerocopy_callback -> vhost_poll_queue()

       Wakes up tx thread -> handle_tx_kick -> handle_tx
             -> vhost_zerocopy_signal_used -> vhost_add_used_and_signal()

But what if we kmapped the ring so we could access it in
vhost_zerocopy_callback?  We could skip the wakeup altogether, right?

And it happens that I just wrote vringh accessors that work on
kernelspace, so it should be easier than it was before... :)
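
Roughly what I have in mind (every name below is made up, and the
locking is hand-waved: the callback can fire from softirq context when
the skb is finally freed, so the used ring would need its own lock
rather than the tx worker's mutex):

#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/spinlock.h>
#include <linux/vringh.h>

/* Invented structures for the sketch; not what vhost_net has today. */
struct zc_vq {
	struct vringh vrh;		/* kernel-mapped view of the tx ring */
	spinlock_t used_lock;		/* callback may run in softirq */
};

struct zc_buf {
	struct ubuf_info ubuf;
	u16 head;			/* avail head we popped for this skb */
	struct zc_vq *zvq;
};

/* Runs when the NIC is really done with the guest pages. */
static void zc_buf_done(struct ubuf_info *ubuf, bool success)
{
	struct zc_buf *zb = container_of(ubuf, struct zc_buf, ubuf);
	struct zc_vq *zvq = zb->zvq;
	unsigned long flags;

	spin_lock_irqsave(&zvq->used_lock, flags);

	/* Push the chain onto the used ring in place: no tx-worker wakeup. */
	vringh_complete_kern(&zvq->vrh, zb->head, 0);

	/* Only interrupt the guest if it hasn't suppressed notifications. */
	if (vringh_need_notify_kern(&zvq->vrh) > 0)
		vringh_notify(&zvq->vrh);

	spin_unlock_irqrestore(&zvq->used_lock, flags);
}

The only subtlety is ordering against the tx path that is still adding
buffers, hence the lock.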

>> But most importantly, it's *not* the host's problem to fix!  Just keep
>> the damn latency low.
>
> OK, let's try to rip it out and see what happens.

Thanks!  That'll make things much simpler for me to understand :)

>> >> 4) Why are the following fields in struct vhost_virtqueue, not struct
>> >>    vhost_net?
>> >> 
>> >> 	/* hdr is used to store the virtio header.
>> >> 	 * Since each iovec has >= 1 byte length, we never need more than
>> >> 	 * header length entries to store the header. */
>> >> 	struct iovec hdr[sizeof(struct virtio_net_hdr_mrg_rxbuf)];
>> >> 	size_t vhost_hlen;
>> >> 	size_t sock_hlen;
>> >> 	struct vring_used_elem *heads;
>> >> 
>> >>         ...
>> >
>> > Because these are used for both tx and rx.  While we handle both from
>> > the same thread at the moment (so we could reuse the same fields), there
>> > are some patches by Anthony that process tx and rx in separate threads,
>> > and sometimes it's a win.  So I don't want to close the door on that.
>> 
>> But they're *net* specific: They don't belong in the vhost structure.
>
> Ah, I see what you mean.  Yes, we really need vhost_net_virtqueue;
> we didn't have one, and I let some net-specific stuff seep through.
> We should do something like

Thanks.  I'll put that in my tree and start playing with it.

>> We do something like:
>> 
>>         err = tun_inject_skb(sock, skb);
>> 
>> Because the zerocopy_sg_from_iovec() in tun.c already has a comment
>> about putting it in the network core... in fact, it's cut & pasted into
>> macvtap :(
>
> Right. There's lots of code duplication in tun, macvtap and af_packet,
> we could factor code out, and then use it everywhere.
>
>> The msg thing for the callback is a nasty hack, AFAICT...
>
> Yes it's not pretty.

OK, I'll see what I can do.

> Sure, we have lots going on with vhost scsi ATM, but as long as changes
> are done in small steps I'll figure out conflicts if any without much
> trouble.

Yep, I think it can be done.  I've done some patches so far, and the
impact on tcm_vhost has been minimal...

Thanks,
Rusty.
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

