Re: TODO list for qemu+KVM networking performance v2

Rusty Russell wrote:
> On Fri, 5 Jun 2009 02:13:20 am Michael S. Tsirkin wrote:
>   
>> I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
>> well, and intend to dump updates there from time to time.
>>     
>
> Hi Michael,
>
>   Sorry for the delay.  I'm weaning myself off my virtio work, but virtio_net 
> performance is an issue which still needs lots of love.  
>
> BTW, a non-wiki page on the wiki?  You should probably rename it to 
> "MST_Networking_Performance" or allow editing :)
>
>   
>> 	- skbs in flight are kept in send queue linked list,
>> 	  so that we can flush them when device is removed
>> 	  [ mst: optimization idea: virtqueue already tracks
>>             posted buffers. Add flush/purge operation and use that instead?
>>     
>
> Interesting idea, but not really an optimization.  (flush_buf() which does a 
> get_buf() but for unused buffers).
>
>   
>> ] - skb is reformatted to scattergather format
>>           [ mst: idea to try: this does a copy for skb head,
>>             which might be costly especially for small/linear packets.
>> 	    Try to avoid this? Might need to tweak virtio interface.
>>           ]
>>     
>
> There's no copy here that I can see?
>
>   
>>         - network driver adds the packet buffer on TX ring
>> 	- network driver does a kick which causes a VM exit
>>           [ mst: any way to mitigate # of VM exits here?
>>             Possibly could be done on host side as well. ]
>> 	  [ markmc: All of our efforts there have been on the host side, I think
>>             that's preferable to trying to do anything on the guest side.
>> ]
>>     
>
> The current theoretical hole is that the host suppresses notifications using 
> the VRING_USED_F_NO_NOTIFY flag, but we can get a number of notifications in 
> before it gets to that suppression.  You can use a counter to improve this: 
> you only notify when the counters are equal, and increment yours when you 
> notify.  That way you suppress further notifications even if the other side 
> takes ages to wake up.  
> In practice, this shouldn't be played with until we have full aio (or equiv in 
> kernel) for the other side: host xmit tends to be too fast at the moment and 
> we get a notification per packet anyway.
>   
The Xen ring has had exactly this optimization for ages.  IMHO we should have 
it too, regardless of aio.  It reduces the number of VM exits and spurious 
wakeups, and it is very simple to implement.
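
If I read the counter idea right, it comes down to something like this (all
names hypothetical, untested sketch):

    /* The guest counts kicks it has sent; the host publishes how many
     * kicks it has consumed.  The guest only kicks when the two match,
     * so at most one notification is outstanding no matter how long
     * the host takes to wake up. */
    struct kick_counters {
            u16 guest_kicks;        /* written by guest */
            u16 host_consumed;      /* written by host when it wakes up */
    };

    static void maybe_kick(struct kick_counters *c)
    {
            if (c->guest_kicks == c->host_consumed) {
                    c->guest_kicks++;
                    notify_host();  /* hypothetical kick primitive */
            }
    }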

>   
>> 	- Full queue:
>> 		we keep a single extra skb around:
>> 			if we fail to transmit, we queue it
>> 			[ mst: idea to try: what does it do to
>>                           performance if we queue more packets? ]
>>     
>
> Bad idea!!  We already have two queues, this is a third.  We should either 
> stop the queue before it gets full, or fix TX_BUSY handling.  I've been arguing 
> on netdev for the latter (see thread "[PATCH 2/4] virtio_net: return 
> NETDEV_TX_BUSY instead of queueing an extra skb.").
>
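
FWIW, the "stop the queue before it gets full" variant would look roughly like
this in start_xmit (sketch only; the capacity margin and helper names are
guesses, and the real driver has to handle the race with the completion path):

    static int hyp_start_xmit(struct sk_buff *skb, struct net_device *dev)
    {
            struct my_priv *priv = netdev_priv(dev);

            /* Reclaim completed buffers, then post this skb. */
            free_old_xmit_skbs(priv);               /* hypothetical */
            add_skb_to_ring(priv, skb);             /* hypothetical */
            priv->svq->vq_ops->kick(priv->svq);

            /* Stop the queue while fewer than MAX_SKB_FRAGS + 2 slots
             * remain, so the next packet can never fail to fit. */
            if (ring_free_slots(priv->svq) < MAX_SKB_FRAGS + 2)
                    netif_stop_queue(dev);

            return NETDEV_TX_OK;
    }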
>   
>> 	        [ markmc: the queue might soon be going away:
>>                    200905292346.04815.rusty@xxxxxxxxxxxxxxx
>>     
>
> Ah, yep, that one.
>
>   
>> http://archive.netbsd.se/?ml=linux-netdev&a=2009-05&m=10788575 ]
>>
>> 	- We get each buffer from host as it is completed and free it
>>         - TX interrupts are only enabled when queue is stopped,
>>           and when it is originally created (we disable them on completion)
>>           [ mst: idea: second part is probably unintentional.
>>             todo: we probably should disable interrupts when device is
>> created. ]
>>     
>
> Yep, minor wart.
>
>   
>> 	- We poll for buffer completions:
>> 	  1. Before each TX
>> 	  2. On a timer tasklet (unless 3 is supported)
>> 	  3. When host sends us interrupt telling us that the queue is empty
>> 	  [ mst: idea to try: instead of empty, enable send interrupts on xmit
>> 	    when buffer is almost full (e.g. at least half empty): we are running
>> 	    out of buffers, it's important to free them ASAP. Can be done from
>> 	    host or from guest. ]
>>           [ Rusty proposing that we don't need (2) or (3) if the skbs are
>> orphaned before start_xmit(). See subj "net: skb_orphan on
>> dev_hard_start_xmit".] [ rusty also seems to be suggesting that disabling
>> VIRTIO_F_NOTIFY_ON_EMPTY on the host should help the case where the host
>> out-paces the guest ]
>>     
>
> Yes, that's more fruitful.
>
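
A sketch of the "enable TX interrupts when we're running low" idea, guest side
(hypothetical ring accessors, untested):

    /* Called after posting a TX buffer.  Completion interrupts stay off
     * while plenty of slots are free; below half capacity we ask for an
     * interrupt so used buffers get reclaimed promptly. */
    static void tx_tune_interrupts(struct my_priv *priv)
    {
            if (ring_free_slots(priv->svq) < priv->svq_size / 2)
                    priv->svq->vq_ops->enable_cb(priv->svq);
            else
                    priv->svq->vq_ops->disable_cb(priv->svq);
    }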
>   
>>         - Each skb has a 128 byte buffer at head and a single page for
>> data. Only full pages are passed to virtio buffers.
>>           [ mst: for large packets, managing the 128 head buffers is wasted
>>             effort. Try allocating skbs on rcv path when needed. ].
>> 	    [ mst: to clarify the previous suggestion: I am talking about
>> 	    merging here.  We currently allocate skbs and pages for them. If a
>> packet spans multiple pages, we discard the extra skbs.  Instead, let's
>> allocate pages but not skbs. Allocate and fill skbs on receive path. ]
>>     
>
> Yep.  There's another issue here, which is alignment: packets which get placed 
> into pages are misaligned (that 14 byte ethernet header).  We should add a 
> feature to allow the host to say "I've skipped this many bytes at the front".
>
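
The "host skips this many bytes" feature could be little more than an extra
field in the per-packet header, e.g. (purely hypothetical layout, not an
existing virtio_net header):

    /* Hypothetical extension of the receive header: the host says how
     * many pad bytes it wrote before the ethernet header, so the IP
     * header lands 4-byte aligned inside the page. */
    struct virtio_net_hdr_aligned {
            struct virtio_net_hdr hdr;
            __u16 data_offset;      /* bytes to skip at start of data */
    };

    /* The guest rx path would then do skb_reserve(skb, h->data_offset)
     * (or offset the page fragment) before attaching the data. */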
>   
>> 	- Buffers are replenished after packet is received,
>>           when number of buffers becomes low (below 1/2 max).
>>           This serves to reduce the number of kicks (VMexits) for RX.
>>           [ mst: code might become simpler if we add buffers
>>             immediately, but don't kick until later]
>> 	  [ markmc: possibly. batching this buffer allocation might be
>> 	    introducing more unpredictability to benchmarks too - i.e. there isn't
>> a fixed per-packet overhead, some packets randomly have a higher overhead]
>> on failure to allocate in atomic context we simply stop
>>           and try again on next recv packet.
>>           [mst: there's a fixme that this fails if we completely run out of
>> buffers, should be handled by timer. could be a thread as well (allocate
>> with GFP_KERNEL).
>>                 idea: might be good for performance anyway. ]
>>     
>
> Yeah, this "batched packet add" is completely unscientific.  The host will be 
> ignoring notifications anyway, so it shouldn't win anything AFAICT.  Ditch it 
> and benchmark.
>
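
FWIW, with the batching ditched the refill path would collapse to roughly the
following (hypothetical helpers, untested):

    /* Top up the ring whenever we poll, but kick at most once per pass
     * instead of once per half-ring batch. */
    static void refill_recv(struct my_priv *priv)
    {
            bool added = false;

            while (ring_free_slots(priv->rvq) > 0) {
                    if (add_recv_page(priv, GFP_ATOMIC) < 0)
                            break;          /* retry from timer/thread later */
                    added = true;
            }
            if (added)
                    priv->rvq->vq_ops->kick(priv->rvq);
    }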
>   
>> 	  After adding buffers, we do a kick.
>>           [ mst: test whether this optimization works: recv kicks should be
>> rare ] Outstanding buffers are kept on recv linked list.
>> 	  [ mst: optimization idea: virtqueue already tracks
>>             posted buffers. Add flush operation and use that instead. ]
>>     
>
> Don't understand this comment?
>
>   
>> 	- recv is done with napi: on recv interrupt, disable interrupts
>>           poll until queue is empty, enable when it's empty
>>          [mst: test how well does this work. should get 1 interrupt per
>>           N packets. what is N?]
>>     
>
> It works if the guest is outpacing the host, but in practice I had trouble 
> getting above about 2:1.  I've attached a spreadsheet showing the results of 
> various tests using lguest.  You can see the last one "lguest:net-delay-for-
> more-output.patch" where I actually inserted a silly 50 usec delay before 
> sending the receive interrupt: 47k irqs for 1M packets is great, too bad about 
> the latency :)
>
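
For reference, the poll loop in question is the usual NAPI shape, roughly
(sketch; helper names made up, races glossed over):

    static int my_napi_poll(struct napi_struct *napi, int budget)
    {
            struct my_priv *priv = container_of(napi, struct my_priv, napi);
            struct sk_buff *skb;
            int received = 0;

            while (received < budget && (skb = get_recv_buf(priv)) != NULL) {
                    receive_skb(priv->dev, skb);    /* hypothetical */
                    received++;
            }

            if (received < budget) {
                    /* Ring looks empty: stop polling and re-enable the
                     * interrupt.  If a buffer raced in, poll again. */
                    napi_complete(napi);
                    if (!priv->rvq->vq_ops->enable_cb(priv->rvq))
                            napi_schedule(napi);
            }
            return received;
    }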
>   
>>          [mst: idea: implement interrupt coalescing? ]
>>     
>
> lguest does this in the host, with mixed results.  Here's the commentary from 
> my lguest:reduce_triggers-on-recv.patch (which is queued for linux-next as I 
> believe it's the right thing even though win is in the noise).
>
> lguest: try to batch interrupts on network receive
>
> Rather than triggering an interrupt every time, we only trigger an
> interrupt when there are no more incoming packets (or the recv queue
> is full).
>
> However, the overhead of doing the select to figure this out is
> measurable: 1M pings goes from 98 to 104 seconds, and 1G Guest->Host
> TCP goes from 3.69 to 3.94 seconds.  It's close to the noise though.
>
> I tested various timeouts, including reducing it as the number of
> pending packets increased, timing a 1 gigabyte TCP send from Guest ->
> Host and Host -> Guest (GSO disabled, to increase packet rate).
>
> // time tcpblast -o -s 65536 -c 16k 192.168.2.1:9999 > /dev/null
>
> Timeout		Guest->Host	Pkts/irq	Host->Guest	Pkts/irq
> Before		11.3s		1.0		6.3s		1.0
> 0		11.7s		1.0		6.6s		23.5
> 1		17.1s		8.8		8.6s		26.0
> 1/pending	13.4s		1.9		6.6s		23.8
> 2/pending	13.6s		2.8		6.6s		24.1
> 5/pending	14.1s		5.0		6.6s		24.4
>
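
So the host-side rule in that patch boils down to roughly the following, in
sketch form (not the actual lguest code; helpers are hypothetical):

    /* Deliver packets into the guest's recv ring; only raise the
     * interrupt once no further packet shows up within a small timeout
     * (e.g. via select()), or the recv ring has filled up. */
    while (deliver_next_packet(vq)) {
            if (recv_ring_full(vq))
                    break;
            if (!packet_ready_within(tap_fd, timeout_usec))
                    break;
    }
    raise_guest_irq(vq);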
>   
>> 	[mst: some architectures (with expensive unaligned DMA) override
>> NET_IP_ALIGN. since we don't really do DMA, we probably should use
>> alignment of 2 always]
>>     
>
> That's unclear: what if the host is doing DMA?
>
>   
>> 		[ mst: question: there's a FIXME to avoid modulus in the math.
>>                   since num is a power of 2, isn't this just & (num - 1)?]
>>     
>
> Exactly.
>
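
i.e. simply:

    /* num is a power of two, so the wrap-around can be a mask: */
    i = (i + 1) % num;              /* before */
    i = (i + 1) & (num - 1);        /* after  */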
>   
>> 	Polling buffer:
>> 		we look at vq index and use that to find the next completed buffer
>> 		the pointer to data (skb) is retrieved and returned to user
>> 		[ mst: clearing data is only needed for debugging.
>>                   try removing this write - cache will be cleaner? ]
>>     
>
> It's our only way of detecting issues with hosts.  We have reports of BAD_RING 
> being triggered (unfortunately not reproducible).
>
>   
>> TX:
>> 	We poll for TX packets in 2 ways
>> 	- On timer event (see below)
>> 	- When we get a kick from guest
>> 	  At this point, we disable further notifications,
>> 	  and start a timer. Notifications are reenabled after this.
>> 	  This is designed to reduce the number of VMExits due to TX.
>> 	  [ markmc: tried removing the timer.
>>             It seems to really help some workloads. E.g. on RHEL:
>>             http://markmc.fedorapeople.org/virtio-netperf/2009-04-15/
>>             on fedora removing timer has no major effect either way:
>> 	    http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/g-h-tput-04-no-tx-timer.html ]
>>     
>
> lguest went fully multithreaded, dropped timer hack.  Much nicer, and faster.  
> (See second point on graph).  Timers are a hack because we're not async, so 
> fixing the real problem avoids that optimization guessing game entirely.
>
>   
>> 	Packets are polled from virtio ring, walking descriptor linked list.
>> 	[ mst: optimize for completing in order? ]
>> 	Packet addresses are converted to guest iovec, using
>> 	cpu_physical_memory_map
>> 	[ mst: cpu_physical_memory_map could be optimized
>>           to only handle what we actually use this for:
>>           single page in RAM ]
>>     
>
> Anthony had a patch for this IIRC.
>
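
The fast path would presumably look something like this on the qemu side (all
helper names hypothetical; the point is just to shortcut the general lookup
when a descriptor sits inside one page of guest RAM):

    /* Sketch: translate a guest-physical buffer with one cached RAM
     * base lookup when it fits in a single page, falling back to the
     * general cpu_physical_memory_map() path otherwise. */
    static void *map_single_page(uint64_t gpa, size_t len)
    {
            if ((gpa & (PAGE_SIZE - 1)) + len <= PAGE_SIZE &&
                gpa_is_guest_ram(gpa))                      /* hypothetical */
                    return guest_ram_base + (gpa - guest_ram_start);
            return NULL;    /* caller falls back to the slow path */
    }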
>   
>> Interrupts will be reported to eventfd descriptors, and device will poll
>> eventfd descriptors to get kicks from guest.
>>     
>
> This is definitely a win.  AFAICT you can inject interrupts into the guest from 
> a separate thread today in KVM, too, so there's no core reason why devices 
> can't be completely async with this one change.
>
> Cheers,
> Rusty.
>
>
>   
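
On the eventfd point: the plumbing itself is tiny.  Roughly, on the device
side (userspace sketch, error handling and the actual ring processing
omitted):

    #include <sys/eventfd.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <poll.h>

    static void process_tx_ring(void);  /* hypothetical */

    /* Guest kicks arrive as counter increments on kick_fd (an eventfd,
     * e.g. created with eventfd(0, 0) and wired up to the kick exit).
     * The device thread just polls it like any other fd, so TX handling
     * runs fully asynchronously to the vcpu thread. */
    static void device_thread_loop(int kick_fd)
    {
            struct pollfd pfd = { .fd = kick_fd, .events = POLLIN };
            uint64_t kicks;

            for (;;) {
                    poll(&pfd, 1, -1);
                    read(kick_fd, &kicks, sizeof(kicks));  /* kicks since last read */
                    process_tx_ring();
            }
    }

    /* The notifying side signals with:
     *         uint64_t one = 1;
     *         write(kick_fd, &one, sizeof(one));
     */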

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/virtualization
