Here are the results again with the addition of the interrupt rate that
occurred on the guest virtio_net device:

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40%  RxCPU: 99.35%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 5,154/5,222
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About 42% of baremetal.

Delayed freeing of TX buffers (average of six runs):
  Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
  Exits: 142,681.67 Exits/Sec
  TxCPU: 2.78%  RxCPU: 99.36%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,796/4,908
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 4% increase over baseline and about 44% of baremetal.

Delaying kick_notify (kick every 5 packets - average of six runs):
  Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
  Exits: 102,587.28 Exits/Sec
  TxCPU: 3.03%  RxCPU: 99.33%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,200/4,293
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 23% increase over baseline and about 52% of baremetal.

Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):
  Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
  Exits: 62,603.37 Exits/Sec
  TxCPU: 3.73%  RxCPU: 98.52%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 77% increase over baseline and about 74% of baremetal.
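The per-CPU interrupt rates above were collected in the guest.  One way to
derive rates like these is to sample the virtio1-input line of
/proc/interrupts twice and divide the deltas by the sampling interval.  The
sketch below only illustrates that calculation (it assumes a two-vCPU guest
and is not the tooling that produced the numbers above):

/*
 * Illustrative only: per-CPU interrupt rate for a named IRQ source,
 * computed from two samples of /proc/interrupts.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NCPUS 2

/* Fill counts[] from the /proc/interrupts line containing `name`. */
static int read_irq_counts(const char *name, unsigned long counts[NCPUS])
{
	char line[512];
	FILE *f = fopen("/proc/interrupts", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		char *p;

		if (!strstr(line, name))
			continue;
		/* Line format: "<irq>:  <cpu0 count>  <cpu1 count> ... <name>" */
		p = strchr(line, ':');
		if (!p)
			continue;
		p++;
		for (int i = 0; i < NCPUS; i++)
			counts[i] = strtoul(p, &p, 10);
		fclose(f);
		return 0;
	}
	fclose(f);
	return -1;
}

int main(void)
{
	unsigned long before[NCPUS], after[NCPUS];
	const unsigned int interval = 10;	/* seconds */

	if (read_irq_counts("virtio1-input", before))
		return 1;
	sleep(interval);
	if (read_irq_counts("virtio1-input", after))
		return 1;
	for (int i = 0; i < NCPUS; i++)
		printf("CPU%d: %lu interrupts/sec\n", i,
		       (after[i] - before[i]) / interval);
	return 0;
}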
On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > We've been doing some more experimenting with the small packet network
> > performance problem in KVM.  I have a different setup than what Steve D.
> > was using so I re-baselined things on the kvm.git kernel on both the
> > host and guest with a 10GbE adapter.  I also made use of the
> > virtio-stats patch.
> > 
> > The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
> > adapters (the first connected to a 1GbE adapter and a LAN, the second
> > connected to a 10GbE adapter that is direct connected to another system
> > with the same 10GbE adapter) running the kvm.git kernel.  The test was a
> > TCP_RR test with 100 connections from a baremetal client to the KVM
> > guest using a 256 byte message size in both directions.
> > 
> > I used the uperf tool to do this after verifying the results against
> > netperf.  Uperf allows the specification of the number of connections as
> > a parameter in an XML file as opposed to launching, in this case, 100
> > separate instances of netperf.
> > 
> > Here is the baseline for baremetal using 2 physical CPUs:
> >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> >   TxCPU: 7.88%  RxCPU: 99.41%
> > 
> > To be sure to get consistent results with KVM I disabled the
> > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > ethernet adapter interrupts (this resulted in runs that differed by only
> > about 2% from lowest to highest).  The fact that pinning is required to
> > get consistent results is a different problem that we'll have to look
> > into later...
> > 
> > Here is the KVM baseline (average of six runs):
> >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> >   Exits: 148,444.58 Exits/Sec
> >   TxCPU: 2.40%  RxCPU: 99.35%
> > 
> > About 42% of baremetal.
> 
> Can you add interrupt stats as well please?
> 
> > empty.  So I coded a quick patch to delay freeing of the used Tx buffers
> > until more than half the ring was used (I did not test this under a
> > stream condition so I don't know if this would have a negative impact).
> > Here are the results from delaying the freeing of used Tx buffers
> > (average of six runs):
> >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> >   Exits: 142,681.67 Exits/Sec
> >   TxCPU: 2.78%  RxCPU: 99.36%
> > 
> > About a 4% increase over baseline and about 44% of baremetal.
> 
> Hmm, I am not sure what you mean by delaying freeing.
> I think we do have a problem that free_old_xmit_skbs
> tries to flush out the ring aggressively:
> it always polls until the ring is empty,
> so there could be bursts of activity where
> we spend a lot of time flushing the old entries
> before e.g. sending an ack, resulting in
> latency bursts.
> 
> Generally we'll need some smarter logic,
> but with indirect at the moment we can just poll
> a single packet after we post a new one, and be done with it.
> Is your patch something like the patch below?
> Could you try mine as well please?
> 
> > This spread out the kick_notify but still resulted in a lot of them.  I
> > decided to build on the delayed Tx buffer freeing and code up an
> > "ethtool" like coalescing patch in order to delay the kick_notify until
> > there were at least 5 packets on the ring or 2000 usecs, whichever
> > occurred first.  Here are the results of delaying the kick_notify
> > (average of six runs):
> >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> >   Exits: 102,587.28 Exits/Sec
> >   TxCPU: 3.03%  RxCPU: 99.33%
> > 
> > About a 23% increase over baseline and about 52% of baremetal.
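The coalescing patch itself was not posted in this thread.  Roughly, the
idea described above is to count packets queued since the last notification
and only call virtqueue_kick() once either the packet threshold or the time
threshold is crossed.  A minimal sketch of that decision logic follows; the
struct, field and constant names are hypothetical, and a real implementation
would also need a timer so a deferred kick still goes out when no further
packets arrive:

/* Hypothetical sketch of "kick every 5 packets or 2000 usecs" --
 * not the actual patch discussed above. */
#include <linux/hrtimer.h>
#include <linux/ktime.h>
#include <linux/virtio.h>

#define TX_KICK_PKTS	5	/* kick after this many queued packets... */
#define TX_KICK_USECS	2000	/* ...or after this much elapsed time */

struct tx_coal {
	unsigned int pending;	/* packets queued since the last kick */
	ktime_t last_kick;	/* when the last kick was issued */
};

/* Called from the transmit path after adding a packet to the TX vring. */
static void tx_coal_maybe_kick(struct tx_coal *tc, struct virtqueue *svq)
{
	ktime_t now = ktime_get();

	tc->pending++;
	if (tc->pending < TX_KICK_PKTS &&
	    ktime_to_us(ktime_sub(now, tc->last_kick)) < TX_KICK_USECS)
		return;			/* defer the notification */

	virtqueue_kick(svq);		/* notify the host/vhost thread */
	tc->pending = 0;
	tc->last_kick = now;
}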
> > Running the perf command against the guest I noticed almost 19% of the
> > time being spent in _raw_spin_lock.  Enabling lockstat in the guest
> > showed a lot of contention in the "irq_desc_lock_class".  Pinning the
> > virtio1-input interrupt to a single cpu in the guest and re-running the
> > last test resulted in tremendous gains (average of six runs):
> >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> >   Exits: 62,603.37 Exits/Sec
> >   TxCPU: 3.73%  RxCPU: 98.52%
> > 
> > About a 77% increase over baseline and about 74% of baremetal.
> > 
> > Vhost is receiving a lot of notifications for packets that are to be
> > transmitted (over 60% of the packets generate a kick_notify).  Also, it
> > looks like vhost is sending a lot of notifications for packets it has
> > received before the guest can get scheduled to disable notifications and
> > begin processing the packets
> 
> Hmm, is this really what happens to you?  The effect would be that guest
> gets an interrupt while notifications are disabled in guest, right?  Could
> you add a counter and check this please?
> 
> Another possible thing to try would be these old patches to publish used
> index from guest to make sure this double interrupt does not happen:
> 	[PATCHv2] virtio: put last seen used index into ring itself
> 	[PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
> 
> > resulting in some lock contention in the guest (and high interrupt
> > rates).
> > 
> > Some thoughts for the transmit path... can vhost be enhanced to do some
> > adaptive polling so that the number of kick_notify events is reduced and
> > replaced by kick_no_notify events?
> 
> Worth a try.
> 
> > Comparing the transmit path to the receive path, the guest disables
> > notifications after the first kick and vhost re-enables notifications
> > after completing processing of the tx ring.
> 
> Is this really what happens?  I thought the host disables notifications
> after the first kick.
> 
> > Can a similar thing be done for the receive path?  Once vhost sends the
> > first notification for a received packet it can disable notifications
> > and let the guest re-enable notifications when it has finished
> > processing the receive ring.  Also, can the virtio-net driver do some
> > adaptive polling (or does napi take care of that for the guest)?
> 
> Worth a try.  I don't think napi does anything like this.
> 
> > Running the same workload on the same configuration with a different
> > hypervisor results in performance that is almost equivalent to baremetal
> > without doing any pinning.
> > 
> > Thanks,
> > Tom Lendacky
> 
> There's no need to flush out all used buffers
> before we post more for transmit: with indirect,
> just a single one is enough.  Without indirect we'll
> need more possibly, but just for testing this should
> be enough.
> 
> Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> 
> ---
> 
> Note: untested.
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 82dba5a..ebe3337 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
>  	struct sk_buff *skb;
>  	unsigned int len, tot_sgs = 0;
>  
> -	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
> +	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
>  		pr_debug("Sent skb %p\n", skb);
>  		vi->dev->stats.tx_bytes += skb->len;
>  		vi->dev->stats.tx_packets++;
> -		tot_sgs += skb_vnet_hdr(skb)->num_sg;
> +		tot_sgs = 2+MAX_SKB_FRAGS;
>  		dev_kfree_skb_any(skb);
>  	}
>  	return tot_sgs;
> @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int capacity;
>  
> -	/* Free up any pending old buffers before queueing new ones. */
> -	free_old_xmit_skbs(vi);
> -
>  	/* Try to transmit */
>  	capacity = xmit_skb(vi, skb);
>  
> @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	skb_orphan(skb);
>  	nf_reset(skb);
>  
> +	/* Free up any old buffers so we can queue new ones. */
> +	if (capacity < 2+MAX_SKB_FRAGS)
> +		capacity += free_old_xmit_skbs(vi);
> +
>  	/* Apparently nice girls don't return TX_BUSY; stop the queue
>  	 * before it gets out of hand.  Naturally, this wastes entries. */
>  	if (capacity < 2+MAX_SKB_FRAGS) {
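On the receive-path question raised above (letting the guest suppress
notifications while it drains the RX ring, and whether NAPI covers this):
with the virtio API, the guest-side half of that handshake is the
virtqueue_disable_cb()/virtqueue_enable_cb() pair driven from NAPI -- the
interrupt callback suppresses further callbacks and schedules polling, and
the poll routine re-enables callbacks only once the ring is drained,
re-arming if a buffer raced in meanwhile.  The sketch below is a simplified
illustration in the style of the virtio_net receive path, not verbatim
driver code; struct rx_ctx stands in for the driver's private state.

#include <linux/netdevice.h>
#include <linux/virtio.h>

/* Minimal stand-in for the driver's private state (illustration only). */
struct rx_ctx {
	struct virtqueue *rvq;		/* receive virtqueue */
	struct napi_struct napi;
};

/* RX virtqueue callback: suppress further notifications, defer to NAPI.
 * Assumes rvq->vdev->priv points at our rx_ctx, as virtio drivers
 * conventionally arrange. */
static void rx_vq_callback(struct virtqueue *rvq)
{
	struct rx_ctx *ctx = rvq->vdev->priv;

	virtqueue_disable_cb(rvq);	/* no more interrupts for now */
	napi_schedule(&ctx->napi);
}

/* NAPI poll routine: drain the ring, then re-enable notifications. */
static int rx_poll(struct napi_struct *napi, int budget)
{
	struct rx_ctx *ctx = container_of(napi, struct rx_ctx, napi);
	unsigned int len;
	void *buf;
	int received = 0;

	while (received < budget &&
	       (buf = virtqueue_get_buf(ctx->rvq, &len)) != NULL) {
		/* ...build an skb from buf and pass it up the stack... */
		received++;
	}

	if (received < budget) {
		napi_complete(napi);
		/* A buffer may have been used between the last get_buf and
		 * re-enabling callbacks; if so, disable again and re-poll. */
		if (unlikely(!virtqueue_enable_cb(ctx->rvq)) &&
		    napi_schedule_prep(napi)) {
			virtqueue_disable_cb(ctx->rvq);
			__napi_schedule(napi);
		}
	}
	return received;
}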