On Fri, 5 Jun 2009 02:13:20 am Michael S. Tsirkin wrote:
> I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
> well, and intend to dump updates there from time to time.

Hi Michael,

Sorry for the delay.  I'm weaning myself off my virtio work, but virtio_net
performance is an issue which still needs lots of love.

BTW, a non-wiki page on the wiki?  You should probably rename it to
"MST_Networking_Performance" or allow editing :)

> - skbs in flight are kept in send queue linked list,
>   so that we can flush them when device is removed
>   [ mst: optimization idea: virtqueue already tracks
>     posted buffers. Add flush/purge operation and use that instead? ]

Interesting idea, but not really an optimization (a flush_buf() which does a
get_buf(), but for unused buffers).

> - skb is reformatted to scattergather format
>   [ mst: idea to try: this does a copy for skb head,
>     which might be costly especially for small/linear packets.
>     Try to avoid this? Might need to tweak virtio interface. ]

There's no copy here that I can see?

> - network driver adds the packet buffer on TX ring
> - network driver does a kick which causes a VM exit
>   [ mst: any way to mitigate # of VM exits here?
>     Possibly could be done on host side as well. ]
>   [ markmc: All of our efforts there have been on the host side, I think
>     that's preferable to trying to do anything on the guest side. ]

The current theoretical hole is that the host suppresses notifications using
the VRING_USED_F_NO_NOTIFY flag, but we can get a number of notifications in
before it gets to that suppression.

You can use a counter to improve this: you only notify when they're equal,
and increment when you notify.  That way you suppress further notifications
even if the other side takes ages to wake up.

In practice, this shouldn't be played with until we have full aio (or the
equivalent in the kernel) for the other side: host xmit tends to be too fast
at the moment and we get a notification per packet anyway.

> - Full queue:
>   we keep a single extra skb around:
>   if we fail to transmit, we queue it
>   [ mst: idea to try: what does it do to
>     performance if we queue more packets? ]

Bad idea!!  We already have two queues, this is a third.  We should either
stop the queue before it gets full, or fix TX_BUSY handling.  I've been
arguing on netdev for the latter (see the thread "[PATCH 2/4] virtio_net:
return NETDEV_TX_BUSY instead of queueing an extra skb.").

> [ markmc: the queue might soon be going away:
>   200905292346.04815.rusty@xxxxxxxxxxxxxxx

Ah, yep, that one.

>   http://archive.netbsd.se/?ml=linux-netdev&a=2009-05&m=10788575 ]

> - We get each buffer from host as it is completed and free it
> - TX interrupts are only enabled when queue is stopped,
>   and when it is originally created (we disable them on completion)
>   [ mst: idea: second part is probably unintentional.
>     todo: we probably should disable interrupts when device is
>     created. ]

Yep, minor wart.

> - We poll for buffer completions:
>   1. Before each TX
>   2. On a timer tasklet (unless 3 is supported)
>   3. When host sends us interrupt telling us that the queue is empty
>   [ mst: idea to try: instead of empty, enable send interrupts on xmit
>     when buffer is almost full (e.g. at least half empty): we are running
>     out of buffers, it's important to free them ASAP. Can be done from host
>     or from guest. ]
>   [ Rusty proposing that we don't need (2) or (3) if the skbs are
>     orphaned before start_xmit(). See subj "net: skb_orphan on
>     dev_hard_start_xmit". ]
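For concreteness, the skb_orphan idea amounts to something like the sketch
below.  This is an illustration only, not the patch from that thread: it
assumes the usual drivers/net/virtio_net.c context and reuses that driver's
free_old_xmit_skbs()/xmit_skb() helpers from memory, and it simply drops on
a full ring where the real driver would stop the queue.

/* Sketch only: orphaning the skb at the top of start_xmit.  Once the skb
 * no longer holds the sender's socket memory, freeing transmitted buffers
 * is no longer urgent, so the timer tasklet (2) and the "queue empty"
 * interrupt (3) above become unnecessary. */
static int start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);

	/* (1) Opportunistically reclaim whatever the host has finished with. */
	free_old_xmit_skbs(vi);

	/* Detach from the socket now, crediting its send buffer immediately. */
	skb_orphan(skb);

	if (xmit_skb(vi, skb) < 0) {
		/* Ring full: this sketch just drops; the real driver would
		 * stop the queue and retry instead. */
		dev_kfree_skb_any(skb);
		dev->stats.tx_dropped++;
		return NETDEV_TX_OK;
	}

	vi->svq->vq_ops->kick(vi->svq);
	return NETDEV_TX_OK;
}

The point is that once the skb is orphaned, TX reaping has no latency
requirement at all, so the timer and the queue-empty interrupt are pure
overhead.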
> [ rusty also seems to be suggesting that disabling
>   VIRTIO_F_NOTIFY_ON_EMPTY on the host should help the case where the host
>   out-paces the guest ]

Yes, that's more fruitful.

> - Each skb has a 128 byte buffer at head and a single page for
>   data. Only full pages are passed to virtio buffers.
>   [ mst: for large packets, managing the 128 head buffers is wasted
>     effort. Try allocating skbs on rcv path when needed. ]
>   [ mst: to clarify the previous suggestion: I am talking about
>     merging here. We currently allocate skbs and pages for them. If a
>     packet spans multiple pages, we discard the extra skbs. Instead, let's
>     allocate pages but not skbs. Allocate and fill skbs on receive path. ]

Yep.  There's another issue here, which is alignment: packets which get
placed into pages are misaligned (that 14-byte ethernet header).  We should
add a feature to allow the host to say "I've skipped this many bytes at the
front".

> - Buffers are replenished after packet is received,
>   when number of buffers becomes low (below 1/2 max).
>   This serves to reduce the number of kicks (VMexits) for RX.
>   [ mst: code might become simpler if we add buffers
>     immediately, but don't kick until later ]
>   [ markmc: possibly. batching this buffer allocation might be
>     introducing more unpredictability to benchmarks too - i.e. there isn't
>     a fixed per-packet overhead, some packets randomly have a higher
>     overhead ]
>   On failure to allocate in atomic context we simply stop
>   and try again on next recv packet.
>   [ mst: there's a fixme that this fails if we completely run out of
>     buffers; should be handled by timer. could be a thread as well
>     (allocate with GFP_KERNEL).
>     idea: might be good for performance anyway. ]

Yeah, this "batched packet add" is completely unscientific.  The host will
be ignoring notifications anyway, so it shouldn't win anything AFAICT.
Ditch it and benchmark.

> After adding buffers, we do a kick.
>   [ mst: test whether this optimization works: recv kicks should be
>     rare ]
>   Outstanding buffers are kept on recv linked list.
>   [ mst: optimization idea: virtqueue already tracks
>     posted buffers. Add flush operation and use that instead. ]

Don't understand this comment?

> - recv is done with napi: on recv interrupt, disable interrupts,
>   poll until queue is empty, enable when it's empty
>   [ mst: test how well does this work. should get 1 interrupt per
>     N packets. what is N? ]

It works if the guest is outpacing the host, but in practice I had trouble
getting above about 2:1.

I've attached a spreadsheet showing the results of various tests using
lguest.  You can see the last one, "lguest:net-delay-for-more-output.patch",
where I actually inserted a silly 50 usec delay before sending the receive
interrupt: 47k irqs for 1M packets is great, too bad about the latency :)

> [ mst: idea: implement interrupt coalescing? ]

lguest does this in the host, with mixed results.  Here's the commentary
from my lguest:reduce_triggers-on-recv.patch (which is queued for linux-next,
as I believe it's the right thing even though the win is in the noise):

lguest: try to batch interrupts on network receive

Rather than triggering an interrupt every time, we only trigger an interrupt
when there are no more incoming packets (or the recv queue is full).
However, the overhead of doing the select to figure this out is measurable:
1M pings goes from 98 to 104 seconds, and 1G Guest->Host TCP goes from 3.69
to 3.94 seconds.  It's close to the noise though.
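The decision the launcher makes is roughly the following.  This is an
illustrative userspace sketch of the idea only, not the actual
Documentation/lguest/lguest.c code; the ring helpers are hypothetical
stand-ins, and the select() here is exactly the overhead mentioned above.

#include <stdbool.h>
#include <sys/select.h>

/* Hypothetical stand-ins for the launcher's real ring handling: */
void add_packet_to_recv_ring(void);
bool recv_ring_full(void);
void trigger_recv_interrupt(void);

/* Is another packet already waiting on the tun fd?  The timeout is the
 * knob benchmarked below. */
static bool more_packets_pending(int tun_fd, long timeout_usec)
{
	struct timeval tv = { .tv_sec = 0, .tv_usec = timeout_usec };
	fd_set rfds;

	FD_ZERO(&rfds);
	FD_SET(tun_fd, &rfds);
	/* This select() is the measurable overhead. */
	return select(tun_fd + 1, &rfds, NULL, NULL, &tv) > 0;
}

static void deliver_packet(int tun_fd, long timeout_usec)
{
	add_packet_to_recv_ring();
	/* Only interrupt the guest if nothing else is about to arrive,
	 * or if the receive ring is full and must be drained. */
	if (recv_ring_full() || !more_packets_pending(tun_fd, timeout_usec))
		trigger_recv_interrupt();
}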
I tested various timeouts, including reducing it as the number of pending
packets increased, timing a 1 gigabyte TCP send from Guest -> Host and
Host -> Guest (GSO disabled, to increase packet rate).

// time tcpblast -o -s 65536 -c 16k 192.168.2.1:9999 > /dev/null

Timeout      Guest->Host  Pkts/irq   Host->Guest  Pkts/irq
Before       11.3s        1.0        6.3s         1.0
0            11.7s        1.0        6.6s         23.5
1            17.1s        8.8        8.6s         26.0
1/pending    13.4s        1.9        6.6s         23.8
2/pending    13.6s        2.8        6.6s         24.1
5/pending    14.1s        5.0        6.6s         24.4

> [ mst: some architectures (with expensive unaligned DMA) override
>   NET_IP_ALIGN. since we don't really do DMA, we probably should use
>   alignment of 2 always ]

That's unclear: what if the host is doing DMA?

> [ mst: question: there's a FIXME to avoid modulus in the math.
>   since num is a power of 2, isn't this just & (num - 1)? ]

Exactly.

> Polling buffer:
>   we look at vq index and use that to find the next completed buffer
>   the pointer to data (skb) is retrieved and returned to user
>   [ mst: clearing data is only needed for debugging.
>     try removing this write - cache will be cleaner? ]

It's our only way of detecting issues with hosts.  We have reports of
BAD_RING being triggered (unf. not reproducible).

> TX:
> We poll for TX packets in 2 ways
> - On timer event (see below)
> - When we get a kick from guest
> At this point, we disable further notifications,
> and start a timer. Notifications are reenabled after this.
> This is designed to reduce the number of VMExits due to TX.
>   [ markmc: tried removing the timer.
>     It seems to really help some workloads. E.g. on RHEL:
>     http://markmc.fedorapeople.org/virtio-netperf/2009-04-15/
>     on fedora removing timer has no major effect either way:
>     http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/g-h-tput-04-no-tx-timer.html ]

lguest went fully multithreaded, dropped the timer hack.  Much nicer, and
faster (see the second point on the graph).  Timers are a hack because we're
not async, so fixing the real problem avoids that optimization guessing game
entirely.

> Packets are polled from virtio ring, walking descriptor linked list.
>   [ mst: optimize for completing in order? ]
> Packet addresses are converted to guest iovec, using
> cpu_physical_memory_map
>   [ mst: cpu_physical_memory_map could be optimized
>     to only handle what we actually use this for:
>     single page in RAM ]

Anthony had a patch for this IIRC.

> Interrupts will be reported to eventfd descriptors, and device will poll
> eventfd descriptors to get kicks from guest.

This is definitely a win.  AFAICT you can inject interrupts into the guest
from a separate thread today in KVM, too, so there's no core reason why
devices can't be completely async with this one change.

Cheers,
Rusty.
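P.S. For anyone unfamiliar with the eventfd model above, here is a minimal
self-contained illustration of the signalling primitive only (plain
eventfd(2) and poll(2); a toy, not the actual KVM ioeventfd/irqfd wiring):

#include <poll.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int kick_fd;

/* "Guest" side: a kick is just bumping the eventfd counter. */
static void *guest_thread(void *arg)
{
	uint64_t one = 1;
	int i;

	for (i = 0; i < 3; i++)
		if (write(kick_fd, &one, sizeof(one)) != sizeof(one))
			perror("kick");
	return NULL;
}

int main(void)
{
	struct pollfd pfd;
	pthread_t t;
	uint64_t count;

	kick_fd = eventfd(0, 0);
	pthread_create(&t, NULL, guest_thread, NULL);

	/* "Host" side: poll the eventfd alongside any other fds, then one
	 * read collects (and coalesces) however many kicks have arrived. */
	pfd.fd = kick_fd;
	pfd.events = POLLIN;
	poll(&pfd, 1, -1);
	if (read(kick_fd, &count, sizeof(count)) == sizeof(count))
		printf("%llu kick(s) coalesced into one wakeup\n",
		       (unsigned long long)count);

	pthread_join(t, NULL);
	close(kick_fd);
	return 0;
}

Several kicks collapse into a single wakeup and a single read, which is why
polling eventfds from a separate device thread fits the "completely async"
model.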
Attachment: results3.gnumeric (application/gnumeric)