Rusty Russell wrote:
> On Fri, 5 Jun 2009 02:13:20 am Michael S. Tsirkin wrote:
>> I put up a copy at http://www.linux-kvm.org/page/Networking_Performance as
>> well, and intend to dump updates there from time to time.
>
> Hi Michael,
>
> Sorry for the delay.  I'm weaning myself off my virtio work, but virtio_net
> performance is an issue which still needs lots of love.
>
> BTW, a non-wiki page on the wiki?  You should probably rename it to
> "MST_Networking_Performance" or allow editing :)
>
>> - skbs in flight are kept in send queue linked list,
>>   so that we can flush them when device is removed
>>   [ mst: optimization idea: virtqueue already tracks
>>     posted buffers. Add flush/purge operation and use that instead? ]
>
> Interesting idea, but not really an optimization.  (flush_buf() which does a
> get_buf() but for unused buffers.)
>
>> - skb is reformatted to scattergather format
>>   [ mst: idea to try: this does a copy for skb head,
>>     which might be costly especially for small/linear packets.
>>     Try to avoid this? Might need to tweak virtio interface. ]
>
> There's no copy here that I can see?
>
>> - network driver adds the packet buffer on TX ring
>> - network driver does a kick which causes a VM exit
>>   [ mst: any way to mitigate # of VM exits here?
>>     Possibly could be done on host side as well. ]
>>   [ markmc: All of our efforts there have been on the host side; I think
>>     that's preferable to trying to do anything on the guest side. ]
>
> The current theoretical hole is that the host suppresses notifications using
> the VRING_USED_F_NO_NOTIFY flag, but we can get a number of notifications in
> before it gets to that suppression.  You can use a counter to improve this:
> you only notify when they're equal, and inc when you notify.  That way you
> suppress further notifications even if the other side takes ages to wake up.
> In practice, this shouldn't be played with until we have full aio (or the
> equivalent in kernel) for the other side: host xmit tends to be too fast at
> the moment and we get a notification per packet anyway.

The Xen ring has had exactly this optimization for ages.  IMHO we should have
it too, regardless of aio: it reduces the number of VM exits and spurious
wakeups, and it is very simple to implement.
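If I'm reading the counter idea right, it boils down to something like the
sketch below.  The names are made up and memory barriers are omitted, so
treat it as an illustration of the scheme rather than a proposed ring ABI:

#include <stdint.h>

/*
 * Two shared counters: the notifier bumps kicks_sent when it kicks, and
 * the consumer publishes kicks_seen when it wakes up to service the ring.
 */
struct kick_counters {
	volatile uint32_t kicks_sent;   /* bumped by the guest when it kicks */
	volatile uint32_t kicks_seen;   /* published by the host on wakeup   */
};

/* Stand-in for the real notification, e.g. the PIO write that causes the exit. */
static void notify_host(void) { }

/* Guest side: called after adding a buffer to the avail ring. */
static void maybe_kick(struct kick_counters *c)
{
	/*
	 * Only notify if the host has already woken up for every kick sent
	 * so far.  Once we kick, kicks_sent != kicks_seen, so further adds
	 * stay silent until the host catches up -- even if it takes ages
	 * to wake up.
	 */
	if (c->kicks_sent == c->kicks_seen) {
		c->kicks_sent++;
		notify_host();
	}
}

/* Host side: acknowledge all kicks so far, then drain the avail ring. */
static void host_wakeup(struct kick_counters *c)
{
	c->kicks_seen = c->kicks_sent;
	/* ... process the ring ... */
}

That would cap the TX path at one notification per host wakeup, no matter how
long the host takes to get around to setting the suppression flag.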
>> - Full queue:
>>   we keep a single extra skb around:
>>   if we fail to transmit, we queue it
>>   [ mst: idea to try: what does it do to
>>     performance if we queue more packets? ]
>
> Bad idea!!  We already have two queues, this is a third.  We should either
> stop the queue before it gets full, or fix TX_BUSY handling.  I've been
> arguing on netdev for the latter (see thread "[PATCH 2/4] virtio_net: return
> NETDEV_TX_BUSY instead of queueing an extra skb.").
>
>>   [ markmc: the queue might soon be going away:
>>     200905292346.04815.rusty@xxxxxxxxxxxxxxx
>
> Ah, yep, that one.
>
>>     http://archive.netbsd.se/?ml=linux-netdev&a=2009-05&m=10788575 ]
>>
>> - We get each buffer from host as it is completed and free it
>> - TX interrupts are only enabled when queue is stopped,
>>   and when it is originally created (we disable them on completion)
>>   [ mst: idea: second part is probably unintentional.
>>     todo: we probably should disable interrupts when device is
>>     created. ]
>
> Yep, minor wart.
>
>> - We poll for buffer completions:
>>   1. Before each TX
>>   2. On a timer tasklet (unless 3 is supported)
>>   3. When host sends us interrupt telling us that the queue is empty
>>   [ mst: idea to try: instead of empty, enable send interrupts on xmit
>>     when buffer is almost full (e.g. at least half empty): we are running
>>     out of buffers, it's important to free them ASAP. Can be done from
>>     host or from guest. ]
>>   [ Rusty proposing that we don't need (2) or (3) if the skbs are
>>     orphaned before start_xmit(). See subj "net: skb_orphan on
>>     dev_hard_start_xmit". ]
>>   [ rusty also seems to be suggesting that disabling
>>     VIRTIO_F_NOTIFY_ON_EMPTY on the host should help the case where the
>>     host out-paces the guest ]
>
> Yes, that's more fruitful.
>
>> - Each skb has a 128 byte buffer at head and a single page for
>>   data. Only full pages are passed to virtio buffers.
>>   [ mst: for large packets, managing the 128 head buffers is wasted
>>     effort. Try allocating skbs on rcv path when needed. ]
>>   [ mst: to clarify the previous suggestion: I am talking about
>>     merging here. We currently allocate skbs and pages for them. If a
>>     packet spans multiple pages, we discard the extra skbs. Instead, let's
>>     allocate pages but not skbs. Allocate and fill skbs on receive path. ]
>
> Yep.  There's another issue here, which is alignment: packets which get
> placed into pages are misaligned (that 14 byte ethernet header).  We should
> add a feature to allow the host to say "I've skipped this many bytes at the
> front".
>
>> - Buffers are replenished after a packet is received,
>>   when the number of buffers becomes low (below 1/2 max).
>>   This serves to reduce the number of kicks (VM exits) for RX.
>>   [ mst: code might become simpler if we add buffers
>>     immediately, but don't kick until later ]
>>   [ markmc: possibly. batching this buffer allocation might be
>>     introducing more unpredictability to benchmarks too - i.e. there isn't
>>     a fixed per-packet overhead, some packets randomly have a higher
>>     overhead ]
>>   On failure to allocate in atomic context we simply stop
>>   and try again on the next recv packet.
>>   [ mst: there's a fixme that this fails if we completely run out of
>>     buffers; should be handled by a timer. Could be a thread as well
>>     (allocate with GFP_KERNEL).
>>     idea: might be good for performance anyway. ]
>
> Yeah, this "batched packet add" is completely unscientific.  The host will
> be ignoring notifications anyway, so it shouldn't win anything AFAICT.
> Ditch it and benchmark.
>
>> After adding buffers, we do a kick.
>>   [ mst: test whether this optimization works: recv kicks should be rare ]
>> Outstanding buffers are kept on a recv linked list.
>>   [ mst: optimization idea: virtqueue already tracks
>>     posted buffers. Add flush operation and use that instead. ]
>
> Don't understand this comment?
>
>> - recv is done with napi: on recv interrupt, disable interrupts,
>>   poll until queue is empty, enable when it's empty
>>   [ mst: test how well this works. should get 1 interrupt per
>>     N packets. what is N? ]
>
> It works if the guest is outpacing the host, but in practice I had trouble
> getting above about 2:1.  I've attached a spreadsheet showing the results of
> various tests using lguest.  You can see the last one
> "lguest:net-delay-for-more-output.patch" where I actually inserted a silly
> 50 usec delay before sending the receive interrupt: 47k irqs for 1M packets
> is great, too bad about the latency :)
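To make the pattern under discussion concrete, the receive side boils down to
something like the sketch below.  This is not the actual virtio_net code: the
vq_*() helpers and napi_done() are stand-ins for the real virtqueue ops and
NAPI calls, and budget/locking details are left out.

struct vq;                                  /* the receive virtqueue */

extern void *vq_get_buf(struct vq *vq, unsigned int *len); /* completed buf or NULL */
extern int   vq_enable_cb(struct vq *vq);   /* re-enable irq; 0 if work snuck in */
extern void  vq_disable_cb(struct vq *vq);  /* suppress further interrupts */
extern void  deliver_packet(void *buf, unsigned int len);
extern void  napi_done(struct vq *vq);      /* tell the stack polling is finished */

/* The interrupt handler just disables further interrupts and schedules this: */
static int rx_poll(struct vq *vq, int budget)
{
	unsigned int len;
	void *buf;
	int received = 0;

again:
	/* Interrupts are off; drain everything the host has completed. */
	while (received < budget && (buf = vq_get_buf(vq, &len)) != NULL) {
		deliver_packet(buf, len);
		received++;
	}

	if (received < budget) {
		/*
		 * Ring looks empty, so try to re-enable the interrupt.  A
		 * packet may have slipped in between the last vq_get_buf()
		 * and this point; in that case keep interrupts off and keep
		 * polling rather than losing that completion.
		 */
		if (!vq_enable_cb(vq)) {
			vq_disable_cb(vq);
			goto again;
		}
		napi_done(vq);
	}
	return received;
}

The "1 interrupt per N packets" N that mst asks about is then just how many
completions the while loop manages to drain per scheduled poll, which is why
it depends on the guest out-pacing the host.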
>> [ mst: idea: implement interrupt coalescing? ]
>
> lguest does this in the host, with mixed results.  Here's the commentary
> from my lguest:reduce_triggers-on-recv.patch (which is queued for linux-next
> as I believe it's the right thing even though the win is in the noise).
>
>     lguest: try to batch interrupts on network receive
>
>     Rather than triggering an interrupt every time, we only trigger an
>     interrupt when there are no more incoming packets (or the recv queue
>     is full).
>
>     However, the overhead of doing the select to figure this out is
>     measurable: 1M pings goes from 98 to 104 seconds, and 1G Guest->Host
>     TCP goes from 3.69 to 3.94 seconds.  It's close to the noise though.
>
>     I tested various timeouts, including reducing it as the number of
>     pending packets increased, timing a 1 gigabyte TCP send from Guest ->
>     Host and Host -> Guest (GSO disabled, to increase packet rate).
>
>     // time tcpblast -o -s 65536 -c 16k 192.168.2.1:9999 > /dev/null
>
>     Timeout      Guest->Host  Pkts/irq   Host->Guest  Pkts/irq
>     Before       11.3s        1.0        6.3s         1.0
>     0            11.7s        1.0        6.6s         23.5
>     1            17.1s        8.8        8.6s         26.0
>     1/pending    13.4s        1.9        6.6s         23.8
>     2/pending    13.6s        2.8        6.6s         24.1
>     5/pending    14.1s        5.0        6.6s         24.4
>
>> [ mst: some architectures (with expensive unaligned DMA) override
>>   NET_IP_ALIGN. since we don't really do DMA, we probably should use
>>   alignment of 2 always ]
>
> That's unclear: what if the host is doing DMA?
>
>> [ mst: question: there's a FIXME to avoid modulus in the math.
>>   since num is a power of 2, isn't this just & (num - 1)? ]
>
> Exactly.
>
>> Polling buffer:
>>   we look at the vq index and use that to find the next completed buffer
>>   the pointer to data (skb) is retrieved and returned to user
>>   [ mst: clearing data is only needed for debugging.
>>     try removing this write - cache will be cleaner? ]
>
> It's our only way of detecting issues with hosts.  We have reports of
> BAD_RING being triggered (unfortunately not reproducible).
>
>> TX:
>> We poll for TX packets in 2 ways
>> - On timer event (see below)
>> - When we get a kick from guest
>> At this point, we disable further notifications,
>> and start a timer.  Notifications are re-enabled after this.
>> This is designed to reduce the number of VM exits due to TX.
>>   [ markmc: tried removing the timer.
>>     It seems to really help some workloads. E.g. on RHEL:
>>     http://markmc.fedorapeople.org/virtio-netperf/2009-04-15/
>>     on Fedora removing the timer has no major effect either way:
>>     http://markmc.fedorapeople.org/virtio-netperf/2008-11-06/g-h-tput-04-no-tx-timer.html ]
>
> lguest went fully multithreaded and dropped the timer hack.  Much nicer, and
> faster (see second point on graph).  Timers are a hack because we're not
> async, so fixing the real problem avoids that optimization guessing game
> entirely.
>
>> Packets are polled from the virtio ring, walking the descriptor linked list.
>>   [ mst: optimize for completing in order? ]
>> Packet addresses are converted to guest iovec, using
>> cpu_physical_memory_map
>>   [ mst: cpu_physical_memory_map could be optimized
>>     to only handle what we actually use this for:
>>     a single page in RAM ]
>
> Anthony had a patch for this IIRC.
>
>> Interrupts will be reported to eventfd descriptors, and the device will
>> poll eventfd descriptors to get kicks from the guest.
>
> This is definitely a win.  AFAICT you can inject interrupts into the guest
> from a separate thread today in KVM, too, so there's no core reason why
> devices can't be completely async with this one change.
>
> Cheers,
> Rusty.
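One more note on the eventfd point at the end: once kicks arrive on an
eventfd, the host side of the device can be an ordinary thread that blocks in
read() and services the ring, instead of taking the synchronous exit path
through the emulator.  A rough sketch of just that mechanism (the wiring that
makes a guest kick signal the fd is the part being proposed above, and
process_tx_ring() is a placeholder):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/eventfd.h>

/* Placeholder: drain the TX virtqueue and push used entries back. */
static void process_tx_ring(void)
{
}

static void tx_kick_loop(int kick_fd)
{
	uint64_t kicks;

	for (;;) {
		/* Blocks until the guest kicks; back-to-back kicks coalesce
		 * into a single wakeup, since eventfd just sums the writes. */
		if (read(kick_fd, &kicks, sizeof(kicks)) != sizeof(kicks)) {
			perror("read(eventfd)");
			return;
		}
		process_tx_ring();
	}
}

int main(void)
{
	int fd = eventfd(0, 0);

	if (fd < 0) {
		perror("eventfd");
		return 1;
	}
	/* In real life this fd is handed to the hypervisor so that the
	 * guest's kick signals it. */
	tx_kick_loop(fd);
	return 0;
}

The interrupt direction is symmetrical: the device thread writes a second
eventfd and the hypervisor turns that into a guest interrupt, which is what
lets the whole device run outside the vcpu thread.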