On Wed, Sep 27, 2017 at 10:04:18AM +0800, Jason Wang wrote:
>
>
> On 2017-09-27 03:25, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
> > > This patch implements basic batched processing of tx virtqueue by
> > > prefetching desc indices and updating used ring in a batch. For
> > > non-zerocopy case, vq->heads were used for storing the prefetched
> > > indices and updating used ring. It is also a requirement for doing
> > > more batching on top. For zerocopy case and for simplicity, batched
> > > processing were simply disabled by only fetching and processing one
> > > descriptor at a time, this could be optimized in the future.
> > >
> > > XDP_DROP (without touching skb) on tun (with Moongen in guest) with
> > > zerocopy disabled:
> > >
> > > Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
> > > Before: 3.20Mpps
> > > After:  3.90Mpps (+22%)
> > >
> > > No differences were seen with zerocopy enabled.
> > >
> > > Signed-off-by: Jason Wang <jasowang@xxxxxxxxxx>
> > So where is the speedup coming from? I'd guess the ring is
> > hot in cache, it's faster to access it in one go, then
> > pass many packets to net stack. Is that right?
> >
> > Another possibility is better code cache locality.
>
> Yes, I think the speed up comes from:
>
> - less cache misses
> - less cache line bounce when virtqueue is about to be full (guest is faster
>   than host which is the case of MoonGen)
> - less memory barriers
> - possible faster copy speed by using copy_to_user() on modern CPUs
>
> >
> > So how about this patchset is refactored:
> >
> > 1. use existing APIs just first get packets then
> >    transmit them all then use them all
>
> Looks like current API can not get packets first, it only support get packet
> one by one (if you mean vhost_get_vq_desc()). And used ring updating may get
> more misses in this case.

Right. So if you do

	for (...)
		vhost_get_vq_desc

then later

	for (...)
		vhost_add_used

then you get most of benefits except maybe code cache misses
and copy_to_user.

> > 2. add new APIs and move the loop into vhost core
> >    for more speedups
>
> I don't see any advantages, looks like just need some e.g callbacks in this
> case.
>
> Thanks

IIUC callbacks pretty much destroy the code cache locality advantages,
IP is jumping around too much.

-- 
MST
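
For illustration, the two-loop pattern discussed above (fetch several
descriptors first, then return them to the used ring together) can be
modelled in plain userspace C. This is only a rough sketch under stated
assumptions: struct toy_vq and all toy_* functions are hypothetical
stand-ins, not the vhost API, and the "publish" counter merely marks where
a real implementation would likely need a memory barrier and a write to
the shared used index.

```c
#include <stddef.h>

#define RING_SIZE 8   /* toy ring size (assumption) */
#define BATCH     4   /* hypothetical batch limit */

/* Toy stand-in for a virtqueue; NOT the kernel's struct vhost_virtqueue. */
struct toy_vq {
    unsigned avail[RING_SIZE];  /* descriptor heads offered by the guest */
    unsigned used[RING_SIZE];   /* descriptor heads handed back by the host */
    unsigned avail_idx;         /* guest producer index */
    unsigned last_avail;        /* host consumer index */
    unsigned used_idx;          /* used index visible to the guest */
    unsigned used_publishes;    /* how many times used_idx was published */
};

/* Guest side: offer one descriptor head. */
void toy_guest_add(struct toy_vq *vq, unsigned head)
{
    vq->avail[vq->avail_idx % RING_SIZE] = head;
    vq->avail_idx++;
}

/* Fetch one descriptor head; returns 1 on success, 0 if the ring is empty. */
int toy_get_desc(struct toy_vq *vq, unsigned *head)
{
    if (vq->last_avail == vq->avail_idx)
        return 0;
    *head = vq->avail[vq->last_avail % RING_SIZE];
    vq->last_avail++;
    return 1;
}

/* Write n used entries, then publish the used index once. */
void toy_add_used_n(struct toy_vq *vq, const unsigned *heads, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++)
        vq->used[(vq->used_idx + i) % RING_SIZE] = heads[i];
    /* A real ring would need a write barrier here before the index update. */
    vq->used_idx += n;
    vq->used_publishes++;
}

/* One descriptor at a time: one used-index publish per packet. */
unsigned toy_tx_one_by_one(struct toy_vq *vq)
{
    unsigned head, sent = 0;

    while (toy_get_desc(vq, &head)) {
        /* packet transmission would happen here */
        toy_add_used_n(vq, &head, 1);
        sent++;
    }
    return sent;
}

/* Two loops: collect up to BATCH heads, then return them all together. */
unsigned toy_tx_batched(struct toy_vq *vq)
{
    unsigned heads[BATCH], sent = 0;

    for (;;) {
        unsigned n = 0;

        while (n < BATCH && toy_get_desc(vq, &heads[n]))
            n++;
        if (n == 0)
            break;
        /* transmission of heads[0..n) would happen here */
        toy_add_used_n(vq, heads, n);
        sent += n;
    }
    return sent;
}
```

With six descriptors queued, the one-by-one path publishes the used index
six times while the batched path publishes it twice (a batch of four, then
a batch of two). The sketch says nothing about code cache locality or
copy_to_user(), which a toy model like this cannot capture.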