> "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> > > What other shared TX/RX locks are there? In your setup, is the same
> > > macvtap socket structure used for RX and TX? If yes this will create
> > > cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> > > there might also be contention on the lock in sk_sleep waitqueue.
> > > Anything else?
> >
> > The patch is not introducing any locking (both vhost and virtio-net).
> > The single stream drop is due to different vhost threads handling the
> > RX/TX traffic.
> >
> > I added a heuristic (fuzzy) to determine if more than one flow
> > is being used on the device, and if not, use vhost[0] for both
> > tx and rx (vhost_poll_queue figures this out before waking up
> > the suitable vhost thread). Testing shows that single stream
> > performance is as good as the original code.
>
> ...
>
> > This approach works nicely for both single and multiple stream.
> > Does this look good?
> >
> > Thanks,
> >
> > - KK
>
> Yes, but I guess it depends on the heuristic :) What's the logic?

I define how recently a txq was used. If 0 or 1 txq's were used
recently, use vq[0] (which also handles rx). Otherwise, use
multiple txq (vq[1-n]). The code is:

/*
 * Algorithm for selecting vq:
 *
 * Condition                                    Return
 * RX vq                                        vq[0]
 * If all txqs unused                           vq[0]
 * If one txq used, and new txq is same         vq[0]
 * If one txq used, and new txq is different    vq[vq->qnum]
 * If > 1 txqs used                             vq[vq->qnum]
 *      Where "used" means the txq was used in the last 'n' jiffies.
 *
 * Note: locking is not required as an update race will only result in
 * a different worker being woken up.
 */
static inline struct vhost_virtqueue *vhost_find_vq(struct vhost_poll *poll)
{
	if (poll->vq->qnum) {
		struct vhost_dev *dev = poll->vq->dev;
		struct vhost_virtqueue *vq = &dev->vqs[0];
		unsigned long max_time = jiffies - 5; /* Some macro needed */
		unsigned long *table = dev->jiffies;
		int i, used = 0;

		for (i = 0; i < dev->nvqs - 1; i++) {
			if (time_after_eq(table[i], max_time) && ++used > 1) {
				vq = poll->vq;
				break;
			}
		}
		table[poll->vq->qnum - 1] = jiffies;
		return vq;
	}

	/* RX is handled by the same worker thread */
	return poll->vq;
}

void vhost_poll_queue(struct vhost_poll *poll)
{
	struct vhost_virtqueue *vq = vhost_find_vq(poll);

	vhost_work_queue(vq, &poll->work);
}

Since poll batches packets, find_vq does not seem to add much
to the CPU utilization (or BW). I am sure that code can be
optimized much better.

The results I sent in my last mail were without your use_mm
patch, and the only tuning was to make vhost threads run on
only cpus 0-3 (though the performance is good even without
that). I will test it later today with the use_mm patch too.

Thanks,

- KK
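
For anyone who wants to try the selection rule outside the kernel, below
is a minimal user-space sketch of it. It is not the patch code: struct
dev, find_vq(), NVQS, WINDOW and the explicit "now" argument are
hypothetical stand-ins for vhost_dev, vhost_find_vq(), dev->nvqs, the
hard-coded 5 and jiffies, and the time_after_eq() macro here is a local
approximation of the kernel one. It returns a queue index rather than a
vhost_virtqueue pointer.

```c
#include <assert.h>

#define NVQS   5	/* hypothetical: vq[0] is RX, vq[1..4] are TX */
#define WINDOW 5	/* a txq is "used" if seen in the last WINDOW ticks */

/* Local stand-in for the kernel macro; safe across counter wraparound. */
#define time_after_eq(a, b) ((long)((a) - (b)) >= 0)

struct dev {
	unsigned long last_used[NVQS - 1];	/* last tick each txq was seen */
};

/*
 * Return the index of the worker to wake for queue 'qnum' at time 'now'.
 * RX (qnum 0) always maps to worker 0. TX maps to worker 0 unless more
 * than one txq was used within the window, in which case it maps to its
 * own worker.
 */
int find_vq(struct dev *d, int qnum, unsigned long now)
{
	unsigned long oldest = now - WINDOW;
	int i, used = 0, pick = 0;

	assert(qnum >= 0 && qnum < NVQS);
	if (qnum == 0)
		return 0;

	for (i = 0; i < NVQS - 1; i++) {
		if (time_after_eq(d->last_used[i], oldest) && ++used > 1) {
			pick = qnum;
			break;
		}
	}

	/* Record this txq only after the scan, as in the code above. */
	d->last_used[qnum - 1] = now;
	return pick;
}
```

With a zeroed table and "now" starting well above WINDOW, repeated calls
for txq 1 keep returning 0. Because the table is updated after the scan,
the first call for a second txq also returns 0, and only the calls after
that return the caller's own index. After an idle gap longer than WINDOW
the result falls back to 0.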