On Mon, Nov 16, 2015 at 7:46 PM, Xie, Huawei <huawei.xie@xxxxxxxxx> wrote:
> On 11/14/2015 7:41 AM, Venkatesh Srinivas wrote:
> > On Wed, Nov 11, 2015 at 02:34:33PM +0200, Michael S. Tsirkin wrote:
> >> On Tue, Nov 10, 2015 at 04:21:07PM -0800, Venkatesh Srinivas wrote:
> >>> Improves cacheline transfer flow of available ring header.
> >>>
> >>> Virtqueues are implemented as a pair of rings, one producer->consumer
> >>> avail ring and one consumer->producer used ring; preceding the
> >>> avail ring in memory are two contiguous u16 fields -- avail->flags
> >>> and avail->idx. A producer posts work by writing to avail->idx and
> >>> a consumer reads avail->idx.
> >>>
> >>> The flags and idx fields only need to be written by a producer CPU
> >>> and only read by a consumer CPU; when the producer and consumer are
> >>> running on different CPUs and the virtio_ring code is structured to
> >>> only have source writes/sink reads, we can continuously transfer the
> >>> avail header cacheline between cores in the 'M' state. This flow
> >>> optimizes core -> core bandwidth on certain CPUs.
> >>>
> >>> (see: "Software Optimization Guide for AMD Family 15h Processors",
> >>> Section 11.6; similar language appears in the 10h guide and should
> >>> apply to CPUs w/ exclusive caches, using LLC as a transfer cache)
> >>>
> >>> Unfortunately the existing virtio_ring code issued reads of
> >>> avail->idx and read-modify-writes of avail->flags on the producer.
> >>>
> >>> This change shadows the flags and index fields in producer memory;
> >>> the vring code now reads from the shadows and only ever writes to
> >>> avail->flags and avail->idx, allowing the cacheline to transfer
> >>> core -> core optimally.
> >> Sounds logical, I'll apply this after a bit of testing
> >> of my own, thanks!
> > Thanks!
>
> Venkatesh:
> Is it that your patch only applies to CPUs w/ exclusive caches?

No --- it is not specific to exclusive caches; it depends on which access
pattern is optimal for the inter-core coherence flow on a given CPU. The
patch helps when that flow is optimized by 'M' -> 'M' transfers and when
producer reads might interfere w/ consumer prefetchw/reads. The AMD
Optimization guides have specific language on this subject, but other
platforms may benefit. (see Intel #'s below)
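To make the before/after access pattern concrete, here is a minimal
sketch of the shadowing idea in plain C. Illustrative only, not the
patch itself: the ring header layout follows the virtio spec, but the
struct/function names and the barrier placement are simplified for the
example.

    #include <stdint.h>

    /* Avail ring header, per the virtio spec: written by the producer
     * (guest driver), read by the consumer (host device). */
    struct vring_avail {
            uint16_t flags;
            uint16_t idx;
            uint16_t ring[];
    };

    struct producer_state {
            struct vring_avail *avail; /* shared cacheline */
            uint16_t flags_shadow;     /* producer-private copies ... */
            uint16_t idx_shadow;       /* ... live on a private line */
            uint16_t num;              /* ring size (power of two) */
    };

    /* Post one descriptor head: all reads hit the private shadows, so
     * the shared line sees stores only and can migrate consumer-ward
     * in 'M' state without bouncing back for a producer-side read. */
    static void producer_post(struct producer_state *p, uint16_t head)
    {
            p->avail->ring[p->idx_shadow & (p->num - 1)] = head;
            /* a write barrier (e.g. smp_wmb()) belongs here in real code */
            p->idx_shadow++;
            p->avail->idx = p->idx_shadow;
    }

    /* Setting VRING_AVAIL_F_NO_INTERRUPT (bit 0) was previously a
     * read-modify-write of avail->flags; now it is an RMW of the
     * shadow followed by a plain store. */
    static void producer_suppress_notify(struct producer_state *p)
    {
            p->flags_shadow |= 0x1;
            p->avail->flags = p->flags_shadow;
    }

Nothing changes for the consumer -- it was already read-only on this
cacheline -- so only the producer grows the two shadow fields.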
> Do you have perf data on Intel CPUs?

Good idea -- I ran some tests on a couple of Intel platforms:

(these are perf data from sample runs; for each I ran many runs, the
numbers were pretty stable except for Haswell-EP cross-socket)

One-socket Intel Xeon W3690 ("Westmere"), 3.46 GHz; core turbo disabled
=======================================================================
(note -- w/ core turbo disabled, performance is _very_ stable; variance
of < 0.5% run-to-run; figure of merit is "seconds elapsed" here)

* Producer / consumer bound to Hyperthread pairs:

 Performance counter stats for './vring_bench_noshadow 1000000000':

     343,425,166,916 L1-dcache-loads
          21,393,148 L1-dcache-load-misses   # 0.01% of all L1-dcache hits
      61,709,640,363 L1-dcache-stores
           5,745,690 L1-dcache-store-misses
      10,186,932,553 L1-dcache-prefetches
               1,491 L1-dcache-prefetch-misses

       121.335699344 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 1000000000':

     334,766,413,861 L1-dcache-loads
          15,787,778 L1-dcache-load-misses   # 0.00% of all L1-dcache hits
      62,735,792,799 L1-dcache-stores
           3,252,113 L1-dcache-store-misses
       9,018,273,596 L1-dcache-prefetches
                 819 L1-dcache-prefetch-misses

       121.206339656 seconds time elapsed

Effectively performance-neutral.

* Producer / consumer bound to separate cores, same socket:

 Performance counter stats for './vring_bench_noshadow 1000000000':

     399,943,384,509 L1-dcache-loads
       8,868,334,693 L1-dcache-load-misses   # 2.22% of all L1-dcache hits
      62,721,376,685 L1-dcache-stores
       2,786,806,982 L1-dcache-store-misses
      10,915,046,967 L1-dcache-prefetches
             328,508 L1-dcache-prefetch-misses

       146.585969976 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 1000000000':

     425,123,067,750 L1-dcache-loads
       6,689,318,709 L1-dcache-load-misses   # 1.57% of all L1-dcache hits
      62,747,525,005 L1-dcache-stores
       2,496,274,505 L1-dcache-store-misses
       8,627,873,397 L1-dcache-prefetches
             146,729 L1-dcache-prefetch-misses

       142.657327765 seconds time elapsed

2.6% reduction in runtime; note that L1-dcache-load-misses reduced
dramatically -- roughly 2 billion(!) L1d misses saved.

Two-socket Intel Sandy Bridge(-EP) Xeon, 2.6 GHz; core turbo disabled
=====================================================================

* Producer / consumer bound to Hyperthread pairs:

 Performance counter stats for './vring_bench_noshadow 100000000':

      37,129,070,402 L1-dcache-loads
           6,416,246 L1-dcache-load-misses   # 0.02% of all L1-dcache hits
       6,207,794,675 L1-dcache-stores
           2,800,094 L1-dcache-store-misses

        17.029790809 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 100000000':

      36,799,559,391 L1-dcache-loads
          10,241,080 L1-dcache-load-misses   # 0.03% of all L1-dcache hits
       6,312,252,458 L1-dcache-stores
           2,742,239 L1-dcache-store-misses

        16.941001709 seconds time elapsed

Effectively performance-neutral.

* Producer / consumer bound to separate cores, same socket:

 Performance counter stats for './vring_bench_noshadow 100000000':

      27,684,883,046 L1-dcache-loads
         809,933,091 L1-dcache-load-misses   # 2.93% of all L1-dcache hits
       6,219,598,352 L1-dcache-stores
           1,758,503 L1-dcache-store-misses

        15.020511218 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 100000000':

      28,092,111,012 L1-dcache-loads
         716,687,011 L1-dcache-load-misses   # 2.55% of all L1-dcache hits
       6,290,821,211 L1-dcache-stores
           1,565,583 L1-dcache-store-misses

        15.208420297 seconds time elapsed

Effectively performance-neutral.

* Producer / consumer bound to separate cores, cross socket:

(Sandy Bridge-EP appears to have less cross-socket variance than Haswell-EP)

 Performance counter stats for './vring_bench_noshadow 100000000':

      35,857,245,449 L1-dcache-loads
         821,746,755 L1-dcache-load-misses   # 2.29% of all L1-dcache hits
       6,252,551,550 L1-dcache-stores
           4,665,405 L1-dcache-store-misses

        46.340035651 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 100000000':

      39,044,022,857 L1-dcache-loads
         711,731,527 L1-dcache-load-misses   # 1.82% of all L1-dcache hits
       6,349,051,557 L1-dcache-stores
           4,292,362 L1-dcache-store-misses

        42.593259436 seconds time elapsed

Runtimes for the cross-socket test have somewhat higher variance, but the
pattern in counts of L1-dcache-loads and L1-dcache-load-misses for noshadow
vs. shadow code is very stable. noshadow (w/o this patch) reliably clocks
in at ~46 seconds; shadow ranges from ~48 to ~42 seconds (-2.8% to +8.0%).
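(An aside on methodology, since the producer/consumer binding is what
distinguishes these cases: both sides are pinned with hard CPU affinity.
The benchmark source isn't reproduced here, but assuming a pthreads-style
harness, the binding boils down to something like the sketch below; the
CPU numbers are per-machine placeholders read off the topology in
/sys/devices/system/cpu/.)

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single logical CPU; returns 0 on
     * success or an error number from pthread_setaffinity_np(). */
    static int pin_self_to_cpu(int cpu)
    {
            cpu_set_t set;

            CPU_ZERO(&set);
            CPU_SET(cpu, &set);
            return pthread_setaffinity_np(pthread_self(),
                                          sizeof(set), &set);
    }

    /* "Hyperthread pairs" = two sibling logical CPUs of one core;
     * "separate cores, same socket" = two cores in one package;
     * "cross socket" = one logical CPU from each package. */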
Two-socket Intel Haswell(-EP) Xeon, 2.3 GHz; core turbo disabled
================================================================

* Producer / consumer bound to Hyperthread pairs:

 Performance counter stats for './vring_bench_noshadow 10000000000':

     474,856,463,271 L1-dcache-loads
          74,223,784 L1-dcache-load-misses   # 0.02% of all L1-dcache hits
      87,274,898,671 L1-dcache-stores
          31,869,448 L1-dcache-store-misses

       243.290969318 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 10000000000':

     466,891,993,302 L1-dcache-loads
          80,859,208 L1-dcache-load-misses   # 0.02% of all L1-dcache hits
      88,760,627,355 L1-dcache-stores
          35,727,720 L1-dcache-store-misses

       242.146970822 seconds time elapsed

Effectively performance-neutral.

* Producer / consumer bound to separate cores, same socket:

 Performance counter stats for './vring_bench_noshadow 10000000000':

     357,657,891,797 L1-dcache-loads
       8,760,549,978 L1-dcache-load-misses   # 2.45% of all L1-dcache hits
      87,357,651,103 L1-dcache-stores
          10,166,431 L1-dcache-store-misses

       229.733047436 seconds time elapsed

 Performance counter stats for './vring_bench_shadow 10000000000':

     382,508,881,516 L1-dcache-loads
       8,348,013,630 L1-dcache-load-misses   # 2.18% of all L1-dcache hits
      88,756,639,931 L1-dcache-stores
           9,842,999 L1-dcache-store-misses

       230.850697668 seconds time elapsed

Effectively performance-neutral.

* Producer / consumer bound to separate cores, different sockets:

Unfortunately I don't have useful numbers for this case -- even with core
turbo disabled, runtime variance is very high (10 - 30% run-to-run).

> For the perf metric you provide, why not L1-dcache-load-misses which is
> more meaningful?

L1-dcache-load-misses is a better metric, you're right; for the original
AMD Piledriver run I posted:

 Performance counter stats for './vring_bench_noshadow':

       5,451,082,016 L1-dcache-loads
          31,690,398 L1-dcache-load-misses
          60,288,052 L1-dcache-stores
          60,517,840 LLC-loads
               9,726 LLC-load-misses

         2.221477739 seconds time elapsed

 Performance counter stats for './vring_bench_shadow':

       5,405,701,361 L1-dcache-loads
          31,157,235 L1-dcache-load-misses
          59,172,380 L1-dcache-stores
          59,398,269 LLC-loads
              10,944 LLC-load-misses

         2.168405376 seconds time elapsed

There is a 1.6% reduction in L1-dcache-load-misses, which lines up with
the roughly 2% reduction in runtime.

Summary:
* No workload on Westmere 1S, Sandy Bridge 2S, or Haswell 2S got worse;
* Westmere 1S cross-core improved by ~2.5% reliably;
* Sandy Bridge 2S cross-core cross-socket may have improved
  (cross-socket run variance makes it hard to tell);
* AMD Piledriver tests improved by ~2%;
* Other virtio implementations (over PCIe, for example) should benefit.

HTH,
-- vs;
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization