Re: [PATCH] virtio_ring: Shadow available ring flags & index

"Xie, Huawei" <huawei.xie@xxxxxxxxx> · Tue, 17 Nov 2015 03:46:24 +0000



On 11/14/2015 7:41 AM, Venkatesh Srinivas wrote:
> On Wed, Nov 11, 2015 at 02:34:33PM +0200, Michael S. Tsirkin wrote:
>> On Tue, Nov 10, 2015 at 04:21:07PM -0800, Venkatesh Srinivas wrote:
>>> Improves cacheline transfer flow of available ring header.
>>>
>>> Virtqueues are implemented as a pair of rings, one producer->consumer
>>> avail ring and one consumer->producer used ring; preceding the
>>> avail ring in memory are two contiguous u16 fields -- avail->flags
>>> and avail->idx. A producer posts work by writing to avail->idx and
>>> a consumer reads avail->idx.
>>>
>>> The flags and idx fields only need to be written by a producer CPU
>>> and only read by a consumer CPU; when the producer and consumer are
>>> running on different CPUs and the virtio_ring code is structured to
>>> only have source writes/sink reads, we can continuously transfer the
>>> avail header cacheline between 'M' states between cores. This flow
>>> optimizes core -> core bandwidth on certain CPUs.
>>>
>>> (see: "Software Optimization Guide for AMD Family 15h Processors",
>>> Section 11.6; similar language appears in the 10h guide and should
>>> apply to CPUs w/ exclusive caches, using LLC as a transfer cache)
>>>
>>> Unfortunately the existing virtio_ring code issued reads to the
>>> avail->idx and read-modify-writes to avail->flags on the producer.
>>>
>>> This change shadows the flags and index fields in producer memory;
>>> the vring code now reads from the shadows and only ever writes to
>>> avail->flags and avail->idx, allowing the cacheline to transfer
>>> core -> core optimally.
>> Sounds logical, I'll apply this after a  bit of testing
>> of my own, thanks!
> Thanks!
Venkatesh:
Is it that your patch only applies to CPUs w/ exclusive caches? Do you
have perf data on Intel CPUs?
For the perf metric you provide, why not L1-dcache-load-misses which is
more meaning full?
>
>>> In a concurrent version of vring_bench, the time required for
>>> 10,000,000 buffer checkout/returns was reduced by ~2% (average
>>> across many runs) on an AMD Piledriver (15h) CPU:
>>>
>>> (w/o shadowing):
>>>  Performance counter stats for './vring_bench':
>>>      5,451,082,016      L1-dcache-loads
>>>      ...
>>>        2.221477739 seconds time elapsed
>>>
>>> (w/ shadowing):
>>>  Performance counter stats for './vring_bench':
>>>      5,405,701,361      L1-dcache-loads
>>>      ...
>>>        2.168405376 seconds time elapsed
>> Could you supply the full command line you used
>> to test this?
> Yes --
>
> perf stat -e L1-dcache-loads,L1-dcache-load-misses,L1-dcache-stores \
> 	./vring_bench
>
> The standard version of vring_bench is single-threaded (posted on this list
> but never submitted); my tests were with a version that has a worker thread
> polling the VQ. How should I share it? Should I just attach it to an email
> here?
>
>>> The further away (in a NUMA sense) virtio producers and consumers are
>>> from each other, the more we expect to benefit. Physical implementations
>>> of virtio devices and implementations of virtio where the consumer polls
>>> vring avail indexes (vhost) should also benefit.
>>>
>>> Signed-off-by: Venkatesh Srinivas <venkateshs@xxxxxxxxxx>
>> Here's a similar patch for the ring itself:
>> https://lkml.org/lkml/2015/9/10/111
>>
>> Does it help you as well?
> I tested your patch in our environment; our virtqueues do not support
> Indirect entries and your patch does not manage to elide many writes, so I
> do not see a performance difference. In an environment with Indirect, your
> patch will likely be a win.
>
> (My patch gets most of its win by eliminating reads on the producer; when
> the producer reads avail fields at the same time the consumer is polling,
> we see cacheline transfers that hurt performance. Your patch eliminates
> writes, which is nice, but our tests w/ polling are not as sensitive to
> writes from the producer.)
>
> I have two quick comments on your patch --
> 1) I think you need to kfree vq->avail when deleting the virtqueue.
>
> 2) Should we avoid allocating a cache for virtqueues that are not
>    performance critical? (ex: virtio-scsi eventq/controlq, virtio-net
>    controlq)
>
> Should I post comments in reply to the original patch email (given that it
> is ~2 months old)?
>
> Thanks!
> -- vs;
>

_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization