Re: Passthrough device memory throughput issue

On 2018-05-18 07:00, Alex Williamson wrote:
Hi Geoff,

On Wed, 16 May 2018 20:45:22 +1000
geoff@xxxxxxxxxxxxxxx wrote:

Hi All,

I have been working on making LookingGlass (LG)
(https://github.com/gnif/LookingGlass) viable to use inside a Linux
guest for Windows -> Linux frame streaming. Currently LG works very well
when running native on the host streaming frames from a Windows guest
via an IVSHMEM device, however when the client is running inside a Linux
guest we seem to be hitting a memory performance wall.

To check my understanding, a Windows VM with assigned GPU is scraping
frames and writing them into the shared memory BAR of an ivshmem
device.  If the client is the host Linux, we can read frames out of the
shared memory space fast enough, but if the client is Linux in another
VM then we hit a performance issue in reading the frames out of the
shared memory BAR.  In the former case, the client is reading from the
shared memory area directly, while in the latter it's through the
ivshmem device.

Absolutely correct


Before I continue here is my hardware configuration:

   ThreadRipper 1950X in NUMA mode.
   GeForce 1080Ti passed through to a Windows guest
   AMD Vega 56 passed through to a Linux guest

Both Windows and Linux guests are bound to the same NUMA node, as are
their memory allocations. Memory copy performance in the Linux guest
matches native memory copy performance at ~40GB/s. Windows copy
performance is slower, seeming to hit a wall at ~14GB/s. That is slower
than I would have expected, but it is a separate issue; suffice to say
it's plenty fast enough for what I am trying to accomplish here.

Copy performance here is between two buffers in VM RAM though, right?
The performance issue is copies from an MMIO BAR, which should have
some sort of uncacheable mapping.  I guess we assume write performance
to the MMIO BAR is not the issue as that's static between the test
cases.

Again, correct. Windows Guest IVSHMEM BAR -> Linux Guest IVSHMEM BAR.
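
For anyone wanting to reproduce figures like these, timing one large
copy is enough. A sketch of such a helper (bench_copy is a hypothetical
name, and the source can be either an ordinary buffer in guest RAM or
the mmap'd ivshmem BAR):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: time a single large memcpy out of 'src' and
   report MB/s. 'src' can be an ordinary malloc'd buffer or the pointer
   returned by mmap'ing the ivshmem BAR. A single pass is crude, but it
   is enough to show the kind of gap being discussed here. */
static void bench_copy(const void *src, size_t len, const char *label)
{
    void *dst = malloc(len);
    struct timespec a, b;
    double sec;

    clock_gettime(CLOCK_MONOTONIC, &a);
    memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &b);

    sec = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    printf("%s: %.1f MB/s\n", label, len / (sec * 1e6));
    free(dst);
}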


Windows is feeding captured frames at 1920x1200 @ 32bpp into the IVSHMEM
virtual device, which the Linux guest is using as its input. The data
transfer rate matches the ~14GB/s above; with each frame being
1920x1200x4 bytes (~8.8MiB), that in theory allows for over 1,600
frames per second. But when I take this buffer and try to feed it to
the AMD Vega, I see an abysmal transfer rate of ~131MB/s (~15fps).
Copying the shared memory into an intermediate buffer before feeding
the data to the GPU doesn't make a difference.

What does "feeding" from ivshmem to GPU look like here?  Is this a user
process in the Linux VM mmap'ing both the ivshmem MMIO BAR and the GPU
BAR and doing a memcpy() directly between them?  Is there a driver in
the Linux guest for the ivshmem device or are we just enabling MMIO in
the command register and mmap'ing the resource file via sysfs?

It is a user space application that is performing an mmap of the sysfs resource, but to eliminate the possibility of incorrect caching I also wrote a small UIO kernel module, which made no difference.

See: https://github.com/gnif/LookingGlass/tree/master/module
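
Roughly, the user space mapping amounts to the following (the PCI
address is an example, not the exact device in my setup; BAR2 is the
ivshmem shared memory region):

/* Minimal sketch: map the ivshmem shared-memory BAR via its sysfs
   resource file. Assumes the device has already been enabled (e.g.
   by writing 1 to its sysfs 'enable' attribute). */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:00:10.0/resource2";
    struct stat st;
    void *map;

    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0 || fstat(fd, &st) < 0) {
        perror(path);
        return 1;
    }

    /* This mapping is what the client reads frames out of. */
    map = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... read frame data from 'map' here ... */

    munmap(map, st.st_size);
    close(fd);
    return 0;
}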


Now one might think the rendering code is at fault here (and it might
be); however, if I instead dynamically create the buffer for each frame
I do not see any performance issues. For example, the below will
generate a vertical repeating gradient that scrolls horizontally.

/* Synthetic test frame: a repeating gradient that scrolls as 'offset'
   increments on each call. frameSize and render() are the same as in
   the real path; only the source of the pixel data differs. */
static int offset = 0;
char * data = malloc(frameSize);
for(int i = 0; i < frameSize; ++i)
   data[i] = (char)(i + offset);
++offset;
render(data, frameSize);
free(data);

This is trying to prove that the copy to the GPU in the Linux guest is
not at fault, right?  It seems though that that path is the same
regardless of whether the client is in the host or another guest, the
GPU is assigned in the latter case, but there's plenty of evidence that
we can use a GPU at near native performance in a guest already.  So,
despite the subject, I'm not sure how the passthrough/assigned device
plays a factor here, I think the question is around Linux reading data
out of the ivshmem MMIO BAR.  Is that actually affected by the presence
of an assigned device in the client VM?  We should be able to test
reading out of the ivshmem device regardless of an assigned device.

I must have been half asleep when I initially tested this; it is indeed very slow to read from the ivshmem device itself, so this is not GPU related at all.
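
Just to be explicit about where the time goes, the path in the client
boils down to the sketch below (the names are placeholders, not the
actual LookingGlass symbols); whether the renderer is handed the BAR
mapping directly or a staging copy, the read out of the ivshmem mapping
itself is the bottleneck:

#include <stdlib.h>
#include <string.h>

/* 'shm' is the pointer returned by mmap'ing the ivshmem BAR and
   render() stands in for the existing client render call. */
extern void render(const char *data, size_t size);

static void feed_frame(const char *shm, size_t frameOffset,
                       size_t frameSize)
{
    /* Staging the frame in ordinary RAM first makes no difference;
       the copy out of the BAR itself runs at ~131MB/s. */
    char *staging = malloc(frameSize);
    memcpy(staging, shm + frameOffset, frameSize);
    render(staging, frameSize);
    free(staging);

    /* Handing the BAR mapping to render() directly is just as slow:
       render(shm + frameOffset, frameSize); */
}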


Is there something I am overlooking with regard to this buffer
transfer? Could DMA be rejecting the buffer because it's not in system
RAM but a memory-mapped BAR? And if so, why doesn't copying it to an
intermediate buffer first help?

If DMA were the issue, I'd expect "work" vs "not work", not "slow" vs
"fast".  I'm a bit confused how an intermediate buffer could improve
the situation here, your evidence suggests that we can write data out
to the client GPU at sufficient speed, it's getting data out of the
ivshmem BAR that's the problem.  We still need to fill an intermediate
buffer from the ivshmem BAR and we should be able to measure the
performance of that independent of ultimately writing the data into
the GPU on the client.  In fact, I'm not really sure where any DMA is
occurring, afaict we're not issuing a command to ivshmem asking it to
write some range of shared memory to system RAM, we're reading it
directly from the MMIO space of the device.  That's just regular
programmed I/O from the CPU governed by the CPUs memory attributes for
the given address space.  Of course it would be really bad if these
PIOs trapped out to QEMU, have you verified we're not hitting
memory_region_dispatch_read() while the client is accessing the MMIO
BAR?  Thanks,

No, I am sorry, but I am fairly new to the world of x86 hardware at this level; much of this I am learning as I go.
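
One way to check this from inside the guest would be to time individual
reads from the mapped BAR. My understanding is that a read which traps
out to QEMU should cost on the order of tens of thousands of cycles,
while a plain uncached access should be far cheaper. A rough sketch
('bar' being the mmap'd resource, name is a placeholder):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

/* Rough per-read cost probe. If each 4-byte read costs tens of
   thousands of cycles, the access is likely trapping out of the
   guest; a direct uncached read should be much cheaper. */
static void probe_read_cost(volatile uint32_t *bar, int count)
{
    uint64_t total = 0;

    for (int i = 0; i < count; ++i) {
        uint64_t t0 = __rdtsc();
        (void)bar[i];                /* one 4-byte read from the BAR */
        total += __rdtsc() - t0;
    }
    printf("avg cycles per read: %llu\n",
           (unsigned long long)(total / count));
}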


Alex


