Re: [PATCH v4 1/2] qemu_domain: NVLink2 bridge detection function for PPC64

Daniel Henrique Barboza <danielhb413@xxxxxxxxx> · Tue, 2 Apr 2019 17:27:55 -0300

On 4/2/19 5:34 AM, Erik Skultety wrote:
On Tue, Mar 12, 2019 at 06:55:49PM -0300, Daniel Henrique Barboza wrote:
The NVLink2 support in QEMU implements the detection of NVLink2
capable devices by verifying the attributes of the VFIO mem region
QEMU allocates for the NVIDIA GPUs. To properly allocate an
adequate amount of memLock, Libvirt needs this information before
a QEMU instance is even created, thus querying QEMU is not
possible and opening a VFIO window is too much.

An alternative is presented in this patch. Making the following
assumptions:

- if we want GPU RAM to be available in the guest, an NVLink2 bridge
must be passed through;

- an unknown PCI device can be classified as a NVLink2 bridge
if its device tree node has 'ibm,gpu', 'ibm,nvlink',
'ibm,nvlink-speed' and 'memory-region'.
Alexey mentioned that it should be enough to check for the properties ^above.
I'm just wondering, knowing this is IBM's PPC8/9 if the assumptions we have
made are going to stay with further revisions of PPC, NVLink and GPUs, IOW
we need to be sure "ibm,nvlink" won't be renamed with further revisions, e.g.
other cards than V100 in the future, because then compatibility and revision
selection comes into the picture.

Perhaps Alexey or Piotr can comment on this. I can't confirm that the
device node naming will remain as is in the long run.


This patch introduces a helper called @ppc64VFIODeviceIsNV2Bridge
that checks the device tree node of a given PCI device and
check if it meets the criteria to be a NVLink2 bridge. This
Just out of curiosity, what about NVLink 1.0? Apart from performance, I wasn't
able to find something useful in terms of compatibility, is there something to
consider, since we're only relying on NVLink 2.0?

NVLink1, as far as Libvirt goes, works similar to a regular GPU passthrough.
There is no changes in the memory topology in QEMU that needs extra
code to adjust rlimit.

I am not entirely sure if the code above will not generate a false-positive,
mistakenly detecting NV2 scenario for a NV1 GPU. I think the 'memory-region'
attribute won't be present in a NV1 bridge. Piotr, can you comment here?




new function will be used in a follow-up patch that, using the
first assumption, will set up the rlimits of the guest
accordingly.

Signed-off-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
---
  src/qemu/qemu_domain.c | 29 +++++++++++++++++++++++++++++
  1 file changed, 29 insertions(+)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index 1659e88478..dcc92d253c 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -10398,6 +10398,35 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
  }


+/**
+ * ppc64VFIODeviceIsNV2Bridge:
+ * @device: string with the PCI device address
+ *
+ * This function receives a string that represents a PCI device,
+ * such as '0004:04:00.0', and tells if the device is a NVLink2
+ * bridge.
+ */
+static bool
+ppc64VFIODeviceIsNV2Bridge(const char *device)
+{
+    const char *nvlink2Files[] = {"ibm,gpu", "ibm,nvlink",
+                                  "ibm,nvlink-speed", "memory-region"};
Like I said above, if ^this is not to change, then I'm okay in principle, still
I feel like this needs David's ACK (putting him on CC)

Codewise, Peter already provided you with comments.

Erik


+    char *file;
+    size_t i;
+
+    for (i = 0; i < 4; i++) {
+        if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/%s",
+                         device, nvlink2Files[i])) < 0)
+            return false;
+
+        if (!virFileExists(file))
+            return false;
+    }
+
+    return true;
+}
+
+
  /**
   * getPPC64MemLockLimitBytes:
   * @def: domain definition
--
2.20.1

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list