On 12/03/2019 20:39, Daniel Henrique Barboza wrote:
>
>
> On 3/12/19 1:05 AM, Alexey Kardashevskiy wrote:
>>
>> On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
>>>
>>> On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
>>>> On 08/03/2019 04:51, Piotr Jaroszynski wrote:
>>>>> On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
>>>>>> On 3/7/19 12:15 PM, Erik Skultety wrote:
>>>>>>> On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza
>>>>>>> wrote:
>>>>>>>> The NVLink2 support in QEMU implements the detection of NVLink2
>>>>>>>> capable devices by verifying the attributes of the VFIO mem region
>>>>>>>> QEMU allocates for the NVIDIA GPUs. To properly allocate an
>>>>>>>> adequate amount of memLock, Libvirt needs this information before
>>>>>>>> a QEMU instance is even created.
>>>>>>>>
>>>>>>>> An alternative is presented in this patch. Given a PCI device,
>>>>>>>> we'll traverse the device tree at /proc/device-tree to check if
>>>>>>>> the device has a NPU bridge, retrieve the node of the NVLink2 bus,
>>>>>>>> find the memory-node that is related to the bus and see if it's a
>>>>>>>> NVLink2 bus by inspecting its 'reg' value. This logic is contained
>>>>>>>> inside the 'device_is_nvlink2_capable' function, which uses other
>>>>>>>> new helper functions to navigate and fetch values from the device
>>>>>>>> tree nodes.
>>>>>>>>
>>>>>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx>
>>>>>>>> ---
>>>>>>> This fails with a bunch of compilation errors, please make sure you
>>>>>>> run make
>>>>>>> check and make syntax-check on each patch of the series.
>>>>>> Ooops, forgot to follow up make syntax-check with 'make' before
>>>>>> submitting ... my bad.
>>>>>>
>>>>>>>>  src/qemu/qemu_domain.c | 194
>>>>>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  1 file changed, 194 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>>>>>>>> index 77548c224c..97de5793e2 100644
>>>>>>>> --- a/src/qemu/qemu_domain.c
>>>>>>>> +++ b/src/qemu/qemu_domain.c
>>>>>>>> @@ -10343,6 +10343,200 @@
>>>>>>>> qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>>>>>>>>  }
>>>>>>>>
>>>>>>>>
>>>>>>>> +/**
>>>>>>>> + * Reads a phandle file and returns the phandle value.
>>>>>>>> + */
>>>>>>>> +static int
>>>>>>>> +read_dt_phandle(const char* file)
>>>>>>> ..we prefer camelCase naming of functions...all functions should
>>>>>>> have a prefix,
>>>>>>> e.g. vir,<driver>, etc.
>>>>>>>
>>>>>>>> +{
>>>>>>>> +    unsigned int buf[1];
>>>>>>>> +    size_t read;
>>>>>>>> +    FILE *f;
>>>>>>>> +
>>>>>>>> +    f = fopen(file, "r");
>>>>>>>> +    if (!f)
>>>>>>>> +        return -1;
>>>>>>>> +
>>>>>>>> +    read = fread(buf, sizeof(unsigned int), 1, f);
>>>>>>> We already have safe helpers that take care of such operations.
>>>>>>>
>>>>>>>> +
>>>>>>>> +    if (!read) {
>>>>>>>> +        VIR_CLOSE(f);
>>>>>>> You could use VIR_AUTOCLOSE for this
>>>>>>>
>>>>>>>> +        return 0;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    VIR_CLOSE(f);
>>>>>>>> +    return be32toh(buf[0]);
>>>>>>> Is this part of gnulib? We need to make sure it's portable
>>>>>>> otherwise we'd need
>>>>>>> to make the conversion ourselves, just like for le64toh
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Reads a memory reg file and returns the first 4 int values.
>>>>>>>> + *
>>>>>>>> + * The caller is responsible for freeing the returned array.
>>>>>>>> + */
>>>>>>>> +static unsigned int *
>>>>>>>> +read_dt_memory_reg(const char *file)
>>>>>>>> +{
>>>>>>>> +    unsigned int *buf;
>>>>>>>> +    size_t read, i;
>>>>>>>> +    FILE *f;
>>>>>>>> +
>>>>>>>> +    f = fopen(file, "r");
>>>>>>>> +    if (!f)
>>>>>>>> +        return NULL;
>>>>>>>> +
>>>>>>>> +    if (VIR_ALLOC_N(buf, 4) < 0)
>>>>>>>> +        return NULL;
>>>>>>>> +
>>>>>>>> +    read = fread(buf, sizeof(unsigned int), 4, f);
>>>>>>>> +
>>>>>>>> +    if (!read && read < 4)
>>>>>>>> +        /* shouldn't happen */
>>>>>>>> +        VIR_FREE(buf);
>>>>>>>> +    else for (i = 0; i < 4; i++)
>>>>>>>> +        buf[i] = be32toh(buf[i]);
>>>>>>>> +
>>>>>>>> +    VIR_CLOSE(f);
>>>>>>>> +    return buf;
>>>>>>> Same comments as above...
>>>>>>>
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * This wrapper function receives arguments to be used in a
>>>>>>>> + * 'find' call to retrieve the file names that match
>>>>>>>> + * the criteria inside the /proc/device-tree dir.
>>>>>>>> + *
>>>>>>>> + * A 'find' call with '-iname phandle' inside /proc/device-tree
>>>>>>>> + * provides more than a thousand matches. Adding '-path' to
>>>>>>>> + * narrow it down further is necessary to keep the file
>>>>>>>> + * listing sane.
>>>>>>>> + *
>>>>>>>> + * The caller is responsible for freeing the buffer returned by
>>>>>>>> + * this function.
>>>>>>>> + */
>>>>>>>> +static char *
>>>>>>>> +retrieve_dt_files_pattern(const char *path_pattern, const char
>>>>>>>> *file_pattern)
>>>>>>>> +{
>>>>>>>> +    virCommandPtr cmd = NULL;
>>>>>>>> +    char *output = NULL;
>>>>>>>> +
>>>>>>>> +    cmd = virCommandNew("find");
>>>>>>>> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path",
>>>>>>>> path_pattern,
>>>>>>>> +                         "-iname", file_pattern, NULL);
>>>>>>>> +    virCommandSetOutputBuffer(cmd, &output);
>>>>>>>> +
>>>>>>>> +    if (virCommandRun(cmd, NULL) < 0)
>>>>>>>> +        VIR_FREE(output);
>>>>>>>> +
>>>>>>>> +    virCommandFree(cmd);
>>>>>>>> +    return output;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * Helper function that receives a listing of file names and
>>>>>>>> + * calls read_dt_phandle() on each one looking for a match
>>>>>>>> + * with the given phandle argument. Returns the file name if a
>>>>>>>> + * match is found, NULL otherwise.
>>>>>>>> + */
>>>>>>>> +static char *
>>>>>>>> +find_dt_file_with_phandle(char *files, int phandle)
>>>>>>>> +{
>>>>>>>> +    char *line, *tmp;
>>>>>>>> +    int ret;
>>>>>>>> +
>>>>>>>> +    line = strtok_r(files, "\n", &tmp);
>>>>>>>> +    do {
>>>>>>>> +        ret = read_dt_phandle(line);
>>>>>>>> +        if (ret == phandle)
>>>>>>>> +            break;
>>>>>>>> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
>>>>>>>> +
>>>>>>>> +    return line;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +
>>>>>>>> +/**
>>>>>>>> + * This function receives a string that represents a PCI device,
>>>>>>>> + * such as '0004:04:00.0', and tells if the device is NVLink2
>>>>>>>> capable.
>>>>>>>> + *
>>>>>>>> + * The logic goes as follows:
>>>>>>>> + *
>>>>>>>> + * 1 - get the phandle of a nvlink of the device, reading the
>>>>>>>> 'ibm,npu'
>>>>>>>> + *     attribute;
>>>>>>>> + * 2 - find the device tree node of the nvlink bus using the
>>>>>>>> phandle
>>>>>>>> + *     found in (1)
>>>>>>>> + * 3 - get the phandle of the memory region of the nvlink bus
>>>>>>>> + * 4 - find the device tree node of the memory region using the
>>>>>>>> + *     phandle found in (3)
>>>>>>>> + * 5 - read the 'reg' value of the memory region. If the value of
>>>>>>>> + *     the second 64 bit value is 0x02 0x00, the device is attached
>>>>>>>> + *     to a NVLink2 bus.
>>>>>>>> + *
>>>>>>>> + * If any of these steps fails, the function returns false.
>>>>>>>> + */
>>>>>>>> +static bool
>>>>>>>> +device_is_nvlink2_capable(const char *device)
>>>>>>>> +{
>>>>>>>> +    char *file, *files, *tmp;
>>>>>>>> +    unsigned int *reg;
>>>>>>>> +    int phandle;
>>>>>>>> +
>>>>>>>> +    if ((virAsprintf(&file,
>>>>>>>> "/sys/bus/pci/devices/%s/of_node/ibm,npu",
>>>>>>>> +                     device)) < 0)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    /* Find phandles of nvlinks: */
>>>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    /* Find a DT node for the phandle found */
>>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/pci*",
>>>>>>>> "phandle");
>>>>>>>> +    if (!files)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) ==
>>>>>>>> NULL)
>>>>>>>> +        goto fail;
>>>>>>>> +
>>>>>>>> +    /* Find a phandle of the GPU memory region of the device. The
>>>>>>>> +     * file found above ends with '/phandle' - the memory region
>>>>>>>> +     * of the GPU ends with '/memory-region */
>>>>>>>> +    tmp = strrchr(file, '/');
>>>>>>>> +    *tmp = '\0';
>>>>>>>> +    file = strcat(file, "/memory-region");
>>>>>>>> +
>>>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>>> +        goto fail;
>>>>>>>> +
>>>>>>>> +    file = NULL;
>>>>>>>> +    VIR_FREE(files);
>>>>>>>> +
>>>>>>>> +    /* Find the memory node for the phandle found above */
>>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/memory*",
>>>>>>>> "phandle");
>>>>>>>> +    if (!files)
>>>>>>>> +        return false;
>>>>>>>> +
>>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) ==
>>>>>>>> NULL)
>>>>>>>> +        goto fail;
>>>>>>>> +
>>>>>>>> +    /* And see its size in the second 64bit value of 'reg'. First,
>>>>>>>> +     * the end of the file needs to be changed from '/phandle' to
>>>>>>>> +     * '/reg' */
>>>>>>>> +    tmp = strrchr(file, '/');
>>>>>>>> +    *tmp = '\0';
>>>>>>>> +    file = strcat(file, "/reg");
>>>>>>>> +
>>>>>>>> +    reg = read_dt_memory_reg(file);
>>>>>>>> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
>>>>>>>> +        return true;
>>>>>>> This function is very complex just to find out whether a PCI device
>>>>>>> is capable
>>>>>>> of NVLink or not. Feels wrong to do it this way, I believe it would
>>>>>>> be much
>>>>>>> easier if NVIDIA exposed a sysfs attribute saying whether a PCI
>>>>>>> device supports
>>>>>>> NVLink so that our node-device driver would take care of this
>>>>>>> during libvirtd
>>>>>>> startup and then you'd only call a single API from the PPC64 helper
>>>>>>> you're
>>>>>>> introducing in the next patch to find out whether you need the
>>>>>>> alternative
>>>>>>> formula or not.
>>>>>>>
>>>>>>> Honestly, apart from the coding style issues, I don't like this
>>>>>>> approach and
>>>>>>> unless there's a really compelling reason for libvirt to do it in a
>>>>>>> way which
>>>>>>> involves spawning a 'find' process because of a complex pattern and
>>>>>>> a bunch of
>>>>>>> data necessary to filter out, I'd really suggest contacting NVIDIA
>>>>>>> about this.
>>>>>>> It's the same for mdevs, NVIDIA exposes a bunch of attributes in
>>>>>>> sysfs which
>>>>>>> we're able to read.
>>>>>> I'll contact NVIDIA and see if there is an easier way (a sysfs
>>>>>> attribute, for
>>>>>> example) and, if it doesn't, try to provide a plausible reason to
>>>>>> justify this
>>>>>> detection code.
>>>>> Sorry for the delay in responding. The problem is that all the V100
>>>>> GPUs
>>>>> support NVLink, but it may or may not be connected up. This is
>>>>> detected
>>>>> at runtime during GPU initialization, which seems like much too
>>>>> heavy of
>>>>> an operation to perform as part of passthrough initialization. And
>>>>> that's
>>>>> why vfio-pci pieces rely on device tree information to figure it out.
>>>>>
>>>>> Alexey, would it be possible for vfio-pci to export the information
>>>>> in a
>>>>> way more friendly to libvirt?
>>>> The only information needed here is whether a specific GPU has RAM or
>>>> not. This can easily be found from the device-tree, imho quite friendly
>>>> already. VFIO gets to know about these new capabilities when the VFIO
>>>> PCI device is opened, and we rather want to avoid going that far in
>>>> libvirt (open a VFIO container, attach a group, get a vfio-pci fd from
>>>> it, enumerate regions - 2 PCI resets on the way, delays, meh).
>>>>
>>>> btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be
>>>> passed too or the entire RAM thing won't work. "find" for the memory
>>>> node can also be dropped really - if NVLink bridge OF node has
>>>> "memory-region", then VFIO will most likely expose RAM and QEMU will
>>>> try
>>>> using it anyway.
>>>
>>> Hmmm I am not sure I understood it completely ..... seeing the
>>> of_nodes exposed in sysfs of one of the NVLink buses we have:
>>>
>>>
>>> /sys/bus/pci/devices/0007:00:00.0/of_node$ ls
>>> class-code ibm,gpu ibm,nvlink-speed memory-region reg
>>> device-id ibm,loc-code ibm,pci-config-space-type
>>> name revision-id
>>> ibm,device-tgt-addr ibm,nvlink interrupts phandle vendor-id
>>>
>>>
>>> We can make a safe assumption that the V100 GPU will always be passed
>>> through with at least one NVLink2 bus.
>> Assumption #1: if we want GPU RAM to be available in the guest, an
>> NVLink bridge must be passed through. No bridge - no RAM - no rlimit
>> adjustment.
>>
>>
>>> How many hops do we need to assert
>>> that a given device is a NVLink2 bus from its of_node info?
>>>
>>> For example, can we say something like:
>>>
>>> "The device node of this device has ibm,gpu and ibm,nvlink and
>>> ibm,nvlink-speed
>>> and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not
>>
>> ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
>>
>>
>>> possible to pass
>>> through the bus alone, there is a V100 GPU in this same IOMMU group. So
>>> this is
>>> a NVLink2 passthrough scenario"
>>>
>>>
>>> Or perhaps:
>>>
>>> "It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has
>>> a reg
>>> value of 0x20, thus this is a nvlink2 bus and ....."
>>
>> 0x20 is a size of the GPU RAM window which might change in different
>> revisions of the hw as well.
>>
>>> Both alternatives are way simpler than what I'm doing in this patch. I
>>> am not sure
>>> if they are valid though.
>>
>> Assumption #2: if the nvlink2 bridge's DT node has memory-region, then
>> the GPU will try to use its RAM. The adjustment you'll have to make
>> won't depend on anything as the existing GPU RAM placement is so that
>> whatever you can possibly throw at QEMU will fit 128TiB window and we
>> cannot make the window smaller anyway (next size would lower power of
>> two - 64TiB - and GPU RAM lives just above that).
>
> Ok, so, from the of_node of an unknown device that has been
> passed through to a VM, if the DT node has:
>
> ibm,gpu
> ibm,nvlink
> memory-region
>
>
> Is that enough to justify the rlimit adjustment for NVLink2?

This is the idea, yes.
To be on the safe side, you might want to limit the check to the IBM
vendor ID (not even sure about the device id) and PCI bridges only.
>
>
>>
>> I could do better of course and allow adjusting the limit to cover let's
>> say 66TiB instead of 128TiB, it just seems to be unnecessary
>> complication (we already allocate only required amount of TCE entries on
>> demand).
>>
>>
>>
>>>>>>> Thinking about it a bit more, since this is NVIDIA-specific,
>>>>>>> having an
>>>>>>> NVIDIA-only sysfs attribute doesn't help the node-device driver I
>>>>>>> mentioned
>>>>>>> above. In general, we try to avoid introducing vendor-specific
>>>>>>> code, so nodedev
>>>>>>> might not be the right place to check for the attribute (because
>>>>>>> it's not
>>>>>>> universal and would require vendor-specific elements), but I think
>>>>>>> we could
>>>>>>> have an NVLink helper reading this sysfs attribute(s) elsewhere in
>>>>>>> libvirt
>>>>>>> codebase.
>>>>>>>
>>>>>>> Erik
>>
>>
>

--
Alexey
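
For illustration only, a minimal sketch of the simplified check discussed
above might look like the following. This is not the actual patch: the
function name is made up, plain POSIX calls are used instead of the libvirt
helpers Erik pointed to, the of_node property names come from the sysfs
listing quoted earlier, and the vendor comparison assumes Alexey's
suggestion to restrict the check to the IBM vendor ID (0x1014); his
additional idea of limiting it to PCI bridges is left out for brevity.

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* IBM's PCI vendor ID as sysfs prints it ("0x1014"). */
#define IBM_PCI_VENDOR "0x1014"

static bool
deviceLooksLikeNVLink2Bridge(const char *pciAddr)
{
    static const char *props[] = { "ibm,gpu", "ibm,nvlink", "memory-region" };
    char path[PATH_MAX];
    char vendor[16] = "";
    FILE *f;
    size_t i;

    /* Only consider IBM devices (the emulated NPU bridges). */
    if (snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/vendor",
                 pciAddr) >= (int) sizeof(path))
        return false;
    if (!(f = fopen(path, "r")))
        return false;
    if (!fgets(vendor, sizeof(vendor), f)) {
        fclose(f);
        return false;
    }
    fclose(f);
    if (strncmp(vendor, IBM_PCI_VENDOR, strlen(IBM_PCI_VENDOR)) != 0)
        return false;

    /* The bridge only matters for the memlock limit if all three
     * of_node properties discussed above are present. */
    for (i = 0; i < sizeof(props) / sizeof(props[0]); i++) {
        if (snprintf(path, sizeof(path),
                     "/sys/bus/pci/devices/%s/of_node/%s",
                     pciAddr, props[i]) >= (int) sizeof(path))
            return false;
        if (access(path, F_OK) != 0)
            return false;
    }

    return true;
}

Whatever shape the final check takes, it only answers the yes/no question
of whether the passed-through device looks like an NVLink2 bridge with GPU
RAM behind it; the memlock adjustment itself still lives in the PPC64
helper introduced in the next patch of the series.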