On Sun, 5 Jan 2025 17:36:15 +0000
<ankita@xxxxxxxxxx> wrote:

> From: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> In contrast to Grace Hopper systems, the HBM training has been moved
> out of the UEFI on the Grace Blackwell systems. This reduces the system
> bootup time significantly.
>
> The onus of checking whether the HBM training has completed thus falls
> on the module.
>
> The HBM training status can be determined from a BAR0 register.
> Similarly, another BAR0 register exposes the status of the CPU-GPU
> chip-to-chip (C2C) cache coherent interconnect.
>
> Based on testing, 30s is determined to be sufficient to ensure
> initialization completion on all the Grace based systems. Thus poll
> these register and check for 30s. If the HBM training is not complete
> or if the C2C link is not ready, fail the probe.
>
> While the time is not required on Grace Hopper systems, it is
> beneficial to make the check to ensure the device is in an
> expected state. Hence keeping it generalized to both the generations.
>
> Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx>
> ---
>  drivers/vfio/pci/nvgrace-gpu/main.c | 53 +++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 44a276c886e1..cf020496743e 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -5,6 +5,7 @@
>
>  #include <linux/sizes.h>
>  #include <linux/vfio_pci_core.h>
> +#include <linux/delay.h>
>
>  /*
>   * The device memory usable to the workloads running in the VM is cached
> @@ -28,6 +29,13 @@
>
>  #define GPU_CAP_DVSEC_REGISTER 3
>
> +#define C2C_LINK_BAR0_OFFSET 0x1498
> +#define HBM_TRAINING_BAR0_OFFSET 0x200BC
> +#define STATUS_READY 0xFF
> +
> +#define POLL_QUANTUM_MS 1000
> +#define POLL_TIMEOUT_MS (30 * 1000)
> +
>  /*
>   * The state of the two device memory region - resmem and usemem - is
>   * saved as struct mem_region.
> @@ -848,6 +856,47 @@ static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
>  	return false;
>  }
>
> +/*
> + * To reduce the system bootup time, the HBM training has
> + * been moved out of the UEFI on the Grace-Blackwell systems.
> + *
> + * The onus of checking whether the HBM training has completed
> + * thus falls on the module. The HBM training status can be
> + * determined from a BAR0 register.
> + *
> + * Similarly, another BAR0 register exposes the status of the
> + * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
> + *
> + * Poll these register and check for 30s. If the HBM training is
> + * not complete or if the C2C link is not ready, fail the probe.
> + *
> + * While the wait is not required on Grace Hopper systems, it
> + * is beneficial to make the check to ensure the device is in an
> + * expected state.
> + */
> +static int nvgrace_gpu_check_device_status(struct pci_dev *pdev)

"nvgrace_gpu_wait_device_ready()"?

> +{
> +	void __iomem *io;
> +	int time_elasped;
> +
> +	io = pci_iomap(pdev, 0, ~0UL);

The documentation is unclear here, but existing code suggests passing 0
here rather than -1 to map the full BAR.  It ends up being equivalent
since the code doesn't error attempting to map longer than the BAR, but
there's no reason to add a bad example.

> +	if (!io)
> +		return -ENOMEM;
> +
> +	for (time_elasped = 0; time_elasped < POLL_TIMEOUT_MS;
> +	     time_elasped += POLL_QUANTUM_MS) {
> +		if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> +		    (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> +			pci_iounmap(pdev, io);
> +			return 0;
> +		}
> +		msleep(POLL_QUANTUM_MS);
> +	}

time_after() would simplify things here.  I'd also suggest a common
exit path.

> +
> +	pci_iounmap(pdev, io);
> +	return -ENODEV;

ETIME could work for the error code too.
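
To illustrate the time_after() and common exit path suggestions, here is
a rough userspace model of the loop shape.  It is only a sketch and not
driver code: model_time_after() mirrors the kernel's wraparound-safe
comparison, while struct model_dev, model_read_status() and
model_wait_device_ready() are hypothetical stand-ins for jiffies, the two
ioread32() checks and the probe helper.  Advancing the fake clock stands
in for msleep(POLL_QUANTUM_MS).

```c
#include <errno.h>
#include <stdint.h>

#define STATUS_READY 0xFF

/* Userspace copy of the kernel's time_after(): true if a is later than b,
 * and still correct when the tick counter wraps around. */
#define model_time_after(a, b) ((long)((b) - (a)) < 0)

/* Hypothetical device model: a fake tick counter plus the tick at which
 * the status registers start reading STATUS_READY. */
struct model_dev {
	unsigned long now;      /* stands in for jiffies */
	unsigned long ready_at; /* tick at which the device becomes ready */
};

/* Stands in for the combined C2C link and HBM training register reads. */
static uint32_t model_read_status(const struct model_dev *d)
{
	return (long)(d->now - d->ready_at) >= 0 ? STATUS_READY : 0;
}

/* Poll until ready or until the deadline passes, with a single exit. */
static int model_wait_device_ready(struct model_dev *d,
				   unsigned long timeout_ticks,
				   unsigned long quantum)
{
	unsigned long timeout = d->now + timeout_ticks;
	int ret = -ETIME;

	do {
		if (model_read_status(d) == STATUS_READY) {
			ret = 0;
			goto out;
		}
		d->now += quantum; /* real code would msleep() here */
	} while (!model_time_after(d->now, timeout));

out:
	/* Common exit path: real code would pci_iounmap() exactly once here. */
	return ret;
}
```

With a single exit label, the unmap happens in one place regardless of
whether the poll succeeds or times out, and the deadline comparison no
longer depends on a hand-maintained elapsed-time counter.
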
Thanks,
Alex

> +}
> +
>  static int nvgrace_gpu_probe(struct pci_dev *pdev,
>  			     const struct pci_device_id *id)
>  {
> @@ -856,6 +905,10 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
>  	u64 memphys, memlength;
>  	int ret;
>
> +	ret = nvgrace_gpu_check_device_status(pdev);
> +	if (ret)
> +		return ret;
> +
>  	ret = nvgrace_gpu_fetch_memory_property(pdev, &memphys, &memlength);
>  	if (!ret)
>  		ops = &nvgrace_gpu_pci_ops;