Re: Passthrough PCI GPU device fails on reboot



On Tue, Jul 27, 2021 at 11:22:25AM +0200, Francesc Guasch wrote:
> Hello.
> I have a host with an NVIDIA RTX 3090. I configured PCI passthrough
> and it works fine. We are using it for CUDA and Matlab on Ubuntu 20.04.
> The problem sometimes comes when rebooting the virtual machine. It doesn't
> happen 100% of the time, but eventually, after 3 or 4 reboots, the PCI
> device stops working. The only solution is to reboot the host.
> The weird thing is that this only happens when rebooting the VM. After a
> host reboot, if we shut down the virtual machine and start it again,
> it works fine. I wrote a small script that does that a hundred times
> just to make sure. Only a reboot triggers the problem.
> When it fails I run "nvidia-smi" in the virtual machine and I get:
>     No devices were found
> Also I spotted some errors in syslog
>    NVRM: installed in this system is not supported by the
>    NVIDIA 460.91.03 driver release.
>    NVRM: GPU 0000:01:01.0: GPU has fallen off the bus
>    NVRM: the NVIDIA kernel module is unloaded.
>    NVRM: GPU 0000:01:01.0: RmInitAdapter failed! (0x23:0x65:1204)
>    NVRM: GPU 0000:01:01.0: rm_init_adapter failed, device minor number 0
> The device is there because typing lspci I can see information:
>     0000:01:01.0 VGA compatible controller [0300]: NVIDIA Corporation
>     Device [10de:2204] (rev a1)
> 	Subsystem: Gigabyte Technology Co., Ltd Device [1458:403b]
> 	Kernel driver in use: nvidia
> 	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
> I tried different Nvidia drivers and Linux kernels in the host and
> the virtual machine with the same results.

This question is better suited for vfio-users@xxxxxxxxxx. Once the GPU is bound
to the vfio-pci driver, it's out of libvirt's hands.
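As a quick sanity check on the host, you can confirm which driver the card is
actually bound to at any given moment by following the sysfs symlink. A minimal
sketch - the host-side address 0000:01:00.0 is an assumption, take the real one
from `lspci -nn` on the host:

```shell
#!/bin/sh
# Print the kernel driver currently bound to a PCI device, given its
# sysfs device directory (e.g. /sys/bus/pci/devices/0000:01:00.0).
bound_driver() {
    link="$1/driver"
    if [ -e "$link" ]; then
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}

# Example (host-side address is an assumption, adjust to your system):
# bound_driver /sys/bus/pci/devices/0000:01:00.0
```

While the VM is running you would expect this to print "vfio-pci"; anything
else (or "none") means the device assignment is not what you think it is.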
AFAIR NVIDIA only enabled PCI device assignment on GeForce cards for Windows 10
VMs, but you appear to be running a Linux VM. Back when I worked on vGPU, which
is supported only on Tesla cards, I remember being told that the host and guest
drivers communicate with each other. Applying the same logic to GeForce, I
would not be surprised if the NVIDIA host driver detected that the
corresponding guest driver is not a Windows 10 one and skipped a proper GPU
reset between VM reboots - hence the need to reboot the host. Not long ago
there was a similar bus reset bug in the AMD host driver which affected every
single VM shutdown/reboot: the host had to be rebooted for the card to become
usable again. Be that as it may, I can only speculate, and since your scenario
is officially not supported by NVIDIA, I wish you the best of luck :)
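If the root cause really is a skipped reset, it might be worth trying to kick
the card with a function-level reset from the host (VM shut down, run as root)
before falling back to a full host reboot. A minimal sketch, assuming the
kernel exposes a sysfs `reset` attribute for the device and that 0000:01:00.0
is the host-side address:

```shell
#!/bin/sh
# Attempt a PCI function-level reset via sysfs. This only works if the
# kernel knows a reset method for the device (FLR, bus reset, ...).
pci_reset() {
    dev="/sys/bus/pci/devices/$1"
    if [ -w "$dev/reset" ]; then
        printf '1\n' > "$dev/reset" && echo "reset issued"
    else
        echo "no usable reset method for $1"
        return 1
    fi
}

# pci_reset 0000:01:00.0   # host-side address is an assumption
```

No guarantee this recovers the GPU - if the driver left the card in a bad
state, even a successful reset may not bring it back - but it is cheap to try.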

