On 19.04.24 at 17:19, Ilpo Järvinen wrote:
On Thu, 18 Apr 2024, Dag B wrote:
On 18.04.2024 14:24, Christian König wrote:
On 18.04.24 at 12:42, Dag B wrote:
[SNIP]
Is there a good ELI13 resource explaining how resizable BAR works in
Linux?
My current kernel command-line contains: pci=assign-busses,realloc
That's a really, really bad idea. The "assign-busses" flag was introduced
to get 20-year-old laptops to see their CardBus PCI devices.
I threw a lot of mud at the wall to see what stuck. Removing it now did
not make a big difference.
Removing realloc prevents the second TB3 GPU from being initialized, so
keeping that for now.
That's really interesting. Why does it fail without that?
It basically means that your BIOS is somehow broken and only the Linux PCI
subsystem is able to assign resources correctly.
Please provide the output of "sudo lspci -v" and "sudo lspci -tv" as file
attachment (*not* inline in a mail!).
In case I have expressed myself awkwardly, the realloc=off case appears to
make the device driver have issues with the second GPU.
I have attached both outputs, for realloc=off.
Not knowing what message sizes are considered acceptable on this m/l, I
uploaded the same for realloc=on, as well as output from dmesg for both cases
to:
https://github.com/dagbdagb/p53
If the m/l has mechanisms to archive attachments and replace them with links,
I'll redo the exercise in a follow-up email. I understand the value of having
the 'context' of the discussion readily available in one place.
The mem BAR & bridge window configuration is identical between
realloc=on/off.
The error seems to relate to io BAR:
[ 2.782439] nvidia 0000:09:00.0: BAR 5 [io 0x0000-0x007f]: not claimed; can't enable device
[ 2.783139] NVRM: pci_enable_device failed, aborting
With realloc=on, the entire IO window is disabled for the bridges and for
some reason nvidia doesn't abort in that case.
That actually makes a lot of sense.
At least on AMD hardware the IO window is only used for VGA emulation
and I strongly suspect it's the same on the NVIDIA GPUs.
So what basically happens is that the BIOS for some reason enables the
IO range on both GPUs, whereas Linux disables the ranges when it does
the re-alloc. Most likely that is because the Linux PCI code knows that
they should only be used if the device is the primary VGA device used
during boot.
Now when pci_enable_device() is called, the function checks whether all
enabled BARs actually have resources. Without realloc=on the I/O BAR
has nothing allocated and the function fails, while with realloc=on the
BAR is disabled and the check is skipped for it.
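As a rough illustration of that check, here is a simplified Python model.
It is only a sketch of the behaviour described above, not the kernel's
actual code; the Bar class, the can_enable() helper and the example BAR
lists are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Bar:
    index: int
    kind: Optional[str]  # "io", "mem", or None when the BAR is disabled
    claimed: bool        # True if the range sits inside a parent window


def can_enable(bars: List[Bar]) -> Tuple[bool, str]:
    """Fail on the first enabled BAR that has no claimed resource."""
    for bar in bars:
        if bar.kind is None:
            # Disabled BARs are skipped, so they cannot cause a failure.
            continue
        if not bar.claimed:
            return False, f"BAR {bar.index} [{bar.kind}]: not claimed; can't enable device"
    return True, "ok"


# Hypothetical states of the second GPU in the two cases discussed here.
realloc_off = [Bar(1, "mem", True), Bar(5, "io", False)]  # IO BAR enabled, unclaimed
realloc_on = [Bar(1, "mem", True), Bar(5, None, False)]   # IO BAR disabled

print(can_enable(realloc_off))
print(can_enable(realloc_on))
```

In this model the realloc=off case fails on BAR 5, which matches the
shape of the dmesg line quoted above, while the realloc=on case passes
because the disabled BAR is never inspected.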
Well, what a mess. @Dag I would strongly suggest checking whether you
can update the BIOS. What happens here is clearly incorrect.
Regarding the resizing, as far as I can see the BIOS allocates only a
single 1GiB window to the upstream bridge; that is most likely way too
small for anything other than the default 256MiB BAR.
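To make the arithmetic concrete, a small Python sketch follows. It
assumes a single window, power-of-two BAR sizes and natural alignment,
and the sizes of the GPU's other BARs are hypothetical placeholders, not
values taken from the attached lspci output.

```python
MIB = 1 << 20
GIB = 1 << 30


def window_needed(bar_sizes):
    """Bytes needed to pack naturally aligned power-of-two BARs, largest first."""
    offset = 0
    for size in sorted(bar_sizes, reverse=True):
        if size <= 0 or size & (size - 1):
            raise ValueError("BAR sizes must be powers of two")
        # Round the offset up to the BAR's own size (natural alignment).
        offset = (offset + size - 1) & ~(size - 1)
        offset += size
    return offset


def fits(bar_sizes, window=1 * GIB):
    """True if all BARs can be placed inside a bridge window of this size."""
    return window_needed(bar_sizes) <= window


# Hypothetical sizes for the GPU's remaining memory BARs.
other_bars = [16 * MIB, 32 * MIB]

for big_bar in (256 * MIB, 1 * GIB, 8 * GIB):
    verdict = "fits" if fits([big_bar] + other_bars) else "does not fit"
    print(f"{big_bar // MIB} MiB BAR: {verdict} in a 1 GiB window")
```

Under these assumptions only the 256MiB case fits; a 1GiB BAR already
fills the whole window and leaves no room for the device's other BARs.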
Maybe try to force-assign more address space to this bridge. IIRC one of
the kernel parameters could be used for that, but offhand I don't
remember the syntax.
Regards,
Christian.