Re: drivers/pci: (and/or KVM): Slow PCI initialization during VM boot with passthrough of large BAR Nvidia GPUs on DGX H100

Thanks.

I'm thinking about the cleanest way to accomplish this:

1. I'm wondering if replacing the pci_info() calls with equivalent
printk_deferred() calls might be sufficient here. This works in my
initial test, but I'm not sure whether that is definitive proof that
we wouldn't hit issues in any deployment, or whether my configuration
just isn't susceptible to this kind of deadlock. (See the first
sketch below.)

2. I also drafted a patch that eliminates the redundancy outright and
disables the impacted logs by default, allowing them to be re-enabled
with a new kernel command line option "pci=bar_logging_enabled" (at
the cost of the performance gains from the reduced redundancy). This
works well in all of my tests. (See the second sketch below.)
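
For reference, the gist of (1) is just this kind of substitution at
the end of __pci_read_base() - the pci_info() format string below is
the existing one (the pci_err() calls would get the same treatment),
and printk_deferred() queues the message for a later flush instead of
printing synchronously from the caller's context:

        /* drivers/pci/probe.c: end of __pci_read_base() */

        /* was: pci_info(dev, "reg 0x%x: %pR\n", pos, res); */
        printk_deferred(KERN_INFO "%s: reg 0x%x: %pR\n",
                        pci_name(dev), pos, res);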
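
And the gist of (2) - the option string is the one from my draft, the
flag variable name is made up for illustration, and the parsing just
hangs another branch off the existing "pci=" handling in pci_setup()
in drivers/pci/pci.c:

        /* drivers/pci/pci.c (other options and the header declaration
         * of the new flag elided) */
        bool pci_bar_logging_enabled;   /* BAR logs now off by default */

        static int __init pci_setup(char *str)
        {
                while (str) {
                        char *k = strchr(str, ',');

                        if (k)
                                *k++ = 0;
                        if (*str && (str = pcibios_setup(str)) && *str) {
                                if (!strcmp(str, "bar_logging_enabled"))
                                        pci_bar_logging_enabled = true;
                                /* ... existing options ... */
                        }
                        str = k;
                }
                return 0;
        }
        early_param("pci", pci_setup);

        /* drivers/pci/probe.c - gate the per-BAR prints */
        if (pci_bar_logging_enabled)
                pci_info(dev, "reg 0x%x: %pR\n", pos, res);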

Do you think either of those approaches would work / be appropriate?
Ultimately, I'm trying to avoid messy changes that would require
actually propagating all of the info needed for these logs back up to
pci_read_bases(), if at all possible, since there seems to be no
obvious way to do that without changing the signature of
__pci_read_base() or tracking additional state.
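
For context, the shape of the change I'm trying to avoid would look
roughly like the sketch below: hoisting the command-register toggle
out of __pci_read_base() into pci_read_bases(), per Alex's earlier
suggestion, which then forces the end-of-__pci_read_base() logging to
move up along with whatever state it needs. Everything beyond the
existing functions and macros is hypothetical, and the ROM BAR
handling is omitted:

        /* drivers/pci/probe.c - hypothetical restructuring; assumes
         * the per-BAR decode toggle inside __pci_read_base() is gone */
        static void pci_read_bases(struct pci_dev *dev,
                                   unsigned int howmany, int rom)
        {
                unsigned int pos, reg;
                u16 orig_cmd = 0;

                /* Disable decode once per device, not once per BAR */
                if (!dev->mmio_always_on) {
                        pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
                        if (orig_cmd & PCI_COMMAND_DECODE_ENABLE)
                                pci_write_config_word(dev, PCI_COMMAND,
                                        orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);
                }

                for (pos = 0; pos < howmany; pos++) {
                        struct resource *res = &dev->resource[pos];

                        reg = PCI_BASE_ADDRESS_0 + (pos << 2);
                        pos += __pci_read_base(dev, pci_bar_unknown,
                                               res, reg);
                }

                /* Re-enable decode before any printing */
                if (orig_cmd & PCI_COMMAND_DECODE_ENABLE)
                        pci_write_config_word(dev, PCI_COMMAND, orig_cmd);

                /*
                 * The pci_info()/pci_err() calls from __pci_read_base()
                 * would have to happen here instead, which means passing
                 * their pos/res/error details back up - the state
                 * propagation problem described above.
                 */
        }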

-Mitchell Augustin


On Tue, Dec 3, 2024 at 1:20 PM Alex Williamson
<alex.williamson@xxxxxxxxxx> wrote:
>
> On Mon, 2 Dec 2024 13:36:25 -0600
> Mitchell Augustin <mitchell.augustin@xxxxxxxxxxxxx> wrote:
>
> > Thanks!
> >
> > This approach makes sense to me - the only concern I have is that I
> > see this restriction in a comment in __pci_read_base():
> >
> > `/* No printks while decoding is disabled! */`
> >
> > At the end of __pci_read_base(), we do have several pci_info() and
> > pci_err() calls - so I think we would also need to print that info
> > one level up, after the new decode enable, if we do decide to move
> > the decode disable/enable one level up. Let me know if you agree,
> > or if there is a more straightforward alternative that I'm missing.
>
> Nope, I agree.  The console might be the device we've disabled for
> sizing or might be downstream of that device.  The logging would need
> to be deferred until the device is enabled.  Thanks,
>
> Alex
>
> > On Wed, Nov 27, 2024 at 11:22 AM Alex Williamson
> > <alex.williamson@xxxxxxxxxx> wrote:
> > >
> > > On Tue, 26 Nov 2024 19:12:35 -0600
> > > Mitchell Augustin <mitchell.augustin@xxxxxxxxxxxxx> wrote:
> > >
> > > > Thanks for the breakdown!
> > > >
> > > > > That alone calls __pci_read_base() three separate times, each time
> > > > > disabling and re-enabling decode on the bridge. [...] So we're
> > > > > really being bitten that we toggle decode-enable/memory enable
> > > > > around reading each BAR size
> > > >
> > > > That makes sense to me. Is this something that could theoretically be
> > > > done in a less redundant way, or is there some functional limitation
> > > > that would prevent that or make it inadvisable? (I'm still new to pci
> > > > subsystem debugging, so apologies if that's a bit vague.)
> > >
> > > The only requirement is that decode should be disabled while sizing
> > > BARs; the fact that we repeat it around each BAR is, I think, just the
> > > way the code is structured.  It doesn't take into account that toggling
> > > the command register bit is not a trivial operation in a virtualized
> > > environment.  IMO we should push the command register manipulation up a
> > > layer so that we only toggle it once per device rather than once per
> > > BAR.  Thanks,
> > >
> > > Alex
> > >
> >
> >
>


-- 
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering




