On Sat, May 6, 2023 at 12:08 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
>
> > I am fine with either of these two names for the param. Adjusting from
> > enable_msi_remaping to disable_msi_bypass was aimed at addressing
> > your comment about dealing with devices that do not support bypass.
> > In the vmd driver, the bypass feature is enabled by adding the flag
> > "VMD_FEAT_CAN_BYPASS_MSI_REMAP", so I think "disabling bypass" seems
> > more appropriate. This patch aims to provide a convenient way to
> > remove that flag in a specific case.
>
> Users don't care about the name of VMD_FEAT_CAN_BYPASS_MSI_REMAP. I
> don't think that's a very good name either (in my opinion
> "VMD_FEAT_MSI_REMAP_DISABLE" would be more descriptive, and
> VMCONFIG_MSI_REMAP is even worse since setting it *disables* MSI
> remapping), but in any case these are internal to the driver.
>
> > On Sat, Apr 29, 2023 at 2:40 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > > The "disable_msi_bypass" parameter name also leads to some complicated
> > > logic. IIUC, "disable_msi_bypass=0" means "do not disable MSI remap
> > > bypassing" or, in other words, "do not remap MSIs." This is only
> > > supported by some VMDs. Using "disable_msi_bypass=0" to *enable* the
> > > bypass feature is confusing.
> >
> > However, as you said, it does lead to some complicated logic, so I
> > believe these two approaches each have their own pros and cons.
> >
> > > I still don't understand what causes the performance problem here. I
> > > guess you see higher performance when the VMD remaps child MSIs? So
> > > adding the VMD MSI-X domain interrupt handler and squashing all the
> > > child MSI vectors into the VMD MSI vector space makes things better?
> > > That seems backwards. Is this because of cache effects or something?
> > >
> > > What does "excessive pressure on the PCIe node" mean? I assume the
> > > PCIe node means the VMD? It receives the same number of child
> > > interrupts in either case.
> >
> > What I mean is that there are performance issues when a PCIe domain
> > is fully loaded with four Gen4 NVMe devices, like this:
> >
> > +-[10002:00]-+-01.0-[01]----00.0 device0
> > |            +-03.0-[02]----00.0 device1
> > |            +-05.0-[03]----00.0 device2
> > |            \-07.0-[04]----00.0 device3
> >
> > According to perf and irqtop, the OS does receive the same number of
> > interrupts in both modes. However, in the configuration above we
> > observed a significant increase in CPU idle time. Moreover, the data
> > and performance with the VMD bypass feature enabled were identical to
> > those with VMD disabled, and after reducing the number of devices
> > mounted on a domain, the IOPS of the individual devices rebounds
> > somewhat. Therefore, we speculate that VMD plays a role in balancing
> > and buffering the interrupt load, and in this situation we believe
> > VMD ought not to be bypassed for MSI handling.
>
> The proposed text was:
>
>   Use this when multiple NVMe devices are mounted on the same PCIe
>   node with a high volume of 4K random I/O. It mitigates excessive
>   pressure on the PCIe node caused by numerous interrupts from NVMe
>   drives, resulting in improved I/O performance. Such as:
>
> The NVMe configuration and workload you mentioned works better with
> MSI-X remapping. But I don't know *why*, and I don't know that NVMe
> is the only device affected. So it's hard to write useful guidance
> for users, other than "sometimes it helps."
>
> Straw man proposal:
>
>   msi_remap=0
>
>     Disable VMD MSI-X remapping, which improves interrupt performance
>     because child device interrupts avoid the VMD MSI-X domain
>     interrupt handler. Not all VMD devices support this, but for
>     those that do, this is the default.
>
>   msi_remap=1
>
>     Remap child MSI-X interrupts into VMD MSI-X interrupts. This
>     limits the number of MSI-X vectors available to the whole child
>     device domain to the number of VMD MSI-X interrupts.
>
>     This may be required in some virtualization scenarios.
>
>     This may improve performance in some I/O topologies, e.g., several
>     NVMe devices doing lots of random I/O, but we don't know why.
>
> I hate the idea of "we don't know why." If you *do* know why, it
> would be much better to outline the mechanism because that would help
> users know when to use this. But if we don't know why, admitting that
> straight out is better than hand-waving about excessive pressure, etc.

I completely agree with you. I discovered this issue by bisecting and
drew the above conclusions from the observed data and experiments, so
the explanation is deduced from results rather than from a known
mechanism. I have not yet been able to determine the precise cause.

> The same applies to the virtualization caveat. The text above is not
> actionable -- how do users know whether their particular
> virtualization scenario requires this option?

I will add a note to make it clear that bypass mode does not work in
guest passthrough mode and that only remapping mode can be used there.
To make the straw man concrete, I have also put a rough sketch of how I
would wire it into vmd_probe() at the end of this mail.

Thanks
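
P.S. Below is an illustrative sketch only, not a proposed patch, of how
I picture the knob fitting into drivers/pci/controller/vmd.c. The
parameter name "msi_remap", the tri-state default, and the
vmd_adjust_features() helper are my assumptions for the sake of
discussion; vmd_probe() would call the helper on the features value it
already reads from id->driver_data.

/* Hypothetical module parameter: -1 keeps the per-device default,
 * 0 requests MSI-X remap bypass where supported, 1 forces remapping.
 */
static int msi_remap = -1;
module_param(msi_remap, int, 0444);
MODULE_PARM_DESC(msi_remap,
        "VMD MSI-X remapping (-1=per-device default, 0=bypass where supported, 1=remap)");

/* Hypothetical helper, called from vmd_probe() before the feature
 * flags are acted on; "features" comes from id->driver_data there.
 */
static unsigned long vmd_adjust_features(struct pci_dev *dev,
                                         unsigned long features)
{
        if (msi_remap == 1) {
                /* Remapping is always possible: drop the bypass flag. */
                features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;
        } else if (msi_remap == 0 &&
                   !(features & VMD_FEAT_CAN_BYPASS_MSI_REMAP)) {
                /* Bypass requested, but this VMD cannot do it. */
                pci_warn(dev, "MSI-X remap bypass not supported, remapping child MSI-X\n");
        }
        return features;
}

Either name ends up clearing the same VMD_FEAT_CAN_BYPASS_MSI_REMAP
flag; the difference is only what users have to type, which is where
"disable_msi_bypass=0" turns into the double negative you pointed out.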