RE: [PATCH net-next v5 0/2] Introduce auxiliary bus IRQs sysfs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Greg,


> From: Parav Pandit
> Sent: Monday, June 3, 2024 12:41 PM
> 
> Hi Greg,
> 
> An emergency has turned up for Shay. So, he is going to be out of office for
> few days.
> I will attempt to answer below questions and spin v6 based on your
> guidance.
> 
> > From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> > Sent: Thursday, May 30, 2024 8:21 PM
> >
> > On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote:
> > >
> > >
> > > On 28/05/2024 20:57, Greg KH wrote:
> > > > External email: Use caution opening links or attachments
> > > >
> > > >
> > > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote:
> > > > > Today, PCI PFs and VFs, which are anchored on the PCI bus,
> > > > > display their IRQ information in the
> > > > > <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions
> > > > > (SFs) are similar to PFs and VFs and these SFs are anchored on
> > > > > the auxiliary bus. However, these PCI SFs lack such IRQ
> > > > > information on the auxiliary bus, leaving users without
> > > > > visibility into which IRQs are used by the SFs. This absence
> > > > > makes it impossible to debug situations and to understand the source
> of interrupts/SFs for performance tuning and debug.
> > > >
> > > > Wait, again, this feels wrong.  You should be able to walk back up
> > > > the tree see the irq for the device, and vf, right?  Why would the
> > > > value be different down in the aux device?
> > >
> > >
> > > you are correct, the IRQs of the aux device are subset of the IRQs
> > > of the parent device. more detailed answer bellow.
> >
> > But isn't that already shown in /proc/interrupts?  Why show it here as well?
> >
> Because /proc/interrupts cannot distinguish which irqs are for which SFs.
> More below.
> 
> > > > Does the msi irq somehow not actually show anywhere for the real
> > > > pci
> > device in sysfs at all today?
> > > >
> > > > What does sysfs look like today exactly for this information?
> > >
> > >
> > > The IRQ information of all the children SFs of a PF is found in the
> > > PF device as one single list.
> > > There is no sane way to distinguish which IRQ is used by which SFs.
> > > For example, in a machine with a single child SF of the PF 00:0b.0
> > > we currently have the bellow:
> > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
> > > 58
> > >
> > > the above are IRQs of both the PF and its child SF. from here we
> > > cannot tell which IRQ is used by the child SF.
> >
> > Does that matter?
> Yes it matters. A user needs to see if a given IRQ is for SF or a PF and if it is for
> SF, is it for which SF.
> It needs to know this to set/modify the irq affinity via
> /proc/irq/<irq_num>/affinity_files.
> 
> > Seriously, what is userspace going to do with that that it can not
> > already determine from /proc/interrupts/ or /proc/irq/?
> >
> User space cannot determine if a given IRQ is for SF or not from
> /proc/interrupts and /proc/irq either.
> Let me explain with example.
> It cannot figure out for two reasons.
> 
> 1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below.
> mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0.
> Format is: "purpose, event q number @ device_name".
> Purpose = mlx5_compl (completions)
> Event q number = 62
> Device = pci:0000:06.00.0.
> 
> Can we do same for SFs?
> For example, as,
> mlx5_compl62@auxiliary:mlx5_core.sf.2
> 
> Will it work when SFs share the IRQ? No. it won't be accurate name when a
> vector is shared among many SFs.
> Also updating IRQ names dynamically to arbitrary long name won't be any
> good either.
> 
> 2. mlx5 do not use IRQF_SHARED.
> This is the main reason due to which, /proc/irq/<irq_num>/devices_list is
> missing SFs auxiliary device entries.
> 
> We cannot share the NIC interrupt with any other arbitrary system device
> because nic SF needs to have reasonably predictable latency.
> 
> Shared interrupts also complicate the mlx5 driver flow even further as it can
> fire right away.
> And we want to avoid the already complex path further, with device recovery
> flow in this area.
> 
> Secondly, a while back also when team played with IRQF_SHARED there was
> 8% or so perf drop too.
> So mlx5 uses IRQ sharing among its own interrupt generators.
> This is uniformly across PF, VF, SFs.
> It also has certain good thresholds where it avoids sharing the IRQ and
> allocates new dynamic one.
> Thanks to Thomas's work.
> 
> It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ subsystem
> can offer APIs to restrict IRQs sharing to certain devices.
> Sounds too complicated for the sake of showing SFs affinity.
> 
> > I know it's /proc and not /sys but duplicating information might not
> > be needed as I think you already have this information.
> 
> Like I explained above, we don’t have that info. I think IRQF_SHARED
> probably was missing piece to you previously.
> PCI duplicates this info and widely used by irqbalance [1] which reads from
> /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>.
> 
> The useful part of irq files under sys is, a use space can directly see the irq
> files under the desired device directly.
> It would be extended for auxiliary bus SFs.
> 
> [1] https://github.com/Irqbalance/irqbalance
> 
> > If not, what is missing in /proc/irq today for this?
> >
> If IRQF_SHARED is used, nothing is missing. But we don’t find a good way to
> use it.
> 
> > > > And what about /proc/irq/ and /proc/interrupts/ doesn't that show
> > > > you the needed information?  Why are aux devices somehow "special"
> > here?
> > >
> > >
> > > /proc/interrupts interrupt name is some random driver choice.
> >
> > That's the same name you use here.
> >
> Yes, like I explained above. We can come up with some name for
> /proc/interrupts.
> But it won't tell which all SFs are sharing it, for example 64 SFs sharing a
> vector won't be visible in some giant irq name in /proc/interrupts.
> 
> So, to summarize, there are three places.
> 
> 1. /proc/interrupts.
> Limitations:
> a. One line per interrupt does not fit the case.
> b. interrupt name also not enough to show sharing SFs.
> 
> 2. /proc/irq/<irq_number>/{device_list}
> Limited only to IRQF_SHARED users.
> 
> 3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name>
> Pros:
> a. Matches with pci PF and VFs
> b. will be usable by irqbalance in similar way as PCI bus c. works regardless of
> IRQF_SHARED used or not used.
> d. Does not require complicated IRQ subsystem changes to limit
> IRQF_SHARING to only a specific device type.
> 
> > > There is
> > > no sane naming convention and some small few bytes of upper limit of
> > > what the IRQ name.
> >
> > That's up to the driver to name, so wouldn't it be the same here?
> >
> > > > > Additionally, the SFs are multifunctional devices supporting
> > > > > RDMA, network devices, clocks, and more, similar to their peer
> > > > > PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ
> > > > > information available at the bus/device level.
> > > >
> > > > But it should be as part of the pci device, as that's where that
> > > > information lives and is "bound" to, not the aux device on its own.
> > >
> > >
> > > Auxiliary bus level SF device is using that IRQ too. So it is
> > > additionally shown at auxiliary device level too.
> >
> > Your aux bus devices are using that, it's not generic to ALL aux bus devices.
> > Along those lines, why not just put this info in the aux bus devices
> > that your driver is bound to?
> >
> mlx5 driver uses the generic aux bus for SFs.
> mlx5 driver can create some sysfs files directly into this SF device, however
> we strongly want to avoid doing such vendor specific files based on our
> experience in netdev and rdma subsystems.
> 
> There are two main reasons to not put in mlx5 driver code.
> 
> 1. In past several times it happened that a drivers tend to abuse this sysfs
> interface to put any sort of debug information or other things and it becomes
> hard to get rid of.
> An example of this is fw_pages in
> /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹ I cleaned a lot in rdma
> now to avoid such abuse.
> Want to avoid the same for the auxiliary bus too.
> 
> 2. Intel sf driver is just getting started up. If mlx5 driver shows "irqs", and intel
> driver will show this as "foo_irqs", the user will not have same experience
> across two drivers.
> For example, irqbalance [1] will not be able to work for these two vendors.
> 
> So, it would be wiser to use aux bus extension for SF to show the directory,
> like how pci bus does.
> Vendor driver is just feeding as what irq is used by semi-virtual bus.
> 
> > > > > To overcome the above limitations, this short series extends the
> > > > > auxiliary bus to display IRQ information in sysfs, similar to
> > > > > that of PFs and VFs.
> > > >
> > > > Again, examples of what it looks like today, and what it will look
> > > > like with this patch set is needed in order to justify why this
> > > > really is needed as it seems that the information should already
> > > > be there
> > for you.
> > >
> > >
> > > full example:
> > > in a machine with a single child SF of the PF 00:0b.0 we currently
> > > have the bellow:
> > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/
> > > 41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57
> > > 58
> > >
> > > with this patch-set we will also have:
> > > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/
> > > 50  51  52  53  54  55  56  57  58
> > >
> > > which means that IRQs 50-58, which are shown on the PF "msi_irqs"
> > > are used by the child SF.
> >
> > Please document this in the changelog if you wish to persist with it.
> >
> Sure. Its already part of the commit log of patch 2.
> Will add the same example in cover letter too.
> 
> > > > > It adds an 'irqs' directory under the auxiliary device and
> > > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI
> > > > > SF auxiliary devices share the IRQ with other SFs, a detail that
> > > > > is also not available to the users. Consequently, this <irq_num>
> > > > > file indicates whether the IRQ is 'exclusive' or 'shared'.  This 'irqs'
> > > > > directory extenstion is optional, i.e. only for PCI SFs the
> > > > > sysfs irq
> > information is optionally exposed.
> > > >
> > > > Why does userspace care about "shared" or not?  What can they do
> > > > with that, and why?
> > >
> > >
> > > If IRQ is shared, userspace needs to take it into account when
> > > looking at IRQ data and counters such as interrupts/sec.
> > > Also, If IRQ is shared, user better not mess with the irq affinity
> > > of such irq it as it can affect multiple SFs.
> >
> > But the logic of "shared" is at the irq level, not at the aux bus level.
> > That's where that should be shown, not in a random bus subsystem,
> > otherwise you would have to look across ALL busses to properly
> > determine if an irq is shared or not, right?
> >
> Yes, that is a good point. Showing shared or exclusive irq in
> /sys/irq/<irq_num>/device_name file is right think to do.
> I agree with you.
> 

> With that, are you ok, if we just add the
> /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs used
> by this SF?
> And let user space deal to figure out sharing or exclusive.
> 
> This will eliminate the global xarray in the patch-1 and "shared", "exclusive"
> strings part of all the refcount code.
>

Did you get a chance to review above?
Are you ok to add the sysfs files by the SF auxiliary bus driver without the "shared"/"exclusive" complexity?




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux