Hi Greg, > From: Parav Pandit > Sent: Monday, June 3, 2024 12:41 PM > > Hi Greg, > > An emergency has turned up for Shay. So, he is going to be out of office for > few days. > I will attempt to answer below questions and spin v6 based on your > guidance. > > > From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> > > Sent: Thursday, May 30, 2024 8:21 PM > > > > On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote: > > > > > > > > > On 28/05/2024 20:57, Greg KH wrote: > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote: > > > > > Today, PCI PFs and VFs, which are anchored on the PCI bus, > > > > > display their IRQ information in the > > > > > <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions > > > > > (SFs) are similar to PFs and VFs and these SFs are anchored on > > > > > the auxiliary bus. However, these PCI SFs lack such IRQ > > > > > information on the auxiliary bus, leaving users without > > > > > visibility into which IRQs are used by the SFs. This absence > > > > > makes it impossible to debug situations and to understand the source > of interrupts/SFs for performance tuning and debug. > > > > > > > > Wait, again, this feels wrong. You should be able to walk back up > > > > the tree see the irq for the device, and vf, right? Why would the > > > > value be different down in the aux device? > > > > > > > > > you are correct, the IRQs of the aux device are subset of the IRQs > > > of the parent device. more detailed answer bellow. > > > > But isn't that already shown in /proc/interrupts? Why show it here as well? > > > Because /proc/interrupts cannot distinguish which irqs are for which SFs. > More below. > > > > > Does the msi irq somehow not actually show anywhere for the real > > > > pci > > device in sysfs at all today? > > > > > > > > What does sysfs look like today exactly for this information? > > > > > > > > > The IRQ information of all the children SFs of a PF is found in the > > > PF device as one single list. > > > There is no sane way to distinguish which IRQ is used by which SFs. > > > For example, in a machine with a single child SF of the PF 00:0b.0 > > > we currently have the bellow: > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/ > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 > > > 58 > > > > > > the above are IRQs of both the PF and its child SF. from here we > > > cannot tell which IRQ is used by the child SF. > > > > Does that matter? > Yes it matters. A user needs to see if a given IRQ is for SF or a PF and if it is for > SF, is it for which SF. > It needs to know this to set/modify the irq affinity via > /proc/irq/<irq_num>/affinity_files. > > > Seriously, what is userspace going to do with that that it can not > > already determine from /proc/interrupts/ or /proc/irq/? > > > User space cannot determine if a given IRQ is for SF or not from > /proc/interrupts and /proc/irq either. > Let me explain with example. > It cannot figure out for two reasons. > > 1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below. > mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0. > Format is: "purpose, event q number @ device_name". > Purpose = mlx5_compl (completions) > Event q number = 62 > Device = pci:0000:06.00.0. > > Can we do same for SFs? > For example, as, > mlx5_compl62@auxiliary:mlx5_core.sf.2 > > Will it work when SFs share the IRQ? No. it won't be accurate name when a > vector is shared among many SFs. > Also updating IRQ names dynamically to arbitrary long name won't be any > good either. > > 2. mlx5 do not use IRQF_SHARED. > This is the main reason due to which, /proc/irq/<irq_num>/devices_list is > missing SFs auxiliary device entries. > > We cannot share the NIC interrupt with any other arbitrary system device > because nic SF needs to have reasonably predictable latency. > > Shared interrupts also complicate the mlx5 driver flow even further as it can > fire right away. > And we want to avoid the already complex path further, with device recovery > flow in this area. > > Secondly, a while back also when team played with IRQF_SHARED there was > 8% or so perf drop too. > So mlx5 uses IRQ sharing among its own interrupt generators. > This is uniformly across PF, VF, SFs. > It also has certain good thresholds where it avoids sharing the IRQ and > allocates new dynamic one. > Thanks to Thomas's work. > > It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ subsystem > can offer APIs to restrict IRQs sharing to certain devices. > Sounds too complicated for the sake of showing SFs affinity. > > > I know it's /proc and not /sys but duplicating information might not > > be needed as I think you already have this information. > > Like I explained above, we don’t have that info. I think IRQF_SHARED > probably was missing piece to you previously. > PCI duplicates this info and widely used by irqbalance [1] which reads from > /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>. > > The useful part of irq files under sys is, a use space can directly see the irq > files under the desired device directly. > It would be extended for auxiliary bus SFs. > > [1] https://github.com/Irqbalance/irqbalance > > > If not, what is missing in /proc/irq today for this? > > > If IRQF_SHARED is used, nothing is missing. But we don’t find a good way to > use it. > > > > > And what about /proc/irq/ and /proc/interrupts/ doesn't that show > > > > you the needed information? Why are aux devices somehow "special" > > here? > > > > > > > > > /proc/interrupts interrupt name is some random driver choice. > > > > That's the same name you use here. > > > Yes, like I explained above. We can come up with some name for > /proc/interrupts. > But it won't tell which all SFs are sharing it, for example 64 SFs sharing a > vector won't be visible in some giant irq name in /proc/interrupts. > > So, to summarize, there are three places. > > 1. /proc/interrupts. > Limitations: > a. One line per interrupt does not fit the case. > b. interrupt name also not enough to show sharing SFs. > > 2. /proc/irq/<irq_number>/{device_list} > Limited only to IRQF_SHARED users. > > 3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name> > Pros: > a. Matches with pci PF and VFs > b. will be usable by irqbalance in similar way as PCI bus c. works regardless of > IRQF_SHARED used or not used. > d. Does not require complicated IRQ subsystem changes to limit > IRQF_SHARING to only a specific device type. > > > > There is > > > no sane naming convention and some small few bytes of upper limit of > > > what the IRQ name. > > > > That's up to the driver to name, so wouldn't it be the same here? > > > > > > > Additionally, the SFs are multifunctional devices supporting > > > > > RDMA, network devices, clocks, and more, similar to their peer > > > > > PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ > > > > > information available at the bus/device level. > > > > > > > > But it should be as part of the pci device, as that's where that > > > > information lives and is "bound" to, not the aux device on its own. > > > > > > > > > Auxiliary bus level SF device is using that IRQ too. So it is > > > additionally shown at auxiliary device level too. > > > > Your aux bus devices are using that, it's not generic to ALL aux bus devices. > > Along those lines, why not just put this info in the aux bus devices > > that your driver is bound to? > > > mlx5 driver uses the generic aux bus for SFs. > mlx5 driver can create some sysfs files directly into this SF device, however > we strongly want to avoid doing such vendor specific files based on our > experience in netdev and rdma subsystems. > > There are two main reasons to not put in mlx5 driver code. > > 1. In past several times it happened that a drivers tend to abuse this sysfs > interface to put any sort of debug information or other things and it becomes > hard to get rid of. > An example of this is fw_pages in > /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹ I cleaned a lot in rdma > now to avoid such abuse. > Want to avoid the same for the auxiliary bus too. > > 2. Intel sf driver is just getting started up. If mlx5 driver shows "irqs", and intel > driver will show this as "foo_irqs", the user will not have same experience > across two drivers. > For example, irqbalance [1] will not be able to work for these two vendors. > > So, it would be wiser to use aux bus extension for SF to show the directory, > like how pci bus does. > Vendor driver is just feeding as what irq is used by semi-virtual bus. > > > > > > To overcome the above limitations, this short series extends the > > > > > auxiliary bus to display IRQ information in sysfs, similar to > > > > > that of PFs and VFs. > > > > > > > > Again, examples of what it looks like today, and what it will look > > > > like with this patch set is needed in order to justify why this > > > > really is needed as it seems that the information should already > > > > be there > > for you. > > > > > > > > > full example: > > > in a machine with a single child SF of the PF 00:0b.0 we currently > > > have the bellow: > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/ > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 > > > 58 > > > > > > with this patch-set we will also have: > > > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/ > > > 50 51 52 53 54 55 56 57 58 > > > > > > which means that IRQs 50-58, which are shown on the PF "msi_irqs" > > > are used by the child SF. > > > > Please document this in the changelog if you wish to persist with it. > > > Sure. Its already part of the commit log of patch 2. > Will add the same example in cover letter too. > > > > > > It adds an 'irqs' directory under the auxiliary device and > > > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI > > > > > SF auxiliary devices share the IRQ with other SFs, a detail that > > > > > is also not available to the users. Consequently, this <irq_num> > > > > > file indicates whether the IRQ is 'exclusive' or 'shared'. This 'irqs' > > > > > directory extenstion is optional, i.e. only for PCI SFs the > > > > > sysfs irq > > information is optionally exposed. > > > > > > > > Why does userspace care about "shared" or not? What can they do > > > > with that, and why? > > > > > > > > > If IRQ is shared, userspace needs to take it into account when > > > looking at IRQ data and counters such as interrupts/sec. > > > Also, If IRQ is shared, user better not mess with the irq affinity > > > of such irq it as it can affect multiple SFs. > > > > But the logic of "shared" is at the irq level, not at the aux bus level. > > That's where that should be shown, not in a random bus subsystem, > > otherwise you would have to look across ALL busses to properly > > determine if an irq is shared or not, right? > > > Yes, that is a good point. Showing shared or exclusive irq in > /sys/irq/<irq_num>/device_name file is right think to do. > I agree with you. > > With that, are you ok, if we just add the > /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs used > by this SF? > And let user space deal to figure out sharing or exclusive. > > This will eliminate the global xarray in the patch-1 and "shared", "exclusive" > strings part of all the refcount code. > Did you get a chance to review above? Are you ok to add the sysfs files by the SF auxiliary bus driver without the "shared"/"exclusive" complexity?