Hi Greg, > From: Parav Pandit > Sent: Thursday, June 6, 2024 8:13 AM > > Hi Greg, > > > > From: Parav Pandit > > Sent: Monday, June 3, 2024 12:41 PM > > > > Hi Greg, > > > > An emergency has turned up for Shay. So, he is going to be out of > > office for few days. > > I will attempt to answer below questions and spin v6 based on your > > guidance. > > > > > From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> > > > Sent: Thursday, May 30, 2024 8:21 PM > > > > > > On Wed, May 29, 2024 at 09:19:13AM +0300, Shay Drori wrote: > > > > > > > > > > > > On 28/05/2024 20:57, Greg KH wrote: > > > > > External email: Use caution opening links or attachments > > > > > > > > > > > > > > > On Tue, May 28, 2024 at 12:11:42PM +0300, Shay Drory wrote: > > > > > > Today, PCI PFs and VFs, which are anchored on the PCI bus, > > > > > > display their IRQ information in the > > > > > > <pci_device>/msi_irqs/<irq_num> sysfs files. PCI subfunctions > > > > > > (SFs) are similar to PFs and VFs and these SFs are anchored on > > > > > > the auxiliary bus. However, these PCI SFs lack such IRQ > > > > > > information on the auxiliary bus, leaving users without > > > > > > visibility into which IRQs are used by the SFs. This absence > > > > > > makes it impossible to debug situations and to understand the > > > > > > source > > of interrupts/SFs for performance tuning and debug. > > > > > > > > > > Wait, again, this feels wrong. You should be able to walk back > > > > > up the tree see the irq for the device, and vf, right? Why > > > > > would the value be different down in the aux device? > > > > > > > > > > > > you are correct, the IRQs of the aux device are subset of the IRQs > > > > of the parent device. more detailed answer bellow. > > > > > > But isn't that already shown in /proc/interrupts? Why show it here as > well? > > > > > Because /proc/interrupts cannot distinguish which irqs are for which SFs. > > More below. > > > > > > > Does the msi irq somehow not actually show anywhere for the real > > > > > pci > > > device in sysfs at all today? > > > > > > > > > > What does sysfs look like today exactly for this information? > > > > > > > > > > > > The IRQ information of all the children SFs of a PF is found in > > > > the PF device as one single list. > > > > There is no sane way to distinguish which IRQ is used by which SFs. > > > > For example, in a machine with a single child SF of the PF 00:0b.0 > > > > we currently have the bellow: > > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/ > > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 > > > > 58 > > > > > > > > the above are IRQs of both the PF and its child SF. from here we > > > > cannot tell which IRQ is used by the child SF. > > > > > > Does that matter? > > Yes it matters. A user needs to see if a given IRQ is for SF or a PF > > and if it is for SF, is it for which SF. > > It needs to know this to set/modify the irq affinity via > > /proc/irq/<irq_num>/affinity_files. > > > > > Seriously, what is userspace going to do with that that it can not > > > already determine from /proc/interrupts/ or /proc/irq/? > > > > > User space cannot determine if a given IRQ is for SF or not from > > /proc/interrupts and /proc/irq either. > > Let me explain with example. > > It cannot figure out for two reasons. > > > > 1. For a PF or a VF, the interrupt name in /proc/interrupts looks like below. > > mlx5 VF device writes irq name as: mlx5_comp62@pci:0000:06:00.0. > > Format is: "purpose, event q number @ device_name". > > Purpose = mlx5_compl (completions) > > Event q number = 62 > > Device = pci:0000:06.00.0. > > > > Can we do same for SFs? > > For example, as, > > mlx5_compl62@auxiliary:mlx5_core.sf.2 > > > > Will it work when SFs share the IRQ? No. it won't be accurate name > > when a vector is shared among many SFs. > > Also updating IRQ names dynamically to arbitrary long name won't be > > any good either. > > > > 2. mlx5 do not use IRQF_SHARED. > > This is the main reason due to which, /proc/irq/<irq_num>/devices_list > > is missing SFs auxiliary device entries. > > > > We cannot share the NIC interrupt with any other arbitrary system > > device because nic SF needs to have reasonably predictable latency. > > > > Shared interrupts also complicate the mlx5 driver flow even further as > > it can fire right away. > > And we want to avoid the already complex path further, with device > > recovery flow in this area. > > > > Secondly, a while back also when team played with IRQF_SHARED there > > was 8% or so perf drop too. > > So mlx5 uses IRQ sharing among its own interrupt generators. > > This is uniformly across PF, VF, SFs. > > It also has certain good thresholds where it avoids sharing the IRQ > > and allocates new dynamic one. > > Thanks to Thomas's work. > > > > It wouldn’t be wise to use IRQF_SHARED for mlx5 NICs, unless IRQ > > subsystem can offer APIs to restrict IRQs sharing to certain devices. > > Sounds too complicated for the sake of showing SFs affinity. > > > > > I know it's /proc and not /sys but duplicating information might not > > > be needed as I think you already have this information. > > > > Like I explained above, we don’t have that info. I think IRQF_SHARED > > probably was missing piece to you previously. > > PCI duplicates this info and widely used by irqbalance [1] which reads > > from /sys/bus/pci/devices/<dev_name>/msi_irqs/<all_irqs file>. > > > > The useful part of irq files under sys is, a use space can directly > > see the irq files under the desired device directly. > > It would be extended for auxiliary bus SFs. > > > > [1] https://github.com/Irqbalance/irqbalance > > > > > If not, what is missing in /proc/irq today for this? > > > > > If IRQF_SHARED is used, nothing is missing. But we don’t find a good > > way to use it. > > > > > > > And what about /proc/irq/ and /proc/interrupts/ doesn't that > > > > > show you the needed information? Why are aux devices somehow > "special" > > > here? > > > > > > > > > > > > /proc/interrupts interrupt name is some random driver choice. > > > > > > That's the same name you use here. > > > > > Yes, like I explained above. We can come up with some name for > > /proc/interrupts. > > But it won't tell which all SFs are sharing it, for example 64 SFs > > sharing a vector won't be visible in some giant irq name in /proc/interrupts. > > > > So, to summarize, there are three places. > > > > 1. /proc/interrupts. > > Limitations: > > a. One line per interrupt does not fit the case. > > b. interrupt name also not enough to show sharing SFs. > > > > 2. /proc/irq/<irq_number>/{device_list} > > Limited only to IRQF_SHARED users. > > > > 3. /sys/bus/auxiliary/devices/mlx5_core.sf.2/irqs<irq_file_name> > > Pros: > > a. Matches with pci PF and VFs > > b. will be usable by irqbalance in similar way as PCI bus c. works > > regardless of IRQF_SHARED used or not used. > > d. Does not require complicated IRQ subsystem changes to limit > > IRQF_SHARING to only a specific device type. > > > > > > There is > > > > no sane naming convention and some small few bytes of upper limit > > > > of what the IRQ name. > > > > > > That's up to the driver to name, so wouldn't it be the same here? > > > > > > > > > Additionally, the SFs are multifunctional devices supporting > > > > > > RDMA, network devices, clocks, and more, similar to their peer > > > > > > PCI PFs and VFs. Therefore, it is desirable to have SFs' IRQ > > > > > > information available at the bus/device level. > > > > > > > > > > But it should be as part of the pci device, as that's where that > > > > > information lives and is "bound" to, not the aux device on its own. > > > > > > > > > > > > Auxiliary bus level SF device is using that IRQ too. So it is > > > > additionally shown at auxiliary device level too. > > > > > > Your aux bus devices are using that, it's not generic to ALL aux bus devices. > > > Along those lines, why not just put this info in the aux bus devices > > > that your driver is bound to? > > > > > mlx5 driver uses the generic aux bus for SFs. > > mlx5 driver can create some sysfs files directly into this SF device, > > however we strongly want to avoid doing such vendor specific files > > based on our experience in netdev and rdma subsystems. > > > > There are two main reasons to not put in mlx5 driver code. > > > > 1. In past several times it happened that a drivers tend to abuse this > > sysfs interface to put any sort of debug information or other things > > and it becomes hard to get rid of. > > An example of this is fw_pages in > > /sys/class/infiniband/mlx5_0/device/fw_pages file. ☹ I cleaned a lot > > in rdma now to avoid such abuse. > > Want to avoid the same for the auxiliary bus too. > > > > 2. Intel sf driver is just getting started up. If mlx5 driver shows > > "irqs", and intel driver will show this as "foo_irqs", the user will > > not have same experience across two drivers. > > For example, irqbalance [1] will not be able to work for these two vendors. > > > > So, it would be wiser to use aux bus extension for SF to show the > > directory, like how pci bus does. > > Vendor driver is just feeding as what irq is used by semi-virtual bus. > > > > > > > > To overcome the above limitations, this short series extends > > > > > > the auxiliary bus to display IRQ information in sysfs, similar > > > > > > to that of PFs and VFs. > > > > > > > > > > Again, examples of what it looks like today, and what it will > > > > > look like with this patch set is needed in order to justify why > > > > > this really is needed as it seems that the information should > > > > > already be there > > > for you. > > > > > > > > > > > > full example: > > > > in a machine with a single child SF of the PF 00:0b.0 we currently > > > > have the bellow: > > > > $ ls /sys/bus/pci/devices/0000:00:0b.0/msi_irqs/ > > > > 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 > > > > 58 > > > > > > > > with this patch-set we will also have: > > > > $ ls /sys/bus/pci/devices/0000\:00\:0b.0/mlx5_core.sf.1/irqs/ > > > > 50 51 52 53 54 55 56 57 58 > > > > > > > > which means that IRQs 50-58, which are shown on the PF "msi_irqs" > > > > are used by the child SF. > > > > > > Please document this in the changelog if you wish to persist with it. > > > > > Sure. Its already part of the commit log of patch 2. > > Will add the same example in cover letter too. > > > > > > > > It adds an 'irqs' directory under the auxiliary device and > > > > > > includes an <irq_num> sysfs file within it. Sometimes, the PCI > > > > > > SF auxiliary devices share the IRQ with other SFs, a detail > > > > > > that is also not available to the users. Consequently, this > > > > > > <irq_num> file indicates whether the IRQ is 'exclusive' or 'shared'. This > 'irqs' > > > > > > directory extenstion is optional, i.e. only for PCI SFs the > > > > > > sysfs irq > > > information is optionally exposed. > > > > > > > > > > Why does userspace care about "shared" or not? What can they do > > > > > with that, and why? > > > > > > > > > > > > If IRQ is shared, userspace needs to take it into account when > > > > looking at IRQ data and counters such as interrupts/sec. > > > > Also, If IRQ is shared, user better not mess with the irq affinity > > > > of such irq it as it can affect multiple SFs. > > > > > > But the logic of "shared" is at the irq level, not at the aux bus level. > > > That's where that should be shown, not in a random bus subsystem, > > > otherwise you would have to look across ALL busses to properly > > > determine if an irq is shared or not, right? > > > > > Yes, that is a good point. Showing shared or exclusive irq in > > /sys/irq/<irq_num>/device_name file is right think to do. > > I agree with you. > > > > > With that, are you ok, if we just add the > > /sys/bus/auxiliary/devices/mlx5_core.sf.2/<irq_num> files to list IRQs > > used by this SF? > > And let user space deal to figure out sharing or exclusive. > > > > This will eliminate the global xarray in the patch-1 and "shared", "exclusive" > > strings part of all the refcount code. > > > > Did you get a chance to review above? > Are you ok to add the sysfs files by the SF auxiliary bus driver without the > "shared"/"exclusive" complexity? Can you please suggest? Are you ok with this approach?