On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +	u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
>
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, e.g. the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently, due to on-die memory constraints, mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update becomes a
command/response operation, not an MMIO operation.

The HW is already swapping the queues that cause the interrupts to
RAM, so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense of scale, a 'working set' for the NIC device can
in some cases be hundreds of megabytes of data. System RAM is used to
store this, and precious on-die memory holds some dynamic active set,
much like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion.

The device will serialize everything internally. If the device has
died, the driver has code to detect that and trigger a PCI function
reset, which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command,
spin/sleep until the device returns a response, and then continue on.
If the device doesn't return a response in a time window then trigger
a WQ to do a full device reset. (A rough sketch of that flow is at
the end of this mail.)

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.

> If interrupt remapping is enabled then both are trivial because then
> the irq chip can delegate everything to the parent chip, i.e. the
> remapping unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
>
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need
I assumed the idxd HW had some hidden swapping to RAM.

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO-triggered swapping work that are not
available to a true PCI-E device.
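For concreteness, here is a rough sketch of what the mask side of such
an irq_chip could look like. None of this is real mlx5 code: struct
mydev, mydev_post_cmd(), mydev_wait_cmd() and MYDEV_CMD_MASK_VECTOR
are made-up stand-ins for a driver's command interface, and the
timeout/reset policy is just the flow described above.

/*
 * Sketch only, not mlx5 code. struct mydev, mydev_post_cmd(),
 * mydev_wait_cmd() and MYDEV_CMD_MASK_VECTOR are hypothetical
 * stand-ins for a device command interface whose interrupt
 * descriptors live in RAM.
 */
#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/workqueue.h>

#define MYDEV_CMD_MASK_VECTOR	0x01

struct mydev {
	struct workqueue_struct *reset_wq;
	struct work_struct reset_work;
	/* command queue state lives here */
};

int mydev_post_cmd(struct mydev *mdev, u32 opcode, unsigned long vector);
int mydev_wait_cmd(struct mydev *mdev, unsigned int timeout_ms);

static void mydev_ims_mask_irq(struct irq_data *data)
{
	struct mydev *mdev = irq_data_get_irq_chip_data(data);
	int ret;

	/*
	 * Push a "mask this vector" command; the device DMAs the updated
	 * descriptor from RAM and serializes it internally against any
	 * interrupt generation.
	 */
	ret = mydev_post_cmd(mdev, MYDEV_CMD_MASK_VECTOR, irqd_to_hwirq(data));
	if (!ret)
		ret = mydev_wait_cmd(mdev, 100 /* ms */);

	/*
	 * No response in time: assume the device has died and have a WQ
	 * do a full function reset, which also stops the interrupt.
	 */
	if (ret)
		queue_work(mdev->reset_wq, &mdev->reset_work);
}

The unmask side would be the same with a different opcode, and for
updates that don't have to be synchronous (the rebalancing case above)
the wait step would simply be skipped.

Jason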