On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote:
> On Fri, Aug 21 2020 at 09:45, Jason Gunthorpe wrote:
> > On Fri, Aug 21, 2020 at 02:25:02AM +0200, Thomas Gleixner wrote:
> >> +static void ims_mask_irq(struct irq_data *data)
> >> +{
> >> +	struct msi_desc *desc = irq_data_get_msi_desc(data);
> >> +	struct ims_array_slot __iomem *slot = desc->device_msi.priv_iomem;
> >> +	u32 __iomem *ctrl = &slot->ctrl;
> >> +
> >> +	iowrite32(ioread32(ctrl) & ~IMS_VECTOR_CTRL_UNMASK, ctrl);
> >
> > Just to be clear, this is exactly the sort of operation we can't do
> > with non-MSI interrupts. For a real PCI device to execute this it
> > would have to keep the data on die.
>
> We means NVIDIA and your new device, right?

We'd like to use this in the current Mellanox NIC HW, e.g. the mlx5
driver. (NVIDIA acquired Mellanox recently)

> So if I understand correctly then the queue memory where the MSI
> descriptor sits is in RAM.

Yes, IMHO that is the whole point of this 'IMS' stuff. If devices
could have enough on-die memory then they could just use really big
MSI-X tables. Currently, due to on-die memory constraints, mlx5 is
limited to a few hundred MSI-X vectors.

Since MSI-X tables are exposed via MMIO they can't be 'swapped' to
RAM.

Moving away from MSI-X's MMIO access model allows them to be swapped
to RAM. The cost is that accessing them for update becomes a
command/response operation, not an MMIO operation.

The HW is already swapping the queues that cause the interrupts to
RAM, so adding a bit of additional data to store the MSI addr/data is
reasonable.

To give some sense of scale, a 'working set' for the NIC device can
in some cases be hundreds of megabytes of data. System RAM is used to
store this, and precious on-die memory holds some dynamic active set,
much like a processor cache.

> How is that supposed to work if interrupt remapping is disabled?

The best we can do is issue a command to the device and spin/sleep
until completion.

The device will serialize everything internally. If the device has
died, the driver has code to detect that and trigger a PCI function
reset, which will definitely stop the interrupt.

So, the implementation of these functions would be to push any change
onto a command queue, trigger the device to DMA the command,
spin/sleep until the device returns a response, and then continue on.
If the device doesn't return a response in a time window then trigger
a WQ to do a full device reset. (A rough sketch of that flow is at
the end of this mail.)

The spin/sleep is only needed if the update has to be synchronous, so
things like rebalancing could just push the rebalancing work and
immediately return.

> If interrupt remapping is enabled then both are trivial because then
> the irq chip can delegate everything to the parent chip, i.e. the
> remapping unit.

I did like this notion that IRQ remapping could avoid the overhead of
spin/sleep. Most of the use cases we have for this will require the
IOMMU anyhow.

> > I saw the idxd driver was doing something like this, I assume it
> > avoids trouble because it is a fake PCI device integrated with the
> > CPU, not on a real PCI bus?
>
> That's how it is implemented as far as I understood the patches. It's
> device memory therefore iowrite32().

I don't know anything about idxd.. Given the scale of interrupt need
I assumed the idxd HW had some hidden swapping to RAM.

Since it is on-die with the CPU there are a bunch of ways I could
imagine Intel could make MMIO-triggered swapping work that are not
available to a true PCI-E device.
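For concreteness, here is a rough sketch of what the mask side of such
an irq_chip could look like. None of this is real mlx5 code: struct
mydev, mydev_post_cmd(), mydev_wait_cmd() and MYDEV_CMD_MASK_VECTOR
are made-up stand-ins for a driver's command interface, and the
timeout/reset policy is just the flow described above.

/*
 * Sketch only, not mlx5 code. struct mydev, mydev_post_cmd(),
 * mydev_wait_cmd() and MYDEV_CMD_MASK_VECTOR are hypothetical
 * stand-ins for a device command interface whose interrupt
 * descriptors live in RAM.
 */
#include <linux/irq.h>
#include <linux/irqdomain.h>
#include <linux/workqueue.h>

#define MYDEV_CMD_MASK_VECTOR	0x01

struct mydev {
	struct workqueue_struct *reset_wq;
	struct work_struct reset_work;
	/* command queue state lives here */
};

int mydev_post_cmd(struct mydev *mdev, u32 opcode, unsigned long vector);
int mydev_wait_cmd(struct mydev *mdev, unsigned int timeout_ms);

static void mydev_ims_mask_irq(struct irq_data *data)
{
	struct mydev *mdev = irq_data_get_irq_chip_data(data);
	int ret;

	/*
	 * Push a "mask this vector" command; the device DMAs the updated
	 * descriptor from RAM and serializes it internally against any
	 * interrupt generation.
	 */
	ret = mydev_post_cmd(mdev, MYDEV_CMD_MASK_VECTOR, irqd_to_hwirq(data));
	if (!ret)
		ret = mydev_wait_cmd(mdev, 100 /* ms */);

	/*
	 * No response in time: assume the device has died and have a WQ
	 * do a full function reset, which also stops the interrupt.
	 */
	if (ret)
		queue_work(mdev->reset_wq, &mdev->reset_work);
}

The unmask side would be the same with a different opcode, and for
updates that don't have to be synchronous (the rebalancing case above)
the wait step would simply be skipped.

Jason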