On Fri, Aug 21 2020 at 17:17, Jason Gunthorpe wrote: > On Fri, Aug 21, 2020 at 09:47:43PM +0200, Thomas Gleixner wrote: >> So if I understand correctly then the queue memory where the MSI >> descriptor sits is in RAM. > > Yes, IMHO that is the whole point of this 'IMS' stuff. If devices > could have enough on-die memory then they could just use really big > MSI-X tables. Currently due to on-die memory constraints mlx5 is > limited to a few hundred MSI-X vectors. Right, that's the limit of a particular device, but nothing prevents you to have a larger table on a new device. The MSI-X limitation to 2048 is defined by the PCI spec and you'd need either some non spec compliant abuse of the reserved size bits or some extra config entry. So IMS is a way to work around that. But I understand why you want to move them to main memory, but you have to deal with the problems this creates upfront. > Since MSI-X tables are exposed via MMIO they can't be 'swapped' to > RAM. > > Moving away from MSI-X's MMIO access model allows them to be swapped > to RAM. The cost is that accessing them for update is a > command/response operation not a MMIO operation. > > The HW is already swapping the queues causing the interrupts to RAM, > so adding a bit of additional data to store the MSI addr/data is > reasonable. Makes sense. >> How is that supposed to work if interrupt remapping is disabled? > > The best we can do is issue a command to the device and spin/sleep > until completion. The device will serialize everything internally. > > If the device has died the driver has code to detect and trigger a > PCI function reset which will definitely stop the interrupt. If that interrupt is gone into storm mode for some reason then this will render your machine unusable before you can do that. > So, the implementation of these functions would be to push any change > onto a command queue, trigger the device to DMA the command, spin/sleep > until the device returns a response and then continue on. If the > device doesn't return a response in a time window then trigger a WQ to > do a full device reset. I really don't want to do that with the irq descriptor lock held or in case of affinity from the interrupt handler as we have to do with PCI MSI/MSI-X due to the horrors of the X86 interrupt delivery trainwreck. Also you cannot call into command queue code from interrupt disabled and interrupt descriptor lock held sections. You can try, but lockdep will yell at you immediately. There is also CPU hotplug where we have to force migrate an interrupt away from an outgoing CPU. This needs some serious thought. One question is whether the device can see partial updates to that memory due to the async 'swap' of context from the device CPU. So we have address_lo, address_hi, data and ctrl. Each of them 32 bit. address_hi is only relevant when the number of CPUs is > 255 which requires X2APIC which in turn requires interrupt remapping. For all others the address_hi value never changes. Let's ignore that case for now, but see further down. So what's interesting is address_lo and data. If the device sees an partial update, i.e. address_lo is written and the device grabs the update before data is written then the next interrupt will end up in lala land. We have code for that in place in msi_set_affinity() in arch/x86/kernel/apic/msi.c. Get eyecancer protection glasses before opening that and keep beer ready to wipe out the horrors immediately afterwards. If the device updates the data only when a command is issued then this is not a problem, but it causes other problems because you still cannot access the command queue from that context. This makes it even worse for the CPU hotplug case. But see all of the reasoning on that. If it takes whatever it sees while grabbing the state when switching to a different queue or at the point of actual interrupt delivery, then you have a problem. Not only you, I'm going to have one as well because I'm going to be the poor sod to come up with the workaround. So we better address that _before_ you start deploying this piece of art. I'm not really interested in another slighly different and probably more horrible version of the same story. Don't blame me, it's the way how Intel decided to make this "work". There are a couple of options to ensure that the device will never see inconsistent state: 1) Use a locked 16 byte wide operation (cpmxchg16) which is not available on 32bit 2) Order the MSG entry differently in the queue storage: u32 address_lo u32 data u32 address_hi u32 ctrl And then enforce an 8 byte store on 64 bit which is guaranteed to be atomic vs. other CPUs and bus agents, i.e. DMA. I said enforce because compilers are known to do stupid things. Both are fine for me and the only caveat is that the access does not go accross a cache line boundary. The restriction to 64bit shouldn't be a problem either. Running such a device on 32bit causes more problems than it solves :) > The spin/sleep is only needed if the update has to be synchronous, so > things like rebalancing could just push the rebalancing work and > immediately return. Interrupt migration is async anyway. An interrupt might have been sent to the old vector just before the new vector was written. That's already dealt with. The old vector is cleaned up when the first interrupt arrives on the new vector which is the most reliable indicator that it's done. In that case you can avoid issuing a command, but that needs some thought as well when the queue data is never reloaded. But you can mark the queue that affinity has changed and let the next operation on the queue (RX, TX, whatever) which needs to talk to the device anyway deal with it, i.e. set some command flag in the next operation which tells the queue to reload that message. The only exception is CPU hotplug, but I have an idea how to deal with that. Aside of that some stuff want's to be synchronous though. e.g. shutdown, startup. irq chips have already a mechanism in place to deal with stuff which cannot be handled from within the irq descriptor spinlock held and interrupt disabled section. The mechanism was invented to deal with interrupt chips which are connected to i2c, spi, etc.. The access to an interrupt chip control register has to queue stuff on the bus and wait for completion. Obviously not what you can do from interrupt disabled, raw spinlock held context either. So we have for most operations (except affinity setting) the concept of update on lock release. For these devices the interrupt chip which handles all lines on that controller on the slow bus has an additional lock, called bus lock. The core code does not know about that lock at all. It's managed at the irq chip side. The irqchip has two callbacks: irq_bus_lock() and irq_bus_sync_unlock(). irq_bus_lock() is invoked before interrupts are disabled and the spinlock is taken and irq_bus_sync_unlock() after releasing the spinlock and reenabling interrupts. The "real" chip operations like mask, unmask etc. are operating on an chip internal state cache. For such devices irq_bus_lock() usually takes a sleepable lock (mutex) to protect the state cache and the update logic over the slow bus. irq_bus_sync_unlock() releases that lock, but before doing so it checks whether the operation has changed the state cache and if so it queues a command on the slow bus and waits for completion. That makes sure that the device state and the state cache are in sync before the next operation on a maybe different irq line on the same chip happens. Now for your case you might just not have irq_mask()/irq_unmask() callbacks or simple ones which just update the queue memory in RAM, but then you want irq_disable()/irq_enable() callbacks which manipulate state cache and then provide the irq_bus_lock() and irq_bus_sync_unlock() callbacks as well which do not necessarily need a lock underneath, but the unlock side implements the 'Queue command and wait for completion' part. Now coming back to affinity setting. I'd love to avoid adding the bus lock magic to those interfaces because until now they can be called and are called from atomic contexts. And obviously none of the devices which use the buslock magic support affinity setting because they all deliver a single interrupt to a demultiplex interrupt and that one is usually sitting at the CPU level where interrupt steering works. If we really can get away with atomically updating the message as outlined above and just let it happen at some point in the future then most problems are solved, except for the nastyness of CPU hotplug. But that's actually a non issue. Nothing prevents us from having an early 'migrate interrupts away from the outgoing CPU hotplug state' which runs in thread context and can therefore utilize the buslock mechanism. Actually I was thinking about that for other reasons already. That state would need some thought and consequently some minor changes to the affinity mask checks to prevent that the interrupt gets migrated back to the outgoing CPU before that CPU reaches offline state. Nothing fundamental though. Just to be clear: We really need to do that at the core level and not again in some dark place in a driver as that will cause state inconsistency and hard to debug wreckage. >> If interrupt remapping is enabled then both are trivial because then the >> irq chip can delegate everything to the parent chip, i.e. the remapping >> unit. > > I did like this notion that IRQ remapping could avoid the overhead of > spin/spleep. Most of the use cases we have for this will require the > IOMMU anyhow. You still need to support !remap scenarios I fear. And even for the remap case you need some of that bus lock magic to handle startup and teardown properly without the usual horrible hacks in the driver. If your hard^Wfirmware does the right thing then the only place you need to worry about the command queueing is startup and teardown and the extra bit for the early hotplug migration. Let me summarize what I think would be the sane solution for this: 1) Utilize atomic writes for either all 16 bytes or reorder the bytes and update 8 bytes atomically which is sufficient as the wide address is only used with irq remapping and the MSI message in the device is never changed after startup. 2) No requirement for issuing a command for regular migration operations as they have no requirements to be synchronous. Eventually store some state to force a reload on the next regular queue operation. 3) No requirement for issuing a command for mask and unmask operations. The core code uses and handles lazy masking already. So if the hardware causes the lazyness, so be it. 4) Issue commands for startup and teardown as they need to be synchronous 5) Have an early migration state for CPU hotunplug which issues a command from appropriate context. That would even allow to handle queue shutdown for managed interrupts when the last CPU in the managed affinity set goes down. Restart of such a managed interrupt when the first CPU in an affinity set comes online again would only need minor modifications of the existing code to make it work. Thoughts? Thanks, tglx