On Mon, Apr 10 2023 at 12:16, David Laight wrote: > From: Thomas Gleixner >> Sent: 10 April 2023 07:50 >> It's trivial enough to latch the message on unmask into a shadow >> register set and let the state engine work from there. > > I could implement the memory block inside the MSI_X logic > and then detect writes. > Anything else uses a lot more fpga resource. Whatever it takes there are enough ways to solve that. >> And no, we are not adding random delays to that code just because. > > I was mostly suggesting one be moved. > (Changing the address/data is hardly a hot path.) There is _no_ delay to be moved. The read back is not a delay. It's a functional requirement. > Actually I'm trying to work out what the readbacks in the MSI-X > code are actually for (apart from adding a 'random delay'). It's absolutely not random. It's to ensure that _all_ writes are visible and effective in the device before the function returns. > Now I can't remember whether the data TLP for reads is allowed > to overtake a posted write request, but it really doesn't matter. > Even if the IRQ write arrives at the root hub before the > read completion they then head off in different directions > through the 'cpu' logic - so the read could easily complete > well before the interrupt gets processed. The kernel handles the case that an already pending interrupt comes in on the original vector and CPU. But it must be guaranteed that any interrupt raised _after_ the mask/update/unmask sequence will arrive at the new destination. > In fact an interrupt requested well before the mask bit is set > could be pending in the ioapic waiting for the cpu to enable > interrupts. MSI-X interrupts are never pending in the IOAPIC. > So I don't see any reason for those readbacks at all. But you want new read backs or move the existing one, right? Can you make your mind up? > I've never looked at the cpu end of MSI-X, but I suspect that > stop using an interrupts you need to mask it there first, > then disable the hardware and later make sure you clear > any lurking pending request (probably before unmasking for > later use). You can suspect what you want. The kernel can handle this perfectly correctly without any of this: Startup: Assign vector $M on CPU $A in msi_entry Unmask in msi_entry Interrupt affinity change request: New target CPU mask is stored in the interrupt descriptor and affinity change request is queued. That's a pure kernel internal software operation and does not touch the device. Interrupt raised on CPU $A vector $M: Handle interrupt Mask in msi_entry Assign vector $N on CPU $B Unmask in msi_entry EOI Interrupt raised on CPU $B vector $N: Handle interrupt Send IPI to CPU$A to cleanup vector $M EOI Affinity for non-remapped MSI/MSI-X is always changed in context of an interrupt being handled on the original target CPU/vector. The whole mechanism relies on standard conform devices simply because it's impossible to handle the following scenario: CPU A Device CPU B write(entry->address_lo); (Changes the destination to CPU B) Writes entry->data to entry->address_lo Gets a spurious interrupt on the original vector $N As a consequence the interrupt would be lost and in the worst case the device is not longer functional. CPU A cannot access the local APIC IRR of CPU B to test for this. Even if it could, it would only observe it when CPU B cannot handle it at that moment. And no, we are not going to install a magic "maybe random PCIe device misbehaves handler" on CPU B for doing so. We simply can't and it's not necessary at all. There is a "work around" for non-maskable MSI. See msi_set_affinity() for the gory details. But also that requires that the writes are visible in the device when the write_msg() function returns. And no, we are not extending this to MSI-X simply because contrary to MSI MSI-X is well defined. I'm dealing with MSI-X for 15+ years now and there hasn't been a single device which violated the standard in that regard. At least to my knowledge. Please fix your hardware/firmware to be standard compliant. The kernel is not there to proliferate sloppy hardware designs. We simply are not opening that can of worms. Aside of that a device must be compliant to work independent of the OS and other OSes are neither interested in horrible hacks just to let hardware people do what they want. Standards are there for a reason. Just for the record. entry->address_hi is irrelvant in the above scenario because address_hi is only in use when interrupt remapping is enabled (at least on x86). If interrupt remapping is enabled, none of the above matters as the affinity change happens at the remapping level and the device MSI[-X] entries never change after setup. The remapping unit has none of those problems at all. Thanks, tglx