On Wed, Dec 04 2024 at 09:13, Anup Patel wrote:
> On Wed, Dec 4, 2024 at 4:29 AM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> As I was looking at something else MSI related I had a look at
>> imsic_irq_set_affinity() again.
>>
>> It's actually required to have the message write in that function and
>> not afterwards as you invoke imsic_vector_move() from that function.
>>
>> That's obviously not true for the remap case as that will not change the
>> message address/data pair because the remap table entry is immutable -
>> at least I assume so for my mental sanity's sake :)
>>
>> But that brings me to a related question. How is this supposed to work
>> with non-atomic message updates? PCI/MSI does not necessarily provide
>> masking, and the write of the address/data pair is done in bits and
>> pieces. So you can end up with an intermediate state seen by the device
>> which ends up somewhere in interrupt nirvana space.
>>
>> See the dance in msi_set_affinity() and commit 6f1a4891a592
>> ("x86/apic/msi: Plug non-maskable MSI affinity race") for further
>> explanation.
>>
>> The way the IMSIC driver works seems to be pretty much the same as
>> the x86 APIC mess:
>>
>>    @address is the physical address of the per CPU MSI target
>>    address and @data is the vector ID on that CPU.
>>
>> So the non-atomic update in case of non-maskable MSI suffers from the
>> same problem. It works most of the time, but if it doesn't you might
>> stare at the occasionally lost interrupt and the stale device in
>> disbelief for quite a while :)
>
> Yes, we have the same challenges as x86 APIC when changing
> MSI affinity.
>
>> I might be missing something which magically prevents that though :)
>
> Your understanding is correct. In fact, the IMSIC msi_set_affinity()
> handling is inspired by the x86 APIC approach due to the similarity of
> the overall MSI controller.
>
> The high-level idea of imsic_irq_set_affinity() is as follows:
>
> 1) Allocate new_vector (new CPU IMSIC address + new ID on that CPU)
>
> 2) Update the MSI address and data programmed in the device
>    based on new_vector (see imsic_msi_update_msg())
>
> 3) At this point the device points to the new_vector but old_vector
>    (old CPU IMSIC address + old ID on that CPU) is still enabled and
>    we might have received an MSI on old_vector while we were busy
>    setting up a new_vector for the device. To address this, we call
>    imsic_vector_move().
>
> 4) The imsic_vector_move() marks the old_vector as being
>    moved and schedules a lazy timer on the old CPU.
>
> 5) The lazy timer expires on the old CPU and results in
>    __imsic_local_sync() being called on the old CPU.
>
> 6) If there was a pending MSI on the old vector then the
>    __imsic_local_sync() function injects an MSI to the
>    new_vector using an MMIO write.
>
> It is very unlikely that an MSI from the device will be dropped
> (unless I am missing something) but the unsolved issue
> is that the handling of an in-flight MSI received on the old_vector
> during the MSI re-programming is delayed, which may have
> side effects on the device driver side.

Interrupt delivery can be delayed for tons of other reasons. But yes,
you are missing something which is worse than a jiffie delay:

     CPU					Device

     msi_update_msg()
       compose()
       write()
         write(msg->address_lo);	[1]
         write(msg->address_hi);	[2]
						Raises interrupt	[3]
         write(msg->data);		[4]

[2] can be ignored as it should not change (otherwise 32bit only
devices would not work).

Let's assume that the original message was:

     addr_lo = 0x1000, data = 0x10	(CPU1, vector 0x10)

The new message is:

     addr_lo = 0x2000, data = 0x20	(CPU2, vector 0x20)

After [2] the device sees:

     addr_lo = 0x2000, data = 0x10	(CPU2, vector 0x10)

The interrupt raised in [3] will end up on CPU2 at vector 0x10, which
might not be in use or might be used by some other device.
In any case the interrupt is not reaching the real device handler and
you can't see that interrupt as pending in CPU1.

That's why x86 in the case of non-remapped interrupts has this for the
two situations:

1) old->data == new->data

      write_msg(new);

   The next interrupt either arrives on the old CPU or on the new CPU
   depending on timing. There is no intermediate state because the
   vector (data) is the same on both CPUs.

2) old->data != new->data

      tmp_msg.addr = old_msg.addr;
      tmp_msg.data = new_msg.data;
      write_msg(tmp_msg);

   So after that write the device might raise the interrupt on CPU1 and
   vector 0x20. The next step is

      write_msg(new);

   which changes the destination CPU to CPU2. So depending on what the
   device has observed, the interrupt might end up on

      CPU1 vector 0x10 (old)
      CPU1 vector 0x20 (tmp)
      CPU2 vector 0x20 (new)

   CPU1 vector 0x20 (tmp) is then checked in the pending register and
   the interrupt is retriggered if pending.

That requires moving the interrupt from actual interrupt context on the
old target CPU. It allows evaluating the old target CPU's pending
register with local interrupts disabled, which obviously does not work
remotely.

I don't see how that can work remotely with the IMSIC either, even if
you can easily access the pending state of the remote CPU:

     CPU0			CPU1			Device

     set_affinity()
       write_msg(tmp)
							write(addr); // CPU1
							write(data); // vector 0x20
							raise IRQ
				handle vector 0x20
				(spurious or other
				 device)
       check_pending(CPU1, 0x20) == false

       -> Interrupt is lost

Remapped interrupts do not have that issue because the table update is
atomic vs. a concurrently raised interrupt, so the interrupt either
ends up on the old or on the new destination. The device message does
not change as it always targets the table entry, which is immutable
after setup.

That's why x86 has this special MSI affinity setter for the non-remapped
case. For the remapped case it's just using the hierarchy default, which
ends up at the remap domain.
So the hierarchies look like this:

1) non-remap

   PCI/MSI device domain
	irq_chip.irq_set_affinity = msi_set_affinity;
   VECTOR domain

2) remap

   PCI/MSI device domain
	irq_chip.irq_set_affinity = msi_domain_set_affinity;
   REMAP domain
	irq_chip.irq_set_affinity = remap_set_affinity;
   VECTOR domain

The special case of msi_set_affinity() is not involved in the remap case
and is solely utilized for the non-remap horrors. The remap case does
not require the interrupt to be moved in interrupt context either
because there is no intermediate step and no pending register check on
the old CPU involved.

The vector cleanup on the old CPU always happens when the interrupt
arrives on the new target for the first time, independent of remapping
mode. That's required in both cases to take care of the situation where
the interrupt is raised on the old vector (CPU1, 0x10) before the write
happens and is pending in CPU1. So after desc::lock is released and
interrupts are reenabled, the interrupt is handled on CPU1. The next one
is guaranteed to arrive on CPU2, and that triggers the cleanup of the
CPU1 vector, which releases it for reuse in the matrix allocator.

I think you should model it in a similar way instead of trying to
artificially reuse imsic_irq_set_affinity() for the remap case,
especially when you decide to close the non-maskable MSI hole described
above, which I recommend doing :)

You can still utilize msi-lib and just have your private implementation
of init_dev_msi_info() as a wrapper around the library, similar to
mbi_init_dev_msi_info() and its_init_dev_msi_info(). That wrapper would
just handle the non-remap case to set info::chip::irq_set_affinity to
the magic non-remap function.

Hope that helps.

> I believe in the future RISC-V AIA v2.0 (whenever that
> happens) will address the gaps in AIA v1.0 (like this one).

If that is strictly translation table based, yes.

Thanks,

        tglx