On Wed, Feb 7, 2024 at 12:57 PM Björn Töpel <bjorn@xxxxxxxxxx> wrote:
>
> Hi!
>
> Anup Patel <anup@xxxxxxxxxxxxxx> writes:
>
> > On Tue, Feb 6, 2024 at 9:09 PM Björn Töpel <bjorn@xxxxxxxxxx> wrote:
> >>
> >> Hi Anup,
> >>
> >> Anup Patel <apatel@xxxxxxxxxxxxxxxx> writes:
> >>
> >> > The RISC-V AIA specification is ratified as-per the RISC-V international
> >> > process. The latest ratified AIA specification can be found at:
> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf
> >> >
> >> > At a high-level, the AIA specification adds three things:
> >> > 1) AIA CSRs
> >> >    - Improved local interrupt support
> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC)
> >> >    - Per-HART MSI controller
> >> >    - Support MSI virtualization
> >> >    - Support IPI along with virtualization
> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC)
> >> >    - Wired interrupt controller
> >> >    - In MSI-mode, converts wired interrupts into MSIs (i.e. MSI generator)
> >> >    - In Direct-mode, injects external interrupts directly into HARTs
> >> >
> >> > For an overview of the AIA specification, refer to the AIA virtualization
> >> > talk at KVM Forum 2022:
> >> > https://static.sched.com/hosted_files/kvmforum2022/a1/AIA_Virtualization_in_KVM_RISCV_final.pdf
> >> > https://www.youtube.com/watch?v=r071dL8Z0yo
> >>
> >> Thank you for continuing to work on this series! I like the direction
> >> the series is taking!
> >>
> >> TL;DR: I think we can get rid of most of the id/housekeeping data
> >> structures, except for the irq matrix.
> >>
> >> Most of my comments are more of a design/overview nature, so I'll
> >> comment here in the cover letter.
> >>
> >> I took the series for a spin, and with Alex's ftrace fix it passes
> >> all my tests nicely!
> >>
> >> Now some thoughts/comments (I'm coming from the x86 side of things!):
> >>
> >> id/enable-tracking: There are a lot of different id/enabled tracking
> >> structures with corresponding locks, where there's IMO overlap with
> >> what the matrix provides.
> >
> > The matrix allocator does not track the enabled/disabled state of
> > the per-CPU IDs. This is why we have a separate per-CPU
> > ids_enabled_bitmap which is also used for remote synchronization
> > across CPUs.
>
> Exactly, but what I'm asking is if that structure is really needed. More
> below.
>
> >> Let's start with struct imsic_priv:
> >>
> >> | /* Dummy HW interrupt numbers */
> >> | unsigned int nr_hwirqs;
> >> | raw_spinlock_t hwirqs_lock;
> >> | unsigned long *hwirqs_used_bitmap;
> >
> > The matrix allocator manages actual IDs for each CPU whereas
> > the Linux irq_data expects a fixed hwirq which does not change.
> >
> > Due to this, we have a dummy hwirq space which is always
> > fixed. The only thing that is changed under-the-hood by the
> > IMSIC driver is the dummy hwirq to actual HW vector (cpu, id)
> > mapping.
>
> Read below. I'm not talking about local_id from the irq_matrix, I'm
> saying use virq, which has the properties you're asking for, and doesn't
> require an additional structure. When an irq/desc is allocated, you have
> a nice unique number with the virq for the lifetime of the interrupt.

Sure, let me explore using virq in place of hwirq.
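Roughly along these lines, I guess (untested sketch; imsic_vector_alloc(),
imsic_irq_base_chip and struct imsic_vector are placeholders here, not
necessarily what v13 will end up calling them):

static int imsic_irq_domain_alloc(struct irq_domain *domain,
                                  unsigned int virq, unsigned int nr_irqs,
                                  void *arg)
{
        struct imsic_vector *vec;
        int i;

        for (i = 0; i < nr_irqs; i++) {
                /* Pick a (cpu, local id) pair from the irq matrix. */
                vec = imsic_vector_alloc(virq + i, cpu_online_mask);
                if (!vec)
                        return -ENOSPC; /* error unwinding omitted */

                /*
                 * hwirq == virq, so no dummy hwirq allocator is needed;
                 * the actual (cpu, id) routing lives only in the per-irq
                 * chip data and can change under the hood.
                 */
                irq_domain_set_info(domain, virq + i, virq + i,
                                    &imsic_irq_base_chip, vec,
                                    handle_simple_irq, NULL, NULL);
        }

        return 0;
}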
>
> >> These are used for the domain routing (hwirq -> desc/virq), and not
> >> needed. Just use the same id as virq (at allocation time), and get rid
> >> of these data structures/corresponding functions. The lookup in the
> >> interrupt handler via imsic_local_priv.vectors doesn't care about
> >> hwirq. This is what x86 does... The imsic_vector roughly corresponds
> >> to apic_chip_data (nit: imsic_vector could have the chip_data suffix
> >> as well, at least it would have helped me!)
> >
> > Yes, imsic_vector corresponds to apic_chip_data in the x86 world.
>
> ...and I'm trying to ask the following: Given that the IMSIC is pretty
> much x86 vector (arch/x86/kernel/apic/vector.c), I'm trying to figure
> out the rationale for why the IMSIC has all the extra housekeeping data
> not needed by x86. The x86 code is battle-proven, and has had to deal
> with all kinds of quirks (e.g. lost interrupts on affinity changes).

Understood.

>
> >> Moving/affinity changes. The moving of a vector to another CPU
> >> currently involves:
> >>
> >> 1. Allocate a new vector from the matrix
> >> 2. Disable/enable the corresponding per-cpu ids_enabled_bitmap (nested
> >>    spinlocks)
> >> 3. Trigger two IPIs to apply the bitmap
> >> 4. On each target CPU (imsic_local_sync()), loop over the bitmap and
> >>    flip all bits, and potentially rearm
> >>
> >> This seems a bit heavy-weight: Why are you explicitly setting/clearing
> >> all the bits in a loop at the local sync?
> >
> > This can certainly be optimized by introducing another
> > ids_dirty_bitmap. I will add this in the next revision.
>
> I'd rather have fewer maps and fewer locks! ;-)
>
> >> x86 does it a bit differently (more lazily): The chip_data has
> >> prev_{cpu,vector}/move_in_progress fields, and keeps both vectors
> >> enabled until there's an interrupt on the new vector, and then the old
> >> one is cleaned up (irq_complete_move()).
> >>
> >> Further: When it's time to remove the old vector, x86 doesn't trigger
> >> an IPI on the disabling side, but queues a cleanup job on a per-cpu
> >> list and arms a timeout. So, the per-cpu chip_data (per-cpu
> >> "vectors" in your series) can reside in two places during the transit.
> >
> > We can't avoid IPIs when moving vectors from one CPU to another
> > CPU because IMSIC id enable/disable is only possible through
> > CSRs. Also, keep in mind that an irq affinity change might be initiated
> > on CPU X for some interrupt targeting CPU Y which is then changed
> > to target CPU Z.
> >
> > In the case of x86, they have memory-mapped registers which
> > allow one CPU to enable/disable the ID of another CPU.
>
> Nope. Same mechanics on x86 -- the cleanup has to be done on the
> originating core. What I asked was "what about using a timer instead of
> an IPI?". I think this came up in the last rev as well?
>
> Check out commit bdc1dad299bb ("x86/vector: Replace
> IRQ_MOVE_CLEANUP_VECTOR with a timer callback"). Specifically, the
> comment about lost interrupts, and the rationale for keeping the original
> target active until there's a new interrupt on the new CPU.

Trying the timer-based cleanup is still TBD on my side because with v12
my goal was to implement per-device MSI domains. Let me explore it for
v13.
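If I understand that commit correctly, the shape would be roughly the
following (untested sketch; the imsic_* helpers, the clean_node/cpu/
local_id fields, locking, and the per-CPU timer init are all made up or
omitted here):

struct imsic_cleanup {
        struct hlist_head head;
        struct timer_list timer;  /* timer_setup(..., TIMER_PINNED) at boot */
};

static DEFINE_PER_CPU(struct imsic_cleanup, imsic_cleanup);

/* Timer callback; runs on the old CPU, so plain CSR accesses work here. */
static void imsic_cleanup_fn(struct timer_list *tmr)
{
        struct imsic_cleanup *cl = container_of(tmr, struct imsic_cleanup, timer);
        struct hlist_node *tmp;
        struct imsic_vector *vec;

        /* Locking against imsic_schedule_cleanup() omitted for brevity. */
        hlist_for_each_entry_safe(vec, tmp, &cl->head, clean_node) {
                hlist_del_init(&vec->clean_node);
                imsic_local_id_disable(vec->local_id);  /* CSR write */
                imsic_matrix_free(vec->cpu, vec->local_id);
        }
}

/*
 * Called once the first interrupt has arrived on the new (cpu, id),
 * i.e. the IMSIC equivalent of x86's irq_complete_move().
 */
static void imsic_schedule_cleanup(struct imsic_vector *old)
{
        struct imsic_cleanup *cl = per_cpu_ptr(&imsic_cleanup, old->cpu);

        hlist_add_head(&old->clean_node, &cl->head);
        if (!timer_pending(&cl->timer)) {
                cl->timer.expires = jiffies + 1;
                add_timer_on(&cl->timer, old->cpu);
        }
}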
> > > > The "ids_enabled_bitmap", "dummy hwirqs" and private imsic_vectors > > are required since the matrix allocator only manages allocation of > > per-CPU IDs. > > The information in ids_enabled_bitmap is/could be inherent in > imsic_local_priv.vectors (guess what x86 does... ;-)). > > Dummy hwirqs could be replaced with the virq. > > Hmm, seems like we're talking past each other, or at least I get the > feeling I can't get my opinions out right. I'll try to do a quick PoC, > to show you what I mean. That's probably easier than just talking about > it. ...and maybe I'll come realizing I'm all wrong! I suggest to wait for my v13 and try something on top of that otherwise we might duplicate efforts. > > My reaction is -- you're doing a lot of householding with a lot of > locks, and my worry is that we'll just end up with same issues/bloat > that x86 once had (has? ;-)). > > >> Random thought: Do we need to explicitly disable (csr) the vector, > >> when we're changing the affinity? What if we just leave it enabled, > >> and only when mask/unmask is performed it's actually explicitly masked > >> (writes to the csr)? > > > > We should not leave it enabled because some rough/buggy device > > can inject spurious interrupts using MSI writes to unused enabled > > interrupts. > > OK! > > >> > >> Missing features (which can be added later): > >> * Reservation mode/activate support (allocate many MSI, but only > >> request/activate a subset) > > > > I did not see any PCIe or platform device requiring this kind of > > reservation. Any examples ? > > It's not a requirement. Some devices allocate a gazillion interrupts > (NICs with many QoS queues, e.g.), but only activate a subset (via > request_irq()). A system using these kind of devices might run out of > interrupts. I don't see how this is not possible currently. > > Problems you run into once you leave the embedded world, pretty much. > > >> * Handle managed interrupts > > > > Any examples of managed interrupts in the RISC-V world ? > > E.g. all nvme drives: nvme_setup_irqs(), and I'd assume contemporary > netdev drivers would use it. Typically devices with per-cpu queues. We have tested with NVMe devices, e1000e, VirtIO-net, etc and I did not see any issue. We can always add new features as separate incremental series as long as there is clear use-cause backed by real-world devices. > > >> * There might be some irqd flags are missing, which mostly cpuhp care > >> about (e.g. irqd_*_single_target())... > > > > Okay, let me check and update. > > I haven't dug much into cpuhp, so I'm out on a limb here... > > >> Finally; Given that the APLIC requires a lot more patches, depending > >> on how the review process moves on -- maybe the IMSIC side could go as > >> a separate series? > >> > > > > The most popular implementation choice across RISC-V platforms > > will be IMSIC + APLIC so both drivers should go together. In fact, > > we need both drivers for the QEMU virt machine as well because > > UART interrupt (and other wired interrupts) on the QEMU virt > > machine goes through APLIC. > > Thanks for clearing that out! Hmm, an IMSIC only QEMU would be awesome. > > > Cheers, > Björn Regards, Anup