On Wed, Mar 05 2025 at 19:21, Tsai Sung-Fu wrote:
> On Tue, Mar 4, 2025 at 5:46 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>> >> > +	ret = desc_parent->irq_data.chip->irq_set_affinity(&desc_parent->irq_data,
>> >> > +							   mask_result, force);
>> >>
>> >> Again. Completely unserialized.
>> >
>> > The reason why we remove the desc_parent->lock protection is because
>> > the chained IRQ structure didn't export the parent IRQ to user space, so we
>> > think this call path should be the only caller. And since we hold &pp->lock
>> > at the beginning, that should help to protect against concurrent
>> > modification here?
>>
>> "Should be the only caller" is not really a technical argument. If you
>> make assumptions like this, then you have to come up with a proper
>> explanation why this is correct under all circumstances and with all
>> possible variants of parent interrupt chips.
>>
>> Aside of that, fiddling with the internals of interrupt descriptors is
>> not really justified here. What's wrong with using the existing
>> irq_set_affinity() mechanism?
>
> Using irq_set_affinity() would give us a circular deadlock warning
> at the probe stage. The irq_startup() flow holds the global lock from
> irq_setup_affinity(), so a deadlock scenario like the one below might
> happen: irq_set_affinity() would try to acquire &irq_desc_lock_class
> while dw_pci_msi_set_affinity() already holds &pp->lock. To resolve
> that we would like to reuse the idea of replacing the global lock in
> irq_set_affinity() with a PERCPU structure from this patch ->
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=64b6d1d7a84538de34c22a6fc92a7dcc2b196b64
> Just would like to know if this looks feasible to you?

It won't help. See below.
> Chain exists of:
>   &irq_desc_lock_class --> mask_lock --> &pp->lock
>
> Possible unsafe locking scenario:
>
>        CPU0                    CPU1
>        ----                    ----
>   lock(&pp->lock);
>                                lock(mask_lock);
>                                lock(&pp->lock);
>   lock(&irq_desc_lock_class);

Indeed. Back to irq_set_affinity(). I did not think about the more
problematic issue with calling irq_set_affinity(): the per-CPU mask is
then used recursively, which is a bad idea.

But looking deeper, your implementation breaks some other things, e.g.
the affinity coupling of interrupt threads on the MSI side. They have
to be coupled to the effective affinity of the underlying hard
interrupt. That's a solvable problem, but not in the driver.

Aside of the fact that this hardware design is a trainwreck, which
defeats the intended flexibility of MSI, solving this at the chip
driver level ends up in an incomplete and hacky solution. Also, as this
is not the only IP block which was cobbled together mindlessly that
way, we really don't want to proliferate this horrorshow all over the
place.

This really wants generic infrastructure which handles all aspects of
this correctly once and for all of these hardware trainwrecks. That
needs some thought and effort, but it's definitely preferred over
half baked driver specific hackery, which fiddles with the guts of the
interrupt descriptors as it sees fit.

From a layering POV, an interrupt chip driver should not even care
about the fact that an interrupt descriptor exists. We've spent a lot
of effort to decouple these things, and just because C allows it does
not mean it's correct or a good idea to ignore the layering.

The approach you have taken has a massive downside: it forces all (32)
MSI interrupts, which are demultiplexed from the underlying parent
interrupt, to have the same affinity, and the user space interface in
/proc/irq/*/affinity becomes more than confusing. This is not what
user space expects from an interface which controls the affinity of an
individual interrupt.
No user and no tool expects that it has to deal with conflicting
settings for MSI interrupts which are independent of each other. Tools
like irqbalanced will be pretty unhappy about the behaviour and you
would have to teach them about the severe limitations of this.

I'm absolutely not convinced that this solves more problems than it
creates. In fact, you can achieve the very same outcome with a clean
and straightforward ten line change: instead of the hidden chained
handlers, request regular interrupts for demultiplexing and give them
proper names: 'DWC_MSI_DEMUX-0-31', 'DWC_MSI_DEMUX-32-63', ...

That provides exactly the same mechanism w/o all the hackery and
creates the same (solvable) problem vs. interrupt thread affinity, but
provides an understandable user interface. No?

If you really want to provide a useful and flexible mechanism to steer
the affinity of the individual MSI interrupts, which are handled by
each control block, then you have to think about a completely
different approach.

The underlying interrupt per control block will always be affine to a
particular CPU. But it's not written in stone that the demultiplexing
interrupt actually has to handle the pending interrupts in the context
of the demultiplexer, at least not for MSI interrupts, which are edge
type and from the PCI device side considered 'fire and forget'.

So right now the demultiplex handler does:

    bits = read_dbi(...);
    for_each_bit(bit, bits)
        generic_handle_domain_irq(domain, base + bit);

generic_handle_domain_irq() gets the interrupt mapping for the
hardware interrupt number and invokes the relevant flow handler
directly from the same context, which means that the affinity of all
those interrupts is bound to the affinity of the underlying
demultiplex interrupt.
Ignoring the devil in the details, this can also be handled by
delegating the handling to a different CPU:

    bits = read_dbi(...);
    for_each_bit(bit, bits)
        generic_handle_demux_domain_irq(domain, base + bit);

where generic_handle_demux_domain_irq() does:

    desc = irq_resolve_mapping(domain, hwirq);
    if (desc->target_cpu == smp_processor_id()) {
        handle_irq_desc(desc);
    } else {
        // Issue DWC ack() here? - See below
        irq_work_queue_on(&desc->irq_work, desc->target_cpu);
    }

The generic irq_work handler does:

    desc = container_of(work, struct irq_desc, irq_work);
    handle_irq_desc(desc);

That's obviously asynchronous, but in the case of edge type MSI
interrupts, that's not a problem at all.

The real problems are:

  1) Handling the pending bit of the interrupt in the DBI, i.e. what
     dw_pci_bottom_ack() does

  2) Life time of the interrupt descriptor

  3) Slightly increased latency of the rerouted interrupts

  4) Affinity of the demultiplex interrupt

#1 Needs some thought. From the top of my head I think it would
   require that dw_pci_bottom_ack() is issued in the context of the
   demultiplex handler as it would otherwise be retriggered, but that
   just needs some research in the datasheet and experimental
   validation.

   The actual interrupt flow handler for the MSI interrupt is then no
   longer handle_edge_irq() as it should not issue the ack() again.
   From a correctness POV vs. interrupt handling of an edge type
   interrupt, that's perfectly fine.

#2 Trivial enough to solve by flushing and invalidating the irq_work
   before shutdown/removal.

#3 That might be an issue for some extreme scenarios, but in the
   general case I claim that it does not matter at all. Those
   scenarios where it really matters can affine the interrupt to the
   affinity of the demultiplexing interrupt, which leads to #4.

#4 That's a trivial problem to solve. There is absolutely no technical
   reason to make the demultiplex interrupts hidden. Demultiplexing
   just works fine from the context of regular interrupts.
There are some other minor issues like CPU hotplug, but I can't see
anything truly complex in this design right now. Famous last words :)

At the driver level this becomes really simple. Affinity changes are
just selecting a target CPU and storing it for a trivial comparison to
avoid a CPU mask check in the demultiplex path. No fiddling with
sibling or parent interrupts, no locking issues. Thread affinity and
other things just work as expected.

I'm really tempted to play with that. Such infrastructure would allow
us to support PCI/multi-MSI on x86 without interrupt remapping, which
has been on the wishlist of a lot of people for many years.

Thanks,

        tglx