On 03/01/18 18:13, Ganapatrao Kulkarni wrote: > On Wed, Jan 3, 2018 at 5:06 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote: >> On 03/01/18 11:20, Ganapatrao Kulkarni wrote: >>> On Wed, Jan 3, 2018 at 3:43 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote: >>>> On 03/01/18 09:35, Ganapatrao Kulkarni wrote: >>>>> Hi Marc, >>>>> >>>>> On Wed, Jan 3, 2018 at 2:17 PM, Marc Zyngier <marc.zyngier@xxxxxxx> wrote: >>>>>> On 03/01/18 06:32, Ganapatrao Kulkarni wrote: >>>>>>> When an interrupt is moved across node collections on ThunderX2 >>>>>> >>>>>> node collections? >>>>> >>>>> ok, i will rephrase it. >>>>> i was intended to say cross NUMA node collection/cpu affinity change. >>>>> >>>>>> >>>>>>> multi Socket platform, an interrupt stops routed to new collection >>>>>>> and results in loss of interrupts. >>>>>>> >>>>>>> Adding workaround to issue INV after MOVI for cross-node collection >>>>>>> move to flush out the cached entry. >>>>>>> >>>>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@xxxxxxxxxx> >>>>>>> --- >>>>>>> Documentation/arm64/silicon-errata.txt | 1 + >>>>>>> arch/arm64/Kconfig | 11 +++++++++++ >>>>>>> drivers/irqchip/irq-gic-v3-its.c | 24 ++++++++++++++++++++++++ >>>>>>> 3 files changed, 36 insertions(+) >>>>>>> >>>>>>> diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt >>>>>>> index fc1c884..fb27cb5 100644 >>>>>>> --- a/Documentation/arm64/silicon-errata.txt >>>>>>> +++ b/Documentation/arm64/silicon-errata.txt >>>>>>> @@ -63,6 +63,7 @@ stable kernels. >>>>>>> | Cavium | ThunderX Core | #27456 | CAVIUM_ERRATUM_27456 | >>>>>>> | Cavium | ThunderX Core | #30115 | CAVIUM_ERRATUM_30115 | >>>>>>> | Cavium | ThunderX SMMUv2 | #27704 | N/A | >>>>>>> +| Cavium | ThunderX2 ITS | #174 | CAVIUM_ERRATUM_174 | >>>>>>> | Cavium | ThunderX2 SMMUv3| #74 | N/A | >>>>>>> | Cavium | ThunderX2 SMMUv3| #126 | N/A | >>>>>>> | | | | | >>>>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>>>>>> index c9a7e9e..71a7e30 100644 >>>>>>> --- a/arch/arm64/Kconfig >>>>>>> +++ b/arch/arm64/Kconfig >>>>>>> @@ -461,6 +461,17 @@ config ARM64_ERRATUM_843419 >>>>>>> >>>>>>> If unsure, say Y. >>>>>>> >>>>>>> +config CAVIUM_ERRATUM_174 >>>>>>> + bool "Cavium ThunderX2 erratum 174" >>>>>>> + depends on NUMA >>>>>> >>>>>> Why? This system will be affected no matter whether NUMA is selected or not. >>>>> >>>>> it does not makes sense to enable on non-NUMA/single socket platforms. >>>>> By default NUMA is enabled on ThunderX2 dual socket platforms. >>>> >>>> <quote> >>>> config ARCH_THUNDER2 >>>> bool "Cavium ThunderX2 Server Processors" >>>> select GPIOLIB >>>> help >>>> This enables support for Cavium's ThunderX2 CN99XX family of >>>> server processors. >>>> </quote> >>>> >>>> Do you see any NUMA here? I can perfectly compile a kernel with both >>>> sockets, and not using NUMA. NUMA has to do with memory, and not interrupts. >>> >>> ok, i will remote it. >>>> >>>>> >>>>>> >>>>>>> + default y >>>>>>> + help >>>>>>> + LPI stops routed to redistributors after inter node collection >>>>>>> + move in ITS. Enable workaround to invalidate ITS entry after >>>>>>> + inter-node collection move. >>>>>> >>>>>> That's a very terse description. Nobody knows what an LPI, a >>>>>> redistributor or a collection is. Please explain what the erratum is in >>>>>> layman's terms (Cavium ThunderX2 systems may loose interrupts on >>>>>> affinity change) so that people understand whether or not they are affected. >>>>> >>>>> ok, i will rephrase it in next version. >>>>>> >>>>>>> + >>>>>>> + If unsure, say Y. >>>>>>> + >>>>>>> config CAVIUM_ERRATUM_22375 >>>>>>> bool "Cavium erratum 22375, 24313" >>>>>>> default y >>>>>>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c >>>>>>> index 06f025f..d8b9c96 100644 >>>>>>> --- a/drivers/irqchip/irq-gic-v3-its.c >>>>>>> +++ b/drivers/irqchip/irq-gic-v3-its.c >>>>>>> @@ -46,6 +46,7 @@ >>>>>>> #define ITS_FLAGS_CMDQ_NEEDS_FLUSHING (1ULL << 0) >>>>>>> #define ITS_FLAGS_WORKAROUND_CAVIUM_22375 (1ULL << 1) >>>>>>> #define ITS_FLAGS_WORKAROUND_CAVIUM_23144 (1ULL << 2) >>>>>>> +#define ITS_FLAGS_WORKAROUND_CAVIUM_174 (1ULL << 3) >>>>>>> >>>>>>> #define RDIST_FLAGS_PROPBASE_NEEDS_FLUSHING (1 << 0) >>>>>>> >>>>>>> @@ -1119,6 +1120,12 @@ static int its_set_affinity(struct irq_data *d, const struct cpumask *mask_val, >>>>>>> if (cpu != its_dev->event_map.col_map[id]) { >>>>>>> target_col = &its_dev->its->collections[cpu]; >>>>>>> its_send_movi(its_dev, target_col, id); >>>>>>> + if (its_dev->its->flags & ITS_FLAGS_WORKAROUND_CAVIUM_174) { >>>>>>> + /* Issue INV for cross node collection move. */ >>>>>>> + if (cpu_to_node(cpu) != >>>>>>> + cpu_to_node(its_dev->event_map.col_map[id])) >>>>>>> + its_send_inv(its_dev, id); >>>>>>> + } >>>>>> >>>>>> What happens if an interrupt happens after the MOV, but before the INV? >>>>> >>>>> there can be drop, if interrupt happens before INV, however, it is >>>>> highly unlikely that we will hit the issue since MOVI and INV are >>>>> executed back to back. this workaround fixed issue seen on couple of >>>>> IOs. >>>> >>>> Really? So this doesn't fix anything, and the device may just wait >>>> forever for the CPU to service an LPI that was never delivered. I'm >>>> sorry, but that's not an acceptable workaround. >>>> >>>> I can see two solutions: >>>> 1) you inject an interrupt after the INV as you may have lost at least one >>>> 2) you restrict the affinity of LPIs to a single socket >>>> >>>> (1) will generate spurious interrupts, but will be safe. (2) will result >>>> in an unbalanced system. Pick your poison... >>>> >>>> You may be able to achieve something by disabling the LPI before >>>> performing the MOVI/INV sequence, and reenable it afterwards, but only >>>> you can tell if this could work with your HW. >>> >>> thanks for the suggestion, i will try out disable/enable LPI. >>> the sequence would be, >>> Disable LPI in rdist of cpu1 >>> MOVI from cpu1 to cpu2 >>> INV >>> enable LPI in rdist of cpu1 >> >> You cannot disable LPIs at the redistributor level. For a start, this is >> not allowed by the architecture. It would also result in all the other >> LPIs to be lost. What you can do is to disable the single LPI you're >> about to move: >> >> its_mask_irq(d); >> MOVI >> its_unmask_irq(d); > > thanks, this also fixes the issue. > i shall post v2 with this sequence. Does it really? What if you get this LPI between the MOVI and the INV? I bet it drops it on the floor. M. -- Jazz is not dead. It just smells funny... -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html