Hi Ralf, On Wed, Dec 06, 2017 at 02:57:45PM +0100, Ralf Baechle wrote: > On Tue, Dec 05, 2017 at 10:28:22PM +0000, James Hogan wrote: > > > Since commit 68923cdc2eb3 ("MIPS: CM: Add cluster & block args to > > mips_cm_lock_other()"), mips_smp_send_ipi_mask() has used > > mips_cm_lock_other_cpu() with each CPU number, rather than > > mips_cm_lock_other() with the first VPE in each core. Prior to r6, > > multicore multithreaded systems such as dual-core dual-thread > > interAptivs with CPU Idle enabled (e.g. MIPS Creator Ci40) results in > > mips_cm_lock_other() repeatedly hitting WARN_ON(vp != 0). > > > > There doesn't appear to be anything fundamentally wrong about passing a > > non-zero VP/VPE number, even if it is a core's region that is locked > > into the other region before r6, so remove that particular WARN_ON(). > > > > Fixes: 68923cdc2eb3 ("MIPS: CM: Add cluster & block args to mips_cm_lock_other()") > > Signed-off-by: James Hogan <jhogan@xxxxxxxxxx> > > Reviewed-by: Paul Burton <paul.burton@xxxxxxxx> > > Cc: Ralf Baechle <ralf@xxxxxxxxxxxxxx> > > Cc: linux-mips@xxxxxxxxxxxxxx > > Cc: <stable@xxxxxxxxxxxxxxx> # 4.14+ > > --- > > arch/mips/kernel/mips-cm.c | 1 - > > 1 file changed, 1 deletion(-) > > > > diff --git a/arch/mips/kernel/mips-cm.c b/arch/mips/kernel/mips-cm.c > > index dd5567b1e305..8f5bd04f320a 100644 > > --- a/arch/mips/kernel/mips-cm.c > > +++ b/arch/mips/kernel/mips-cm.c > > @@ -292,7 +292,6 @@ void mips_cm_lock_other(unsigned int cluster, unsigned int core, > > *this_cpu_ptr(&cm_core_lock_flags)); > > } else { > > WARN_ON(cluster != 0); > > - WARN_ON(vp != 0); > > I think the reason is that for a while the combination of SMP/CMP with > MT was at the bottom of priorities and nobody really cared about it so > a WARN_ON was thrown in. Which in this case might well itself be a bug! > > Ralf Well, the WARN_ON came about because: - Originally mips_cm_lock_other() had no VP argument, because the CM "other" register in CM < 3 has no VP field. ie. the CM "other" region always accesses registers associated with a core rather than a VP(E). - With CM >= 3 the "other" register (renamed "redirect") gained a VP field & the ability to control the CPC & GIC redirect regions, which were formerly controlled by separate "other" registers in the CPC & GIC register interfaces. In the GIC case the other/redirect region actually targets a VP, not a core. - Since the CM < 3 other register has no VP field the WARN_ON made sense - if code is actually trying to access registers associated with a particular VP then that's a bug because that's simply not possible. The warning called out those cases by requiring CM < 3 paths to explicitly pass VP=0. That was all well & good until I added the mips_cm_lock_other_cpu() wrapper for convenience. It passes the CPUs cluster, core & VP(E) numbers to mips_cm_lock_other() which is problematic if it's used in paths that may run on CM < 3. There is only one such case right now, in mips_smp_send_ipi_mask() when powering up a CPU in a non-coherent idle state. So effectively you have to have the CPS cpuidle driver enabled to hit this. A potential alternative to this patch would be to have mips_cm_lock_other_cpu() check the CM version & use VP=0 for CM < 3, but I think the utility of that is questionable so dropping the WARN_ON as James has done seems reasonable. We actually do test mainline, mips-for-linux-next & our internal kernel branches a bunch on affected systems, most notably interAptiv in the context of either the Boston or Ci40 platforms. Unfortunately that didn't catch the bug here because the cpuidle driver isn't enabled in the kernel configs that our CI system builds. That's something we'll change. If that weren't the case then we ought to have caught this once it hit -next. The patchset for multi-cluster support has also shrunk quite a lot with the merging of the initial set, so our internal branches are now closer to what will next be submitted which also helps since we might spot issues even earlier in tests of those internal branches. Having the mips-for-linux-next branch updated regularly throughout the cycle would help immensely in giving us the best chance of catching bugs like this in a timely fashion, since when it all happens right before the merge window we don't get much testing time on the code actually hitting mainline until after it has gone in. Anyway, hopefully that post-mortem shines some light on what went wrong here! Thanks, Paul