Re: [PATCH] MIPS: CM: Drop WARN_ON(vp != 0)

Paul Burton <paul.burton@xxxxxxxx> · Wed, 6 Dec 2017 09:43:01 -0800

Hi Ralf,

On Wed, Dec 06, 2017 at 02:57:45PM +0100, Ralf Baechle wrote:
> On Tue, Dec 05, 2017 at 10:28:22PM +0000, James Hogan wrote:
> 
> > Since commit 68923cdc2eb3 ("MIPS: CM: Add cluster & block args to
> > mips_cm_lock_other()"), mips_smp_send_ipi_mask() has used
> > mips_cm_lock_other_cpu() with each CPU number, rather than
> > mips_cm_lock_other() with the first VPE in each core. Prior to r6,
> > multicore multithreaded systems such as dual-core dual-thread
> > interAptivs with CPU Idle enabled (e.g. MIPS Creator Ci40) results in
> > mips_cm_lock_other() repeatedly hitting WARN_ON(vp != 0).
> > 
> > There doesn't appear to be anything fundamentally wrong about passing a
> > non-zero VP/VPE number, even if it is a core's region that is locked
> > into the other region before r6, so remove that particular WARN_ON().
> > 
> > Fixes: 68923cdc2eb3 ("MIPS: CM: Add cluster & block args to mips_cm_lock_other()")
> > Signed-off-by: James Hogan <jhogan@xxxxxxxxxx>
> > Reviewed-by: Paul Burton <paul.burton@xxxxxxxx>
> > Cc: Ralf Baechle <ralf@xxxxxxxxxxxxxx>
> > Cc: linux-mips@xxxxxxxxxxxxxx
> > Cc: <stable@xxxxxxxxxxxxxxx> # 4.14+
> > ---
> >  arch/mips/kernel/mips-cm.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/arch/mips/kernel/mips-cm.c b/arch/mips/kernel/mips-cm.c
> > index dd5567b1e305..8f5bd04f320a 100644
> > --- a/arch/mips/kernel/mips-cm.c
> > +++ b/arch/mips/kernel/mips-cm.c
> > @@ -292,7 +292,6 @@ void mips_cm_lock_other(unsigned int cluster, unsigned int core,
> >  				  *this_cpu_ptr(&cm_core_lock_flags));
> >  	} else {
> >  		WARN_ON(cluster != 0);
> > -		WARN_ON(vp != 0);
> 
> I think the reason is that for a while the combination of SMP/CMP with
> MT was at the bottom of priorities and nobody really cared about it so
> a WARN_ON was thrown in.  Which in this case might well itself be a bug!
> 
>   Ralf

Well, the WARN_ON came about because:

 - Originally mips_cm_lock_other() had no VP argument, because the CM
   "other" register in CM < 3 has no VP field. ie. the CM "other" region
   always accesses registers associated with a core rather than a VP(E).

 - With CM >= 3 the "other" register (renamed "redirect") gained a VP
   field & the ability to control the CPC & GIC redirect regions, which
   were formerly controlled by separate "other" registers in the CPC &
   GIC register interfaces. In the GIC case the other/redirect region
   actually targets a VP, not a core.

 - Since the CM < 3 other register has no VP field the WARN_ON made
   sense - if code is actually trying to access registers associated
   with a particular VP then that's a bug because that's simply not
   possible. The warning called out those cases by requiring CM < 3
   paths to explicitly pass VP=0.

That was all well & good until I added the mips_cm_lock_other_cpu()
wrapper for convenience. It passes the CPUs cluster, core & VP(E)
numbers to mips_cm_lock_other() which is problematic if it's used in
paths that may run on CM < 3. There is only one such case right now, in
mips_smp_send_ipi_mask() when powering up a CPU in a non-coherent idle
state. So effectively you have to have the CPS cpuidle driver enabled to
hit this.

A potential alternative to this patch would be to have
mips_cm_lock_other_cpu() check the CM version & use VP=0 for CM < 3, but
I think the utility of that is questionable so dropping the WARN_ON as
James has done seems reasonable.

We actually do test mainline, mips-for-linux-next & our internal kernel
branches a bunch on affected systems, most notably interAptiv in the
context of either the Boston or Ci40 platforms. Unfortunately that
didn't catch the bug here because the cpuidle driver isn't enabled in
the kernel configs that our CI system builds. That's something we'll
change. If that weren't the case then we ought to have caught this once
it hit -next. The patchset for multi-cluster support has also shrunk
quite a lot with the merging of the initial set, so our internal
branches are now closer to what will next be submitted which also helps
since we might spot issues even earlier in tests of those internal
branches.

Having the mips-for-linux-next branch updated regularly throughout the
cycle would help immensely in giving us the best chance of catching bugs
like this in a timely fashion, since when it all happens right before
the merge window we don't get much testing time on the code actually
hitting mainline until after it has gone in.

Anyway, hopefully that post-mortem shines some light on what went wrong
here!

Thanks,
    Paul