On Thu, 9 Aug 2018 20:15:04 +0530
Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx> wrote:

> Hello Nicholas,
>
> On Fri, Aug 03, 2018 at 02:13:50PM +1000, Nicholas Piggin wrote:
> > Reimplement Book3S idle code in C, moving POWER7/8/9 implementation
> > specific HV idle code to the powernv platform code. Generic book3s
> > assembly stubs are kept in common code and used only to save and
> > restore the stack frame and non-volatile GPRs just before going to
> > idle and just after waking up, which can return into C code.
> >
> > Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> > way this stuff will cope with non-trivial new CPU implementation
> > details, firmware changes, etc., without becoming unmaintainable.
> >
> > This is not a strict translation to C code, there are some
> > differences.
> >
> > - Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> >   but saves and restores them all explicitly.
> >
> > - The optimisation where EC=ESL=0 idle modes did not have to save
> >   GPRs or change MSR is restored, because it's now simple to do.
> >   State loss modes that did not actually lose GPRs can use this
> >   optimization too.
> >
> > - KVM secondary entry code is now more of a call/return style
> >   rather than spaghetti. nap_state_lost is not required because
> >   KVM always returns via the NVGPR restoring path.
> >
> > This seems pretty solid, so it needs more review and testing now. The
> > KVM changes are pretty significant and complicated. POWER7 needs to
> > be tested too.
> >
> > Open question:
> > - Why does P9 restore some of the PMU SPRs (but not others), and
> >   P8 only zeroes them?
>
> We are restoring MMCR0 (from the value saved in the stack) and MMCR1,
> MMCR2, and MMCRA in the stop_sprs in PACA. We saw that MMCRH and MMCRC
> are cleared on both POWER8 and POWER9. Hence we didn't restore
> them. MMCRS is being initialized by the KVM code.
>
> Is there anything apart from these that needs to be restored?

No, I'm wondering why it is we restore those on POWER9? POWER8 does not
restore them, only zeroes them. What is the difference with POWER9?

I will leave that in for now so we don't change too much with one patch,
but it would be nice to document a bit better the reasons for saving or
clearing SPRs.

> >
> > Since RFC v1:
> > - Now tested and working with POWER9 hash and radix.
> > - KVM support added. This took a bit of work to untangle and might
> >   still have some issues, but POWER9 seems to work including hash on
> >   radix with dependent threads mode.
> > - This snowballed a bit because of KVM and other details making it
> >   not feasible to leave POWER7/8 code alone. That's only half done
> >   at the moment.
> > - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
> >   support done it might be another hundred or so lines of C.
> >
> > Since RFC v2:
> > - Fixed deep state SLB reloading.
> > - Now tested and working with POWER8.
> > - Accounted for most feedback.
> >
> > Since RFC v3:
> > - Rebased to powerpc merge + idle state bugfix.
> > - Split SLB flush/restore code out and shared with MCE code (pseries
> >   MCE patches can also use it).
> > - More testing on POWER8 including KVM with secondaries.
> > - Performance testing looks good. EC=ESL=0 is about 5% faster, other
> >   stop states look a little faster too.
> > - Adjusted SPR saving to handle POWER7, haven't tested it.
>
> This patch looks good to me.
>
> A couple of comments below.
>
> > ---
> [... snip ..]
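(To make the PMU question above concrete: the asymmetry I mean is roughly
the following on the wakeup side. This is only an illustration; the sprs.*
names stand in for the stop_sprs/p9_sprs save areas and are not necessarily
what the final code uses.)

	/* POWER9 wakeup: restore the PMU SPRs saved before entering stop */
	mtspr(SPRN_MMCR1, sprs.mmcr1);
	mtspr(SPRN_MMCR2, sprs.mmcr2);
	mtspr(SPRN_MMCRA, sprs.mmcra);

	/* POWER8 wakeup: the same SPRs are only cleared, never restored */
	mtspr(SPRN_MMCR1, 0);
	mtspr(SPRN_MMCR2, 0);
	mtspr(SPRN_MMCRA, 0);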
> > @@ -178,23 +177,30 @@ struct paca_struct {
> >  #endif
> >
> >  #ifdef CONFIG_PPC_POWERNV
> > -	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> > -	u32 *core_idle_state_ptr;
> > -	u8 thread_idle_state;	/* PNV_THREAD_RUNNING/NAP/SLEEP */
> > -	/* Mask to indicate thread id in core */
> > -	u8 thread_mask;
> > -	/* Mask to denote subcore sibling threads */
> > -	u8 subcore_sibling_mask;
> > -	/* Flag to request this thread not to stop */
> > -	atomic_t dont_stop;
> > -	/* The PSSCR value that the kernel requested before going to stop */
> > -	u64 requested_psscr;
> > +	/* PowerNV idle fields */
> > +	/* PNV_CORE_IDLE_* bits, all siblings work on thread 0 paca */
> > +	unsigned long idle_state;
> > +	union {
> > +		/* P7/P8 specific fields */
> > +		struct {
> > +			/* PNV_THREAD_RUNNING/NAP/SLEEP */
> > +			u8 thread_idle_state;
> > +			/* Mask to indicate thread id in core */
> > +			u8 thread_mask;
>
> This is no longer needed. We can get this from cpu_thread_in_core()
> from the C code.
>
> The only place where we are currently using this is to DUMP the value
> of the thread_mask from xmon but not anywhere else in the idle entry
> code.

Good catch, removed it.

> > +			/* Mask to denote subcore sibling threads */
> > +			u8 subcore_sibling_mask;
> > +		};
> >
> > -	/*
> > -	 * Save area for additional SPRs that need to be
> > -	 * saved/restored during cpuidle stop.
> > -	 */
> > -	struct stop_sprs stop_sprs;
> > +		/* P9 specific fields */
> > +		struct {
> > +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +			/* The PSSCR value that the kernel requested before going to stop */
> > +			u64 requested_psscr;
> > +			/* Flag to request this thread not to stop */
> > +			atomic_t dont_stop;
> > +#endif
> > +		};
> > +	};
> >  #endif
>
> [..snip..]
>
> > --- a/arch/powerpc/kernel/exceptions-64s.S
> > +++ b/arch/powerpc/kernel/exceptions-64s.S
> > @@ -136,8 +136,9 @@ TRAMP_KVM(PACA_EXNMI, 0x100)
> >
> >  #ifdef CONFIG_PPC_P7_NAP
> >  EXC_COMMON_BEGIN(system_reset_idle_common)
> > -	mfspr	r12,SPRN_SRR1
> > -	b	pnv_powersave_wakeup
> > +	mfspr	r3,SPRN_SRR1
> > +	bltlr	cr3	/* no state loss, return to idle caller */
>
> So, if we are in an ESL=EC=0 stop, and we get an xscom SRESET,
> I guess the expected value in SPRN_SRR1[46:47] will be
> SRR1_WS_NOLOSS.

Yes it should, because ESL=0.

> In this case, though the LR would correspond to the caller of
> isa3_idle_stop_noloss(), r3 would have SPRN_SRR1 as opposed to 0, which
> is what we would have returned from isa3_idle_stop_noloss().
>
> Do we use the value to service the NMI?

Yes. The system reset exception is cleared when the interrupt is delivered,
so we can't ignore it.

> > +	b	idle_return_gpr_loss
> >  #endif
> >
> >  	/*
> > @@ -416,17 +417,17 @@ EXC_COMMON_BEGIN(machine_check_idle_common)
> >  	 * Then decrement MCE nesting after finishing with the stack.
> >  	 */
> >  	ld	r3,_MSR(r1)
> > +	ld	r4,_LINK(r1)
> >
> >  	lhz	r11,PACA_IN_MCE(r13)
> >  	subi	r11,r11,1
> >  	sth	r11,PACA_IN_MCE(r13)
> >
> > -	/* Turn off the RI bit because SRR1 is used by idle wakeup code. */
> > -	/* Recoverability could be improved by reducing the use of SRR1. */
> > -	li	r11,0
> > -	mtmsrd	r11,1
> > -
> > -	b	pnv_powersave_wakeup_mce
> > +	mtlr	r4
> > +	rlwinm	r10,r3,47-31,30,31
> > +	cmpwi	cr3,r10,2
>
> Can we use SRR1_WS_GPRLOSS here as well as in IDLETEST()?

You mean cmpwi cr3,r10,SRR1_WS_GPRLOSS >> (63-47)?

> > +	bltlr	cr3	/* no state loss, return to idle caller */
> > +	b	idle_return_gpr_loss
>
> [..snip..]
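For what it's worth, the C-side equivalent of that SRR1[46:47] test would be
something like the sketch below (illustrative only; it assumes SRR1_WS_NOLOSS
is 0b01 and SRR1_WS_GPRLOSS is 0b10, which is what the cmpwi cr3,r10,2 above
implies):

	/* Extract the wake state loss field, SRR1[46:47] */
	unsigned long ws = (srr1 & SRR1_WAKESTATE) >> (63 - 47);

	if (ws < (SRR1_WS_GPRLOSS >> (63 - 47)))
		return srr1;	/* no state loss, return to idle caller */
	/* otherwise GPRs (and possibly more) were lost: idle_return_gpr_loss */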
> > +_GLOBAL(isa206_idle_insn_mayloss)
> > +	std	r1,PACAR1(r13)
> > +	mflr	r4
> > +	mfcr	r5
> > +	/* use stack red zone rather than a new frame */
> > +	addi	r6,r1,-INT_FRAME_SIZE
> > +	SAVE_GPR(2, r6)
> > +	SAVE_NVGPRS(r6)
> > +	std	r4,_LINK(r6)
> > +	std	r5,_CCR(r6)
> > +	cmpwi	r3,PNV_THREAD_NAP
> > +	bne	1f
> > +	IDLE_STATE_ENTER_SEQ_NORET(PPC_NAP)
>
> We can have the following here, since we don't expect to return from
> nap.
>
> 	b	.	/* catch bugs */
>
> > +1:	cmpwi	r3,PNV_THREAD_SLEEP

Sure, added.

> [..snip..]
>
> > +static unsigned long power7_idle_insn(unsigned long type)
> > +{
> > +	int cpu = raw_smp_processor_id();
> > +	int first = cpu_first_thread_sibling(cpu);
> > +	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
> > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > +	unsigned long srr1;
> > +	bool full_winkle;
> > +	struct p7_sprs sprs;
> > +	bool sprs_saved = false;
> > +	int rc;
> > +
> > +	memset(&sprs, 0, sizeof(sprs));
> > +
> > +	if (unlikely(type != PNV_THREAD_NAP)) {
> > +		atomic_lock_thread_idle();
> > +
> > +		BUG_ON(!(*state & thread));
> > +		*state &= ~thread;
> > +
> > +		if (power7_fastsleep_workaround_entry) {
> > +			if ((*state & ((1 << threads_per_core) - 1)) == 0) {
> > +				rc = opal_config_cpu_idle_state(
> > +						OPAL_CONFIG_IDLE_FASTSLEEP,
> > +						OPAL_CONFIG_IDLE_APPLY);
> > +				BUG_ON(rc);
> > +			}
> > +		}
> > +
> > +		if (type == PNV_THREAD_WINKLE) {
> > +			sprs.tscr	= mfspr(SPRN_TSCR);
> > +			sprs.worc	= mfspr(SPRN_WORC);
> > +
> > +			sprs.sdr1	= mfspr(SPRN_SDR1);
> > +			sprs.rpr	= mfspr(SPRN_RPR);
> > +			sprs.amor	= mfspr(SPRN_AMOR);
> > +
> > +			sprs.lpcr	= mfspr(SPRN_LPCR);
> > +			if (cpu_has_feature(CPU_FTR_ARCH_207S)) {
> > +				sprs.hfscr	= mfspr(SPRN_HFSCR);
> > +				sprs.fscr	= mfspr(SPRN_FSCR);
> > +			}
> > +			sprs.purr	= mfspr(SPRN_PURR);
> > +			sprs.spurr	= mfspr(SPRN_SPURR);
> > +			sprs.dscr	= mfspr(SPRN_DSCR);
> > +			sprs.wort	= mfspr(SPRN_WORT);
> > +
> > +			sprs_saved = true;
> > +
> > +			/*
> > +			 * Increment winkle counter and set all winkle bits if
> > +			 * all threads are winkling. This allows wakeup side to
> > +			 * distinguish between fast sleep and winkle state
> > +			 * loss. Fast sleep still has to resync the timebase so
> > +			 * this may not be a really big win.
> > +			 */
> > +			*state += 1 << PNV_CORE_IDLE_WINKLE_COUNT_SHIFT;
> > +			if ((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) >> PNV_CORE_IDLE_WINKLE_COUNT_SHIFT == threads_per_core)
> > +				*state |= PNV_CORE_IDLE_THREAD_WINKLE_BITS;
> > +			WARN_ON((*state & PNV_CORE_IDLE_WINKLE_COUNT_BITS) == 0);
> > +		}
> > +
> > +		atomic_unlock_thread_idle();
> > +	}
> > +
>
> Could we add the following for debug purposes? We are already dumping
> thread_idle_state in xmon.
>
> 	paca->thread_idle_state = type;
>
> > +	srr1 = isa206_idle_insn_mayloss(type);
>
> And the following:
>
> 	paca->thread_idle_state = PNV_THREAD_RUNNING;

Yes I'll add that.

> Apart from debugging purposes, IIRC, on some versions of POWER8,
> SRR1[46:47] wasn't being updated to 0b00 when we get an xscom SRESET
> while in running state. Thus, the only way to distinguish whether we
> entered 0x100 due to a nap/fastsleep/winkle wakeup or due to an NMI
> caused when the processor wasn't in idle state was
> paca->thread_idle_state.

We don't actually do anything with that today, do we? Is that one of
the reasons we don't use xscom SRESET on POWER8?

> > +
> > +	WARN_ON_ONCE(!srr1);
> > +	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
> > +
> > +	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
>
> [..snip..]
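To be concrete, the debug stores I'll add around the low-level call would
look something like this (untested sketch; from C the paca is reached via
local_paca):

	/* Record the requested state so xmon can dump it (debug aid) */
	local_paca->thread_idle_state = type;
	srr1 = isa206_idle_insn_mayloss(type);
	/* Back to running before doing anything else on the wakeup side */
	local_paca->thread_idle_state = PNV_THREAD_RUNNING;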
> > +static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> > +{
> > +	int cpu = raw_smp_processor_id();
> > +	int first = cpu_first_thread_sibling(cpu);
> > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > +	unsigned long srr1;
> > +	unsigned long mmcr0 = 0;
> > +	struct p9_sprs sprs;
> > +	bool sprs_saved = false;
> > +
> > +	memset(&sprs, 0, sizeof(sprs));
> > +
> > +	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
> > +		BUG_ON(!mmu_on);
> > +
> > +		/*
> > +		 * Wake synchronously. SRESET via xscom may still cause
> > +		 * a 0x100 powersave wakeup with SRR1 reason!
> > +		 */
> > +		srr1 = isa3_idle_stop_noloss(psscr);
> > +		if (likely(!srr1))
> > +			return 0;
> > +
>
> We come here if we were woken up from an ESL=0 stop by an xscom
> SRESET. Where would the SRESET be handled?

In irq_set_pending_from_srr1(), the same as the EC=1 states.

> > +		/*
> > +		 * Registers not saved, can't recover!
> > +		 * This would be a hardware bug
> > +		 */
> > +		BUG_ON((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS);
> > +
> > +		goto out;
> > +	}
> > +
>
> The patch looks good otherwise, especially the idle_book3s.S :-)
>
> Reviewed-by: Gautham R. Shenoy <ego@xxxxxxxxxxxxxxxxxx>

Thank you for all the reviews.

Thanks,
Nick
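PS: to make the EC=ESL=0 answer above concrete, the caller side I have in
mind is roughly the following (sketch only; the exact call site in the
series may differ):

	srr1 = power9_idle_stop(psscr, true);
	if (srr1)
		/* replay the 0x100 wakeup reason, e.g. an xscom SRESET */
		irq_set_pending_from_srr1(srr1);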