Re: [PATCH 1/5] KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking

Paul Mackerras <paulus@xxxxxxxxx> · Sat, 24 Nov 2012 08:42:10 +1100

On Fri, Nov 23, 2012 at 03:13:09PM +0100, Alexander Graf wrote:
> 
> On 22.11.2012, at 10:25, Paul Mackerras wrote:
> 
> > +	/* Do they have an SLB shadow buffer registered? */
> > +	slb = vcpu->arch.slb_shadow.pinned_addr;
> > +	if (!slb)
> > +		return;
> 
> Mind to explain this case? What happens here? Do we leave the guest with an empty SLB? Why would this ever happen? What happens next as soon as we go back into the guest?

Yes, we leave the guest with an empty SLB, the access gets retried and
this time the guest gets an SLB miss interrupt, which it can hopefully
handle using an SLB miss handler that runs entirely in real mode.
This could happen for instance while the guest is in SLOF or yaboot or
some other code that runs basically in real mode but occasionally
turns the MMU on for some accesses, and happens to have a bug where it
creates a duplicate SLB entry.

> > +	/* Sanity check */
> > +	n = slb->persistent;
> > +	if (n > SLB_MIN_SIZE)
> > +		n = SLB_MIN_SIZE;
> 
> Please use a max() macro here.

OK.

> > +	rb = 0x800;		/* IS field = 0b10, flush congruence class */
> > +	for (i = 0; i < 128; ++i) {
> 
> Please put up a #define for this. POWER7_TLB_SIZE or so. Is there any way to fetch that number from an SPR? I don't really want to have a p7+ and a p8 function in here too ;).
> 
> > +		asm volatile("tlbiel %0" : : "r" (rb));
> > +		rb += 0x1000;
> 
> I assume this also means (1 << TLBIE_ENTRY_SHIFT)? Would be nice to keep the code readable without guessing :).

The 0x800 and 0x1000 are taken from the architecture - it defines
fields in the RB value for the flush type and TLB index.  The 128 is
POWER7-specific and isn't in any SPR that I know of.  Eventually we'll
probably have to put it (the number of TLB congruence classes) in the
cputable, but for now I'll just do a define.

> So I take it that p7 does not implement tlbia?

Correct.

> So we never return 0? How about ECC errors and the likes? Wouldn't those also be #MCs that the host needs to handle?

Yes, true.  In fact the OPAL firmware gets to see the machine checks
before we do (see the opal_register_exception_handler() calls in
arch/powerpc/platforms/powernv/opal.c), so it should have already
handled recoverable things like L1 cache parity errors.

I'll make the function return 0 if there's an error that it doesn't
know about.

> > 	ld	r8, HSTATE_VMHANDLER(r13)
> > 	ld	r7, HSTATE_HOST_MSR(r13)
> > 
> > 	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
> > +BEGIN_FTR_SECTION
> > 	beq	11f
> > -	cmpwi	r12, BOOK3S_INTERRUPT_MACHINE_CHECK
> > +END_FTR_SECTION_IFSET(CPU_FTR_ARCH_206)
> 
> Mind to explain the logic that happens here? Do we get external interrupts on 970? If not, the cmpwi should also be inside the FTR section. Also, if we do a beq here, why do the beqctr below again?

I was making it not call the host kernel machine check handler if it
was a machine check that pulled us out of the guest.  In fact we
probably do still want to call the handler, but we don't want to jump
to 0x200, since that has been patched by OPAL, and we don't want to
make OPAL think we just got another machine check.  Instead we would
need to jump to machine_check_pSeries.

The feature section is because POWER7 sets HSRR0/1 on external
interrupts, whereas PPC970 sets SRR0/1.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html