Re: [PATCH 06/23] Add 970 highmem asm code

Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx> · Wed, 08 Jul 2009 17:37:44 +1000

On Wed, 2009-07-08 at 09:14 +0200, Alexander Graf wrote:

> >> arch/powerpc/include/asm/kvm_ppc.h |    1 +
> >
> > .../...
> 
> mh?

Just a standard way to say I snipped some of the quote :-)

> Yeah, that should be definitely possible. While it's not really  
> necessary it makes the code smaller, so it's probably worth it ;-).

Could also make debugging easier. In fact you should make the whole
thing look like an interrupt frame (aka pt_regs + STACK_FRAME_OVERHEAD)
and stick in a signature similar to the one we put in our interrupt
frames (see the exception common macro) so we properly see them for what
they are in xmon etc...

> >> +/* XXX optimize non-volatile loading away */
> >> +kvm_start_lightweight:
> >> +
> >> +	DISABLE_INTERRUPTS

BTW. If this is coming from C code, I'd rather have a hard_irq_disable()
call in the C code before calling into the asm.

> >> +	/* This sets the Magic value for the trampoline:
> >> +	 *
> >> +	 * PPC64: SPRG3 |= 1
> >> +	 */
> >> +	setmagc	r3
> >> +

From the moment we do that, we must not take an exception until we
actually end up in the guest, right? So the code below must not
take an SLB miss.

However, I don't think it is guaranteed that your VCPU thingy pointed to
by r4 is currently in a bolted SLB entry. On some P5 or later machines,
the SLB is effectively volatile: the underlying pHyp hypervisor can crap
on it, though it will restore bits of it via the shadow SLB data
structure in main memory. However, unless you arrange for the VCPU
structure to be in the first 256M of memory, it won't be covered by that
shadow. You may want to modify the SLB code when using KVM to also
"bolt" the VCPU, or delay the flicking of SPRG3 if you can get away with
clobbering a GPR ...

> > The whole dcbz stuff could probably be a cpufeature block so it
> > gets nop'ed out when running on other processors than 970 since
> > they don't all support that magic dcbz trick.
> 
> Yeah, I never really understood those cpufeature blocks ...

Hehehe :-) There are also the MMU features and FW features btw :-) The
basic principle is that we stick references to the start and end of the
block into an ELF section, along with a mask & value of CPU feature bits
to compare against. At boot time, if they don't match, we NOP out
everything between start and end. Recently, Michael Ellerman also
improved on it by allowing an "alternate", i.e. two implementations of
the block of code: the first one is in place by default, the second one
lives in a separate ELF section and is copied over the first one (padded
with NOPs, with branches fixed up too) if the CPU features don't match.
That allows "alternate" implementations of performance-critical asm code
(of course, the "default" implementation needs to be larger than or equal
in size to the "alternate" one).
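
To make the mechanism concrete, here is a rough user-space model in C.
It is only a sketch: the struct layout and the names feature_fixup and
apply_fixup are made up for illustration and are not the kernel's real
ones, and the branch fixup step is left out entirely. The only hard fact
in it is the PowerPC NOP encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPC_NOP 0x60000000u /* "ori 0,0,0", the PowerPC no-op encoding */

/* Hypothetical fixup record; the kernel's real layout differs. */
struct feature_fixup {
	uint64_t mask;       /* which CPU feature bits to look at */
	uint64_t value;      /* what those bits must equal to keep the default */
	uint32_t *start;     /* first instruction of the default block */
	uint32_t *end;       /* one past the last instruction of the block */
	const uint32_t *alt; /* alternate code, or NULL if there is none */
	size_t alt_len;      /* length of the alternate, in instructions */
};

/*
 * Model of the boot-time pass over one record.
 * Returns 1 if the block was patched, 0 if it was left alone.
 */
static int apply_fixup(uint64_t cpu_features, const struct feature_fixup *f)
{
	size_t len = (size_t)(f->end - f->start);
	size_t i;

	/* The default block must be at least as large as the alternate. */
	assert(f->alt_len <= len);

	/* Features match: keep the default implementation untouched. */
	if ((cpu_features & f->mask) == f->value)
		return 0;

	/* No match: copy the alternate (if any) over the default ... */
	for (i = 0; i < f->alt_len; i++)
		f->start[i] = f->alt[i];

	/* ... and pad whatever is left of the block with NOPs. */
	for (; i < len; i++)
		f->start[i] = PPC_NOP;

	return 1;
}
```

With alt == NULL and alt_len == 0 this degenerates to the plain "NOP out
the whole block" case, which is what the dcbz code would want on
anything that is not a 970.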

> > Also, I think HID5
> > is a HV reserved register thus you won't be able to do that trick
> > when running yourself with MSR:HV=0, for example when running on
> > a js2x blade.
> 
> Yes, it is. That's why the HFLAGS bit is only set when HV=1 :-).

Ok. This is also something that should only be done on a real 970, 970FX
or 970MP processor as others don't have that bit in HID5 afaik.

> FAULT_* are basically the registers that store where the guest  
> faulted. So if the guest triggers a data store interrupt, the  
> corresponding dar gets stored to a vcpu field, so we don't clobber it  
> later.

Ok.

> Yes, the guest runs with PR=1 :-).

Right, that was my understanding too but heh, better being sure :-)

> I don't think we can easily have Linux running while we're in the  
> guest context. What if the DEC issues the scheduler, which schedules  
> off and back again? How would it know where to resume the guest? And  
> who'd set the magic bit in SPRG3?

No, you misunderstood me. But then, I need to better "get" what you are
doing. For example, with MOL, the guest is split in two... the part that
is in the virtual machine, but also the parts that run as a normal linux
process (which do the device emulation etc...). The trick when we take
any exception is we context switch back to make it look like we are
coming from that part, basically from the magic syscall where the
"linux" part of the guest called into the kernel to switch into
emulation.

I have to get more familiar with how KVM does these things though to
provide more useful feedback.

> When running a PPC64 guest things get even worse, as we have to switch  
> the SLB as well, which is actually the slow part of the entry/exit  
> code atm.

I'm not totally sure we really have to. I need to better understand what
you do with the SLB; combined with my own knowledge of what Linux needs,
we can probably simplify things quite a bit. For example, most of the
Linux host side SLB entries can just be ditched.

> Maybe we could work around those problems by integrating things a bit  
> more, but I doubt it's necessary. Host DEC and EE interrupts shouldn't  
> really hurt performance that much.

Right. Beware that MacOS 9, if you ever want to run that, will trigger
shitloads of guest DEC interrupts tho.

> What we do here is do a full guest exit cycle and go back to the Linux  
> handler we came from, so it can handle the interrupt we intercepted.  
> That way we're in normal kernel code from the point of view of every  
> other part of Linux.

But don't we do that for any interrupt ? I don't quite get why DEC and
EE are "special" here...

What about machine checks, for example ? Or system reset ? I understand
that you want synchronous interrupts such as FP, altivec, etc... to be
routed back to the guest but DEC and EE aren't the only ones that need
to be reflected back to Linux are they ?

> Maybe I'm calling it wrong? Basically, I want Linux to handle  
> interrupts :-). And I did a irq_local_disable before, so this is the  
> asm equivalent of _enable, no?

Well, no, if you were to do that you should call raw_local_irq_restore()
since we may need to do some "fixups" for example if an interrupt did
happen while we were soft-disabled.
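
As a toy illustration of why a bare "enable" isn't enough, here is a
small C model of soft-disabling. Everything in it (cpu_irq_state,
irq_arrives, naive_enable, restore_with_fixup) is hypothetical and made
up for this sketch; it is not the kernel's implementation, just the
shape of the problem.

```c
#include <stdbool.h>

/* Hypothetical per-CPU state for the sketch. */
struct cpu_irq_state {
	bool soft_enabled; /* what the "local irq disable/enable" calls flip */
	bool hard_enabled; /* models the real interrupt enable bit */
	bool pending;      /* an interrupt showed up while soft-disabled */
	int handled;       /* how many interrupts actually got handled */
};

/* An external interrupt or decrementer fires. */
static void irq_arrives(struct cpu_irq_state *s)
{
	if (!s->hard_enabled)
		return; /* hardware holds it off, nothing is seen */
	if (s->soft_enabled) {
		s->handled++; /* normal case: handle it right away */
		return;
	}
	/* Soft-disabled: remember it and hard-disable until the restore. */
	s->pending = true;
	s->hard_enabled = false;
}

static void soft_disable(struct cpu_irq_state *s)
{
	s->soft_enabled = false;
}

/* Just flipping the flag back: anything recorded as pending stays stuck. */
static void naive_enable(struct cpu_irq_state *s)
{
	s->soft_enabled = true;
}

/* The restore path with the fixup: replay what was missed, then re-enable. */
static void restore_with_fixup(struct cpu_irq_state *s, bool enable)
{
	s->soft_enabled = enable;
	if (!enable)
		return;
	if (s->pending) {
		s->pending = false;
		s->handled++;
	}
	s->hard_enabled = true;
}
```

In this model, soft_disable() followed by irq_arrives() and then
naive_enable() leaves the interrupt pending and the CPU hard-disabled,
while restore_with_fixup() replays it and turns interrupts back on.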

But then, you should not call into the linux EE or decrementer handler
with interrupts enabled in the first place. You should really just make
it look like you took the interrupt from the underlying userland process
in which the guest runs...

Catch me on IRC, I need to better understand your model, and we can sort
that out.

Cheers,
Ben.
