On Wed, 2009-07-08 at 09:14 +0200, Alexander Graf wrote:

> >>  arch/powerpc/include/asm/kvm_ppc.h | 1 +
> >
> > .../...
>
> mh?

Just a standard way to say I snipped some of the quote :-)

> Yeah, that should be definitely possible. While it's not really
> necessary it makes the code smaller, so it's probably worth it ;-).

Could also make debugging easier. In fact you should make the whole
thing look like an interrupt frame (aka pt_regs + STACK_FRAME_OVERHEAD)
and stick in a signature similar to the one we put in our interrupt
frames (see the exception common macro) so we properly see them for
what they are in xmon etc...

> >> +/* XXX optimize non-volatile loading away */
> >> +kvm_start_lightweight:
> >> +
> >> +	DISABLE_INTERRUPTS

BTW. If this is coming from C code, I'd rather have a
hard_irq_disable() call in the C code before calling into the asm.

> >> +	/* This sets the Magic value for the trampoline:
> >> +	 *
> >> +	 * PPC64: SPRG3 |= 1
> >> +	 */
> >> +	setmagc	r3
> >> +

From the moment we do that, we must not take an exception until we
actually end up in the guest, right? So the code below must not take an
SLB miss. However, I don't think it is guaranteed that your VCPU thingy
pointed to by r4 is currently in a bolted SLB entry.

On some P5 or later machines, the SLB is effectively volatile: the
underlying pHyp hypervisor can crap on it, though it will restore bits
of it via the shadow SLB data structure in main memory. However, unless
you arrange for the VCPU structure to be in the first 256M of memory,
it won't be covered by that shadow.

You may want to modify the SLB code when using KVM to also "bolt" the
VCPU, or delay the flicking of SPRG3 if you can get away with clobbering
a GPR ...

> > The whole dcbz stuff could probably be a cpufeature block so it
> > gets nop'ed out when running on other processors than 970 since
> > they don't all support that magic dcbz trick.
>
> Yeah, I never really understood those cpufeature blocks ...

Hehehe :-) There's also the MMU features and FW features btw :-)

The base principle is that we stick references to the start and end of
the block into an ELF section along with a mask & value of CPU feature
bits to compare against. At boot time, if it doesn't match, we NOP out
everything between start and end.

Recently, Michael Ellerman also improved on it by allowing an
"alternate", i.e. two implementations of the block of code: the first
one is in place by default, and the second one lives in a separate ELF
section. The second one is copied over the first one (and padded with
NOPs; branches are fixed up too) if the CPU features don't match, which
allows "alternate" implementations of performance-critical asm code (of
course, the "default" implementation needs to be larger than or equal
in size to the "alternate" one).

> > Also, I think HID5
> > is a HV reserved register thus you won't be able to do that trick
> > when running yourself with MSR:HV=0, for example when running on
> > a js2x blade.
>
> Yes, it is. That's why the HFLAGS bit is only set when HV=1 :-).

Ok. This is also something that should only be done on a real 970,
970FX or 970MP processor, as others don't have that bit in HID5 afaik.

> FAULT_* are basically the registers that store where the guest
> faulted. So if the guest triggers a data store interrupt, the
> corresponding dar gets stored to a vcpu field, so we don't clobber it
> later.

Ok.

> Yes, the guest runs with PR=1 :-).

Right, that was my understanding too but heh, better being sure :-)

> I don't think we can easily have Linux running while we're in the
> guest context. What if the DEC issues the scheduler, which schedules
> off and back again? How would it know where to resume the guest? And
> who'd set the magic bit in SPRG3?

No, you misunderstood me. But then, I need to better "get" what you are
doing. For example, with MOL, the guest is split in two...
the part that is in the virtual machine, but also the parts that run as
a normal linux process (which do the device emulation etc...). The trick
when we take any exception is that we context switch back to make it
look like we are coming from that part, basically from the magic syscall
where the "linux" part of the guest called into the kernel to switch
into emulation.

I have to get more familiar with how KVM does these things though to
provide more useful feedback.

> When running a PPC64 guest things get even worse, as we have to switch
> the SLB as well, which is actually the slow part of the entry/exit
> code atm.

I'm not totally sure we really have to. I need to better understand
what you do with the SLB, and with that plus my own knowledge of what
Linux needs, we can probably simplify things quite a bit. For example,
most of the Linux host side SLB entries can just be ditched.

> Maybe we could work around those problems by integrating things a bit
> more, but I doubt it's necessary. Host DEC and EE interrupts shouldn't
> really hurt performance that much.

Right. Beware that MacOS 9, if you ever want to run that, will trigger
shitloads of guest DEC interrupts tho.

> What we do here is do a full guest exit cycle and go back to the Linux
> handler we came from, so it can handle the interrupt we intercepted.
> That way we're in normal kernel code from the point of view of every
> other part of Linux.

But don't we do that for any interrupt? I don't quite get why DEC and
EE are "special" here... What about machine checks, for example? Or
system reset? I understand that you want synchronous interrupts such as
FP, altivec, etc... to be routed back to the guest, but DEC and EE
aren't the only ones that need to be reflected back to Linux, are they?

> Maybe I'm calling it wrong? Basically, I want Linux to handle
> interrupts :-). And I did a irq_local_disable before, so this is the
> asm equivalent of _enable, no?

Well, no, if you were to do that you should call raw_local_irq_restore()
since we may need to do some "fixups", for example if an interrupt did
happen while we were soft-disabled. But then, you should not call into
the linux EE or decrementer handler with interrupts enabled in the first
place. You should really just make it look like you took the interrupt
from the underlying userland process in which the guest runs...

Catch me on IRC, I need to better understand your model, and we can sort
that out.

Cheers,
Ben.
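
For reference, the cpufeature principle described earlier in this mail
(start/end of a block recorded next to a mask & value, and the block
NOPed out at boot when the CPU feature bits don't match) can be modelled
in plain user-space C roughly as below. This is only a sketch under
assumptions: struct fixup_entry, apply_fixups() and the use of array
indices for start/end are made up for illustration and are not the
kernel's actual record layout or function names. The only real constant
is the PowerPC nop encoding (ori 0,0,0).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PPC_NOP 0x60000000u	/* "ori 0,0,0", the PowerPC nop */

/* Hypothetical fixup record, one per feature-dependent block. */
struct fixup_entry {
	uint64_t mask;	/* which CPU feature bits to look at */
	uint64_t value;	/* what those bits must equal to keep the block */
	size_t start;	/* index of the first instruction of the block */
	size_t end;	/* index one past the last instruction */
};

/*
 * Walk the table once (think "boot time"). Any block whose mask/value
 * test fails against the running CPU's feature bits is overwritten
 * with nops; matching blocks are left untouched.
 */
static void apply_fixups(uint64_t cpu_features, uint32_t *text,
			 const struct fixup_entry *tbl, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(tbl[i].start <= tbl[i].end);

		if ((cpu_features & tbl[i].mask) == tbl[i].value)
			continue;	/* feature present: keep the code */

		for (size_t j = tbl[i].start; j < tbl[i].end; j++)
			text[j] = PPC_NOP;
	}
}
```

The "alternate" variant would copy a second instruction sequence over
the block instead of nops and pad the remainder, which is why the
default block has to be at least as large as the alternate one.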