On 07.02.2011, at 21:15, Scott Wood wrote:

> On Mon, 7 Feb 2011 16:43:02 +0100
> Alexander Graf <agraf@xxxxxxx> wrote:
>
>> On 04.02.2011, at 23:33, Scott Wood wrote:
>>
>>> On Thu, 3 Feb 2011 10:19:06 +0100
>>> Alexander Graf <agraf@xxxxxxx> wrote:
>>>
>>>> Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :).
>>>
>>> Maybe, though keeping it in KVM means we can change it whenever we want
>>> without having to sync up Qemu and worry about backward compatibility.
>>
>> Quite the contrary - you have to worry more about backward compatibility. If we implement a new feature that doesn't work on old kernels, we can just tell qemu not to use it on those old versions. For the kernel interfaces, we have to keep supporting old userspace.
>
> If you're talking about actual interface changes, yes.  But a change in
> how KVM implements things behind the scenes shouldn't break an old
> Qemu, unless it's buggy and makes assumptions not permitted by the API.

Right :).

>
>>> How's that different from backing the void pointer up with a different
>>> struct depending on the MMU type?  We weren't proposing unions.
>>>
>>> A fixed array does mean you wouldn't have to worry about whether qemu
>>> supports the more advanced struct format if fields are added --
>>> you can just unconditionally write it, as long as it's backwards
>>> compatible.  Unless you hit the limit of the pre-determined array size,
>>> that is.  And if that gets made higher for future expansion, that's
>>> even more data that has to get transferred, before it's really needed.
>>
>> Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in a __user pointer that we directly write to from kernel space, and tell qemu how big and what type a tlb entry is?
>>
>> struct request_ppc_tlb {
>>     int tlb_type;
>>     int tlb_entries;
>>     uint64_t __user *tlb_data;
>> };
>
> That's pretty much what the proposed API does -- except it uses a void
> pointer instead of uint64_t *.

Oh? Did I miss something there? The proposal looked as if it only transfers a single TLB entry at a time.

>
>> Would you really want to loop through 16k entries, doing an ioctl for each?
>
> Not really.  The API was modeled after something we did on Topaz where
> it's just a function call.  But something array-based would have been
> awkward without constraining the geometry.
>
> Now that we're going to constrain the geometry, providing an array-based
> get/set would be easy and should definitely be a part of the API.
>
>> Then performance really would always be an issue.
>
> For cases where you really need to do a full get/set, yes.
>
>> I would really prefer we tackle it with a full-on tlb get/set first and then put the very flexible one on top, because to me the full-on approach feels like the more generic one. I'm very open to adding an individual tlb get/set and maybe even a "kvm, please translate EA x to RA y" ioctl. But those should come after we cover the big hammer that just copies everything.
>
> If we add everything at once, it minimizes the possibilities that Qemu
> has to deal with -- either the full MMU API is there, or it's not.

Don't try to take load off qemu before it has even started to handle that load and shown that it performs badly. One of the big strengths of the kvm+qemu architecture we have today is the emulation part. Sure, it duplicates some of the code, but it does help in a lot of other places too - sometimes even because of the duplication.
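Coming back to the big hammer, just so we're talking about the same thing: this is roughly what I picture on the qemu side. It is purely a strawman - the struct is my request_ppc_tlb sketch from above, and the ioctl name and number are invented, nothing like this exists anywhere yet:

/* Strawman only: KVM_GET_PPC_TLB does not exist and the ioctl number
 * below is a placeholder.  The point is one bulk ioctl for the whole
 * TLB instead of 16k per-entry calls. */
#include <stdint.h>
#include <sys/ioctl.h>

#define KVM_GET_PPC_TLB 0       /* placeholder, not a real ioctl */

struct request_ppc_tlb {
        int tlb_type;
        int tlb_entries;
        uint64_t *tlb_data;     /* buffer KVM writes the whole TLB into */
};

/* Fetch the complete guest TLB with a single call. */
static int get_whole_tlb(int vcpu_fd, int tlb_type, uint64_t *buf,
                         int nentries)
{
        struct request_ppc_tlb req = {
                .tlb_type = tlb_type,
                .tlb_entries = nentries,
                .tlb_data = buf,
        };

        return ioctl(vcpu_fd, KVM_GET_PPC_TLB, &req);
}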
> BTW, I wonder if we should leave PPC out of the name.  It seems like
> any arch with a software-visible TLB could use this, since the hw
> details are hidden behind the MMU type.

Very true.

>
> How about:
>
> struct kvmppc_booke_tlb_entry {
>     union {
>         __u64 mas0_1;
>         struct {
>             __u32 mas0;
>             __u32 mas1;
>         };
>     };
>     __u64 mas2;
>     union {
>         __u64 mas7_3;
>         struct {
>             __u32 mas7;
>             __u32 mas3;
>         };
>     };
>     __u32 mas8;
>     __u32 pad;

Would it make sense to add some reserved fields here, or would we just bump up the mmu id?

> };
>
> struct kvmppc_booke_tlb_params {
>     /*
>      * book3e defines 4 TLBs.  Individual implementations may have
>      * fewer.  TLBs that do not exist on the target must be configured
>      * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
>      * configured here, though arrays greater than 2048 entries will
>      * have TLBnCFG[NENTRY] set to zero.
>      */
>     __u32 tlb_sizes[4];

Add some reserved fields?

> };
>
> struct kvmppc_booke_tlb_search {

Search? I thought we agreed on adding a search later, after the full get/set is settled?

>     struct kvmppc_booke_tlb_entry entry;
>     union {
>         __u64 mas5_6;
>         struct {
>             __u32 mas5;
>             __u32 mas6;
>         };
>     };
> };
>
> For an MMU type of PPC_BOOKE_NOHV, the mas5 field in
> kvmppc_booke_tlb_search and the mas8 field in kvmppc_booke_tlb_entry
> are present but not supported.
>
> For an MMU type of PPC_BOOKE_NOHV or PPC_BOOKE_HV:
>  - TLB entries in get/set arrays may be provided in any order, and all
>    TLBs are get/set at once.

Makes sense.

>  - An entry with MAS1[V] = 0 terminates the list early (but there will
>    be no terminating entry if the full array is valid).  On a call to
>    KVM_GET_TLB, the contents of elements after the terminator are undefined.
>    On a call to KVM_SET_TLB, excess elements beyond the terminating
>    entry may not be accessed by KVM.

Very implementation-specific, but OK with me. It's constrained to the BookE implementation of that GET/SET anyway. Is this how the hardware works too?

>
> [Note: Once we implement sregs, Qemu can determine which TLBs are
> implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
> unsupported by KVM if its existence is implied by the target CPU]
>
> KVM_SET_TLB
> -----------
>
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: struct kvm_set_tlb (in)
> Returns: 0 on success
>          -1 on error
>
> struct kvm_set_tlb {
>     __u64 params;
>     __u64 array;
>     __u32 mmu_type;
> };
>
> [Note: I used __u64 rather than void * to avoid the need for special
> compat handling with 32-bit userspace on a 64-bit kernel -- if the other
> way is preferred, that's fine with me]

Oh, now I understand what you were proposing :). Sorry. No, this way is sane.

> Configures and sets the virtual CPU's TLB array.  The "params" and
> "array" fields are userspace addresses of mmu-type-specific data
> structures.
>
> For mmu types PPC_BOOKE_NOHV and PPC_BOOKE_HV, the "params" field is of
> type "struct kvmppc_booke_tlb_params", and the "array" field points to
> an array of type "struct kvmppc_booke_tlb_entry".
>
> [Note: KVM_SET_TLB with early array termination makes a separate
> invalidate call unnecessary.]
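Just to double-check that I read the __u64 fields right, this is how I'd expect qemu to call it. It is a sketch against the proposed struct layouts only - the KVM_SET_TLB ioctl number below is a placeholder, since none of this exists in a kernel yet:

/* Sketch against the proposed structs above; KVM_SET_TLB itself does
 * not exist yet, so the ioctl number is a placeholder.  Casting the
 * pointers to __u64 is what avoids compat handling for 32-bit
 * userspace on a 64-bit kernel. */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/types.h>

#define KVM_SET_TLB 0           /* placeholder, not a real ioctl */

struct kvmppc_booke_tlb_params {
        __u32 tlb_sizes[4];
};

struct kvm_set_tlb {
        __u64 params;           /* userspace address of the params struct */
        __u64 array;            /* userspace address of the entry array */
        __u32 mmu_type;
};

static int set_booke_tlb(int vcpu_fd, __u32 mmu_type,
                         struct kvmppc_booke_tlb_params *params,
                         void *entries)
{
        struct kvm_set_tlb cmd = {
                .params = (uintptr_t)params,
                .array = (uintptr_t)entries,
                .mmu_type = mmu_type,
        };

        return ioctl(vcpu_fd, KVM_SET_TLB, &cmd);
}

And with the early-termination rule, passing an array whose first entry has MAS1[V] = 0 would then act as the "invalidate everything" case, right?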
>
> KVM_GET_TLB
> -----------
>
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: void pointer (out)
> Returns: 0 on success
>          -1 on error
>
> Reads the TLB array from a virtual CPU.  A successful call to
> KVM_SET_TLB must have been previously made on this vcpu.  The argument
> must point to space for an array of the size and type of TLB entry
> structs configured by the most recent successful call to KVM_SET_TLB.
>
> For mmu types BOOKE_NOHV and BOOKE_HV, the array is of type "struct
> kvmppc_booke_tlb_entry", and must hold a number of entries equal to
> the sum of the elements of tlb_sizes in the most recent successful
> TLB configuration call.

We should add some sort of safety net here. The caller really should pass in how big that buffer is.

>
>
> KVM_SEARCH_TLB
> --------------
>
> Capability: KVM_CAP_SW_TLB
> Type: vcpu ioctl
> Parameters: void pointer (in/out)
> Returns: 0 on success
>          -1 on error
>
> Searches the TLB array of a virtual CPU for an entry matching
> mmu-type-specific parameters.  A successful call to KVM_SET_TLB must
> have been previously made on this vcpu.
>
> For mmu types BOOKE_NOHV and BOOKE_HV, the argument must point to a
> struct of type "kvmppc_booke_tlb_search".  The search operates as a tlbsx
> instruction.
>
> [Note: there currently exists a BookE implementation of KVM_TRANSLATE,
> but the way it interprets the address is broken on 64-bit, and it seems to
> confuse PPC's notion of a virtual address with what was most likely
> intended by the x86ish API.  If nothing uses it yet that would be broken
> by a change, we may want to reimplement it using the address input as
> just an effective address, as it would be interpreted by the current
> state of the vcpu.  This would be a way to provide a non-hw-specific
> mechanism for simple virtual->physical translation for debuggers and
> such, as long as the caller doesn't care about the x86ish attributes.]

I'm not aware of any callers of KVM_TRANSLATE on PPC, so we're free to change the semantics. For gdb, a simple ea->pa translation with a valid bit should be enough. The original semantics of KVM_TRANSLATE operate on the current context anyway.

Sorry for the late reply :)


Alex
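PS: For the gdb case, something along these lines is all I think we need. It reuses the existing struct kvm_translation and the KVM_TRANSLATE ioctl, assuming we redefine the PPC semantics so that the input is simply an effective address in the vcpu's current context:

/* Debugger-style lookup via the existing KVM_TRANSLATE ioctl, assuming
 * the reworked PPC semantics (EA in, RA plus valid bit out); the x86ish
 * attribute bits are simply ignored here. */
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Returns 0 and fills *pa if the guest EA has a valid mapping. */
static int guest_ea_to_pa(int vcpu_fd, __u64 ea, __u64 *pa)
{
        struct kvm_translation tr = { .linear_address = ea };

        if (ioctl(vcpu_fd, KVM_TRANSLATE, &tr) < 0)
                return -1;
        if (!tr.valid)
                return -1;

        *pa = tr.physical_address;
        return 0;
}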