On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
> > +4.78 KVM_PPC_GET_HTAB_FD
> > +
> > +Capability: KVM_CAP_PPC_HTAB_FD
> > +Architectures: powerpc
> > +Type: vm ioctl
> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
> > +Returns: file descriptor number (>= 0) on success, -1 on error
> > +
> > +This returns a file descriptor that can be used either to read out the
> > +entries in the guest's hashed page table (HPT), or to write entries to
> > +initialize the HPT. The returned fd can only be written to if the
> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
> > +can only be read if that bit is clear. The argument struct looks like
> > +this:
> > +
> > +/* For KVM_PPC_GET_HTAB_FD */
> > +struct kvm_get_htab_fd {
> > +        __u64 flags;
> > +        __u64 start_index;
> > +};
> > +
> > +/* Values for kvm_get_htab_fd.flags */
> > +#define KVM_GET_HTAB_BOLTED_ONLY ((__u64)0x1)
> > +#define KVM_GET_HTAB_WRITE ((__u64)0x2)
> > +
> > +The `start_index' field gives the index in the HPT of the entry at
> > +which to start reading. It is ignored when writing.
> > +
> > +Reads on the fd will initially supply information about all
> > +"interesting" HPT entries. Interesting entries are those with the
> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
> > +all entries. When the end of the HPT is reached, the read() will
> > +return.
>
> What happens if the read buffer is smaller than the HPT size?

That's fine; the read stops when it has filled the buffer and a
subsequent read will continue from where the previous one finished.

> What happens if the read buffer size is not a multiple of entry size?

Then we don't use the last few bytes of the buffer. The read() call
returns the number of bytes that were filled in, of course. In any
case, the header size is 8 bytes and the HPT entry size is 16 bytes,
so the number of bytes filled in won't necessarily be a multiple of 16
bytes.
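
To make the byte counts concrete, here is a rough userspace sketch (not
code from the patch) of walking one buffer returned by read() on the
fd. The names htab_hdr, htab_totals and htab_scan are hypothetical; the
struct only mirrors the 8-byte header layout from the proposed
documentation. It assumes that each read() returns only whole
header-plus-entries chunks, which the answer above implies but does not
spell out, and it treats a cut-off chunk as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local mirror of the 8-byte stream header. */
struct htab_hdr {
    uint32_t index;     /* HPT index of the first entry in this chunk */
    uint16_t n_valid;   /* valid 16-byte entries that follow the header */
    uint16_t n_invalid; /* invalid entries after those, not in the stream */
};
static_assert(sizeof(struct htab_hdr) == 8, "header must be 8 bytes");

#define HPTE_SIZE 16u   /* size of one HPT entry in the stream */

struct htab_totals {
    uint64_t valid;     /* sum of n_valid over all chunks */
    uint64_t invalid;   /* sum of n_invalid over all chunks */
    size_t consumed;    /* bytes that belonged to complete chunks */
};

/*
 * Walk a buffer of len bytes as filled in by read().
 * Returns 0 on success, -1 if a chunk is cut off (assumed not to happen).
 * Fewer than 8 trailing bytes are left unconsumed, not treated as an error.
 */
static int htab_scan(const unsigned char *buf, size_t len,
                     struct htab_totals *out)
{
    size_t pos = 0;

    out->valid = 0;
    out->invalid = 0;
    out->consumed = 0;

    while (len - pos >= sizeof(struct htab_hdr)) {
        struct htab_hdr hdr;
        size_t body;

        /* memcpy avoids alignment assumptions about buf */
        memcpy(&hdr, buf + pos, sizeof(hdr));
        body = (size_t)hdr.n_valid * HPTE_SIZE;
        if (len - pos - sizeof(hdr) < body)
            return -1;

        out->valid += hdr.n_valid;
        out->invalid += hdr.n_invalid;
        pos += sizeof(hdr) + body;
        out->consumed = pos;
    }
    return 0;
}
```

A chunk with two valid entries takes 8 + 2*16 = 40 bytes, which shows
why the byte count need not be a multiple of 16.
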
> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> to change).

No.

> > If read() is called again on the fd, it will start again from
> > +the beginning of the HPT, but will only return HPT entries that have
> > +changed since they were last read.
> > +
> > +Data read or written is structured as a header (8 bytes) followed by a
> > +series of valid HPT entries (16 bytes) each. The header indicates how
> > +many valid HPT entries there are and how many invalid entries follow
> > +the valid entries. The invalid entries are not represented explicitly
> > +in the stream. The header format is:
> > +
> > +struct kvm_get_htab_header {
> > +        __u32 index;
> > +        __u16 n_valid;
> > +        __u16 n_invalid;
> > +};
>
> This structure forces the kernel to return entries sequentially. Will
> this block changing the data structure in the future? Or is the
> hardware spec sufficiently strict that such changes are not realistic?

By "data structure", do you mean the stream format on the file
descriptor, or the HPT structure? If we want a new stream format,
then we would define a bit in the flags field of struct
kvm_get_htab_fd to mean "I want the new stream format". The code
fails the ioctl if any unknown flag bits are set, so a new userspace
that wants to use the new format could then detect that it is running
on an old kernel and fall back to the old format.

The HPT entry format is very unlikely to change in size or basic
layout (though the architects do redefine some of the bits
occasionally).

> > +
> > +Writes to the fd create HPT entries starting at the index given in the
> > +header; first `n_valid' valid entries with contents from the data
> > +written, then `n_invalid' invalid entries, invalidating any previously
> > +valid entries found.
>
> This scheme is a clever, original, and very interesting approach to live
> migration.
> That doesn't necessarily mean a NAK, we should see if it
> makes sense for other migration APIs as well (we currently have
> difficulties migrating very large/wide guests).
>
> What is the typical number of entries in the HPT? Do you have estimates
> of the change rate?

Typically the HPT would have about a million entries, i.e. it would be
16MiB in size. The usual guideline is to make it about 1/64 of the
maximum amount of RAM the guest could ever have, rounded up to a power
of two, although we often run with less, say 1/128 or even 1/256.

Because it is a hash table, updates tend to be scattered throughout
the whole table, which is another reason why per-page dirty tracking
and updates would be pretty inefficient.

As for the change rate, it depends on the application of course, but
basically every time the guest changes a PTE in its Linux page tables
we do the corresponding change to the corresponding HPT entry, so the
rate can be quite high. Workloads that do a lot of fork, exit, mmap,
exec, etc. have a high rate of HPT updates.

> Suppose new hardware arrives that supports nesting HPTs, so that kvm is
> no longer synchronously aware of the guest HPT (similar to how NPT/EPT
> made kvm unaware of guest virtual->physical translations on x86). How
> will we deal with that? But I guess this will be a
> non-guest-transparent and non-userspace-transparent change, unlike
> NPT/EPT, so a userspace ABI addition will be needed anyway).

Nested HPTs or other changes to the MMU architecture would certainly
need new guest kernels and new support in KVM. With a nested
approach, the guest-side MMU data structures (HPT or whatever) would
presumably be in guest memory and thus be handled along with all the
other guest memory, while the host-side MMU data structures would not
need to be saved, so from the migration point of view that would make
it all a lot simpler.

Paul.
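
For reference, the sizing guideline above works out as in the following
sketch. The helper names (round_up_pow2, hpt_bytes, hpt_entries) are
made up for illustration and are not taken from the kernel. The sketch
assumes the rounding is applied to the resulting table size and that
each entry takes 16 bytes, as in the stream format.

```c
#include <assert.h>
#include <stdint.h>

#define HPTE_SIZE 16u   /* bytes per HPT entry */

/* Smallest power of two >= x; returns 0 if that does not fit in 64 bits. */
static uint64_t round_up_pow2(uint64_t x)
{
    uint64_t p = 1;

    if (x > (UINT64_C(1) << 63))
        return 0;
    while (p < x)
        p <<= 1;
    return p;
}

/*
 * HPT size in bytes for a guest whose RAM can grow to max_ram_bytes.
 * divisor is 64 for the usual guideline, 128 or 256 for smaller tables.
 */
static uint64_t hpt_bytes(uint64_t max_ram_bytes, unsigned int divisor)
{
    assert(divisor != 0);
    return round_up_pow2(max_ram_bytes / divisor);
}

/* Number of 16-byte entries in a table of the given size. */
static uint64_t hpt_entries(uint64_t bytes)
{
    return bytes / HPTE_SIZE;
}
```

With a 1 GiB maximum and a divisor of 64 this gives 16 MiB, i.e. 2^20
(about a million) entries. A 3 GiB maximum gives 48 MiB, which rounds
up to 64 MiB.
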