On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
> > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> >> to change).
> >
> > No.
>
> This forces userspace to dedicate a thread for the HPT.

Why?  Reads never block in any case.

> >> > If read() is called again on the fd, it will start again from
> >> > +the beginning of the HPT, but will only return HPT entries that have
> >> > +changed since they were last read.
> >> > +
> >> > +Data read or written is structured as a header (8 bytes) followed by a
> >> > +series of valid HPT entries (16 bytes) each.  The header indicates how
> >> > +many valid HPT entries there are and how many invalid entries follow
> >> > +the valid entries.  The invalid entries are not represented explicitly
> >> > +in the stream.  The header format is:
> >> > +
> >> > +struct kvm_get_htab_header {
> >> > +	__u32	index;
> >> > +	__u16	n_valid;
> >> > +	__u16	n_invalid;
> >> > +};
> >>
> >> This structure forces the kernel to return entries sequentially.  Will
> >> this block changing the data structure in the future?  Or is the
> >> hardware spec sufficiently strict that such changes are not realistic?
> >
> > By "data structure", do you mean the stream format on the file
> > descriptor, or the HPT structure?  If we want a new stream format,
> > then we would define a bit in the flags field of struct
> > kvm_get_htab_fd to mean "I want the new stream format".  The code
> > fails the ioctl if any unknown flag bits are set, so a new userspace
> > that wants to use the new format could then detect that it is running
> > on an old kernel and fall back to the old format.
> >
> > The HPT entry format is very unlikely to change in size or basic
> > layout (though the architects do redefine some of the bits
> > occasionally).
>
> I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.
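To make that concrete, the read side in userspace would look roughly
like the sketch below.  This is untested illustration, not code from
the patch; it assumes that each read() returns only whole records,
that a pass ends when read() returns 0, and that an HPT entry in the
stream is simply the two doublewords of the HPTE:

#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct kvm_get_htab_header {
	uint32_t index;
	uint16_t n_valid;
	uint16_t n_invalid;
};

struct hpte {
	uint64_t v;	/* first doubleword of the HPTE */
	uint64_t r;	/* second doubleword of the HPTE */
};

/* Apply one pass of the dump stream to a userspace copy of the HPT. */
static int read_hpt_pass(int fd, struct hpte *hpt)
{
	char buf[16384];
	ssize_t nr;

	while ((nr = read(fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (nr > 0) {
			struct kvm_get_htab_header hdr;
			uint32_t i;
			int j;

			memcpy(&hdr, p, sizeof(hdr));
			p += sizeof(hdr);
			nr -= sizeof(hdr);
			i = hdr.index;

			/* n_valid entries follow the header explicitly... */
			for (j = 0; j < hdr.n_valid; j++, i++) {
				memcpy(&hpt[i], p, sizeof(struct hpte));
				p += sizeof(struct hpte);
				nr -= sizeof(struct hpte);
			}
			/* ...while the n_invalid entries are implicit in
			 * the stream: just clear the corresponding slots. */
			for (j = 0; j < hdr.n_invalid; j++, i++)
				memset(&hpt[i], 0, sizeof(struct hpte));
		}
	}
	return nr < 0 ? -1 : 0;
}

On the first pass a record will often cover a whole group of 8 HPTEs
in one header; on later passes most records would carry a single
changed entry, as described above.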
> >> > +
> >> > +Writes to the fd create HPT entries starting at the index given in the
> >> > +header; first `n_valid' valid entries with contents from the data
> >> > +written, then `n_invalid' invalid entries, invalidating any previously
> >> > +valid entries found.
> >>
> >> This scheme is a clever, original, and very interesting approach to live
> >> migration.  That doesn't necessarily mean a NAK, we should see if it
> >> makes sense for other migration APIs as well (we currently have
> >> difficulties migrating very large/wide guests).
> >>
> >> What is the typical number of entries in the HPT?  Do you have estimates
> >> of the change rate?
> >
> > Typically the HPT would have about a million entries, i.e. it would be
> > 16MiB in size.  The usual guideline is to make it about 1/64 of the
> > maximum amount of RAM the guest could ever have, rounded up to a power
> > of two, although we often run with less, say 1/128 or even 1/256.
>
> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
> it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

> > Because it is a hash table, updates tend to be scattered throughout
> > the whole table, which is another reason why per-page dirty tracking
> > and updates would be pretty inefficient.
>
> This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from the
current header format.  That would be less efficient for the initial
pass (assuming we achieve an average n_valid of at least 2 on the
initial pass), and probably less efficient for the incremental updates,
since a newly-invalidated entry would have to be represented as 16 zero
bytes rather than just an 8-byte header with n_valid=0 and n_invalid=1.
I'm assuming here that the initial pass would omit invalid entries.

> > As for the change rate, it depends on the application of course, but
> > basically every time the guest changes a PTE in its Linux page tables
> > we do the corresponding change to the corresponding HPT entry, so the
> > rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
> > exec, etc. have a high rate of HPT updates.
>
> If the rate is high enough, then there's no point in a live update.

True, but doesn't that argument apply to memory pages as well?

Paul.
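P.S. In case it helps, here is the sizing guideline above pinned down
as arithmetic (a hypothetical helper, just for illustration; the
"about a million entries" figure corresponds to a guest with 1GiB of
maximum RAM at the 1/64 ratio):

#include <stdint.h>

/* Round up to the next power of two. */
static uint64_t roundup_pow2(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/*
 * Guideline HPT size: 1/64 of the maximum RAM the guest could ever
 * have, rounded up to a power of two.  Each HPTE is 16 bytes, so a
 * 1GiB guest gets a 16MiB HPT, i.e. 2^20 (about a million) entries.
 */
static uint64_t hpt_size_bytes(uint64_t max_ram_bytes)
{
	return roundup_pow2(max_ram_bytes / 64);
}

static uint64_t hpt_entries(uint64_t max_ram_bytes)
{
	return hpt_size_bytes(max_ram_bytes) / 16;
}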