On 10/16/2012 11:52 PM, Paul Mackerras wrote:
> On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
>> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
>> > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>> >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>> >> to change).
>> >
>> > No.
>>
>> This forces userspace to dedicate a thread for the HPT.
>
> Why?  Reads never block in any case.

Ok.  This parallels KVM_GET_DIRTY_LOG.

>> >> I meant the internal data structure that holds HPT entries.
>
> Oh, that's just an array, and userspace already knows how big it is.
>
>> I guess I don't understand the index.  Do we expect changes to be in
>> contiguous ranges?  And invalid entries to be contiguous as well?  That
>> doesn't fit with how hash tables work.  Does the index represent the
>> position of the entry within the table, or something else?
>
> The index is just the position in the array.  Typically, in each group
> of 8 it will tend to be the low-numbered ones that are valid, since
> creating an entry usually uses the first empty slot.  So I expect that
> on the first pass, most of the records will represent 8 HPTEs.  On
> subsequent passes, probably most records will represent a single HPTE.

So it's a form of RLE compression.  Ok.

>> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
>> it warrant a live migration protocol?
>
> The qemu people I talked to seemed to think so.
>
>> > Because it is a hash table, updates tend to be scattered throughout
>> > the whole table, which is another reason why per-page dirty tracking
>> > and updates would be pretty inefficient.
>>
>> This suggests a stream format that includes the index in every entry.
>
> That would amount to dropping the n_valid and n_invalid fields from
> the current header format.
> That would be less efficient for the
> initial pass (assuming we achieve an average n_valid of at least 2 on
> the initial pass), and probably less efficient for the incremental
> updates, since a newly-invalidated entry would have to be represented
> as 16 zero bytes rather than just an 8-byte header with n_valid=0 and
> n_invalid=1.  I'm assuming here that the initial pass would omit
> invalid entries.

I agree.  But let's have some measurements to make sure.

>> > As for the change rate, it depends on the application of course, but
>> > basically every time the guest changes a PTE in its Linux page tables
>> > we do the corresponding change to the corresponding HPT entry, so the
>> > rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
>> > exec, etc. have a high rate of HPT updates.
>>
>> If the rate is high enough, then there's no point in a live update.
>
> True, but doesn't that argument apply to memory pages as well?

In some cases it does.  The question is what happens in practice.  If
you migrate a kernel build, how many entries are sent in the guest
stopped phase?

--
error compiling committee.c: too many arguments to function
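To make the size comparison concrete, here is a back-of-the-envelope sketch of the two encodings being debated.  The 16 bytes per HPTE and the 8-byte record header (index, n_valid, n_invalid) match the format under discussion; the 8-byte per-entry index header assumed for the alternative is a guess on my part, chosen so that Paul's break-even point of n_valid >= 2 works out:

```python
# Rough size comparison of the two HPT stream encodings.
# Assumptions: 16 bytes per HPTE; the current format uses an 8-byte
# record header (__u32 index, __u16 n_valid, __u16 n_invalid); the
# alternative tags every entry with an assumed 8-byte index header,
# and must send a newly-invalidated entry as 16 zero bytes.

HPTE_BYTES = 16
HEADER_BYTES = 8          # __u32 index + __u16 n_valid + __u16 n_invalid
PER_ENTRY_HEADER = 8      # assumed: index + padding, one per entry

def current_format(records):
    """records: iterable of (n_valid, n_invalid) pairs.
    Invalidations cost nothing beyond the shared 8-byte header."""
    return sum(HEADER_BYTES + HPTE_BYTES * n_valid
               for n_valid, _ in records)

def per_entry_format(records):
    """Every entry, including a newly-invalidated one sent as 16 zero
    bytes, carries its own index header."""
    return sum((PER_ENTRY_HEADER + HPTE_BYTES) * (n_valid + n_invalid)
               for n_valid, n_invalid in records)

# Initial pass: average n_valid = 2 per record, invalid entries omitted.
print(current_format([(2, 0)]))    # 40 bytes per record
print(per_entry_format([(2, 0)]))  # 48 bytes per record

# Incremental update invalidating a single entry.
print(current_format([(0, 1)]))    # 8 bytes
print(per_entry_format([(0, 1)]))  # 24 bytes
```

Under these assumptions the header format wins whenever a record covers two or more valid entries, and is three times cheaper for pure invalidations, which is consistent with the argument above; actual measurements would still be needed to see which case dominates.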