On Tue, Oct 16, 2012 at 03:06:33PM +0200, Avi Kivity wrote:
> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
> > On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
> >> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
> >> to change).
> >
> > No.
>
> This forces userspace to dedicate a thread for the HPT.

Why?  Reads never block in any case.

> >> > If read() is called again on the fd, it will start again from
> >> > +the beginning of the HPT, but will only return HPT entries that have
> >> > +changed since they were last read.
> >> > +
> >> > +Data read or written is structured as a header (8 bytes) followed by a
> >> > +series of valid HPT entries (16 bytes) each.  The header indicates how
> >> > +many valid HPT entries there are and how many invalid entries follow
> >> > +the valid entries.  The invalid entries are not represented explicitly
> >> > +in the stream.  The header format is:
> >> > +
> >> > +struct kvm_get_htab_header {
> >> > +	__u32	index;
> >> > +	__u16	n_valid;
> >> > +	__u16	n_invalid;
> >> > +};
> >>
> >> This structure forces the kernel to return entries sequentially.  Will
> >> this block changing the data structure in the future?  Or is the
> >> hardware spec sufficiently strict that such changes are not realistic?
> >
> > By "data structure", do you mean the stream format on the file
> > descriptor, or the HPT structure?  If we want a new stream format,
> > then we would define a bit in the flags field of struct
> > kvm_get_htab_fd to mean "I want the new stream format".  The code
> > fails the ioctl if any unknown flag bits are set, so a new userspace
> > that wants to use the new format could then detect that it is running
> > on an old kernel and fall back to the old format.
> >
> > The HPT entry format is very unlikely to change in size or basic
> > layout (though the architects do redefine some of the bits
> > occasionally).
>
> I meant the internal data structure that holds HPT entries.

Oh, that's just an array, and userspace already knows how big it is.

> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?

The index is just the position in the array.  Typically, in each group
of 8 it will tend to be the low-numbered ones that are valid, since
creating an entry usually uses the first empty slot.  So I expect that
on the first pass, most of the records will represent 8 HPTEs.  On
subsequent passes, probably most records will represent a single HPTE.
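To make that concrete, the read side in userspace would look roughly
like the sketch below.  This is untested illustration, not code from
the patch; it assumes that each read() returns only whole records,
that a pass ends when read() returns 0, and that an HPT entry in the
stream is simply the two doublewords of the HPTE:

#include <stdint.h>
#include <string.h>
#include <unistd.h>

struct kvm_get_htab_header {
	uint32_t index;
	uint16_t n_valid;
	uint16_t n_invalid;
};

struct hpte {
	uint64_t v;	/* first doubleword of the HPTE */
	uint64_t r;	/* second doubleword of the HPTE */
};

/* Apply one pass of the dump stream to a userspace copy of the HPT. */
static int read_hpt_pass(int fd, struct hpte *hpt)
{
	char buf[16384];
	ssize_t nr;

	while ((nr = read(fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (nr > 0) {
			struct kvm_get_htab_header hdr;
			uint32_t i;
			int j;

			memcpy(&hdr, p, sizeof(hdr));
			p += sizeof(hdr);
			nr -= sizeof(hdr);
			i = hdr.index;

			/* n_valid entries follow the header explicitly... */
			for (j = 0; j < hdr.n_valid; j++, i++) {
				memcpy(&hpt[i], p, sizeof(struct hpte));
				p += sizeof(struct hpte);
				nr -= sizeof(struct hpte);
			}
			/* ...while the n_invalid entries are implicit in
			 * the stream: just clear the corresponding slots. */
			for (j = 0; j < hdr.n_invalid; j++, i++)
				memset(&hpt[i], 0, sizeof(struct hpte));
		}
	}
	return nr < 0 ? -1 : 0;
}

On the first pass a record will often cover a whole group of 8 HPTEs
in one header; on later passes most records would carry a single
changed entry, as described above.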
> >> > +
> >> > +Writes to the fd create HPT entries starting at the index given in the
> >> > +header; first `n_valid' valid entries with contents from the data
> >> > +written, then `n_invalid' invalid entries, invalidating any previously
> >> > +valid entries found.
> >>
> >> This scheme is a clever, original, and very interesting approach to live
> >> migration.  That doesn't necessarily mean a NAK, we should see if it
> >> makes sense for other migration APIs as well (we currently have
> >> difficulties migrating very large/wide guests).
> >>
> >> What is the typical number of entries in the HPT?  Do you have estimates
> >> of the change rate?
> >
> > Typically the HPT would have about a million entries, i.e. it would be
> > 16MiB in size.  The usual guideline is to make it about 1/64 of the
> > maximum amount of RAM the guest could ever have, rounded up to a power
> > of two, although we often run with less, say 1/128 or even 1/256.
>
> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
> it warrant a live migration protocol?

The qemu people I talked to seemed to think so.

> > Because it is a hash table, updates tend to be scattered throughout
> > the whole table, which is another reason why per-page dirty tracking
> > and updates would be pretty inefficient.
>
> This suggests a stream format that includes the index in every entry.

That would amount to dropping the n_valid and n_invalid fields from the
current header format.  That would be less efficient for the initial
pass (assuming we achieve an average n_valid of at least 2 on the
initial pass), and probably less efficient for the incremental updates,
since a newly-invalidated entry would have to be represented as 16 zero
bytes rather than just an 8-byte header with n_valid=0 and n_invalid=1.
I'm assuming here that the initial pass would omit invalid entries.

> > As for the change rate, it depends on the application of course, but
> > basically every time the guest changes a PTE in its Linux page tables
> > we do the corresponding change to the corresponding HPT entry, so the
> > rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
> > exec, etc. have a high rate of HPT updates.
>
> If the rate is high enough, then there's no point in a live update.

True, but doesn't that argument apply to memory pages as well?

Paul.
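P.S. In case it helps, here is the sizing guideline above pinned down
as arithmetic (a hypothetical helper, just for illustration; the
"about a million entries" figure corresponds to a guest with 1GiB of
maximum RAM at the 1/64 ratio):

#include <stdint.h>

/* Round up to the next power of two. */
static uint64_t roundup_pow2(uint64_t x)
{
	uint64_t p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/*
 * Guideline HPT size: 1/64 of the maximum RAM the guest could ever
 * have, rounded up to a power of two.  Each HPTE is 16 bytes, so a
 * 1GiB guest gets a 16MiB HPT, i.e. 2^20 (about a million) entries.
 */
static uint64_t hpt_size_bytes(uint64_t max_ram_bytes)
{
	return roundup_pow2(max_ram_bytes / 64);
}

static uint64_t hpt_entries(uint64_t max_ram_bytes)
{
	return hpt_size_bytes(max_ram_bytes) / 16;
}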