Re: [PATCH 5/5] KVM: PPC: Book3S HV: Provide a method for userspace to read and write the HPT

Avi Kivity <avi@xxxxxxxxxx> writes:

> On 10/16/2012 01:58 PM, Paul Mackerras wrote:
>> On Tue, Oct 16, 2012 at 12:06:58PM +0200, Avi Kivity wrote:
>>> On 10/16/2012 06:01 AM, Paul Mackerras wrote:
>>> > +4.78 KVM_PPC_GET_HTAB_FD
>>> > +
>>> > +Capability: KVM_CAP_PPC_HTAB_FD
>>> > +Architectures: powerpc
>>> > +Type: vm ioctl
>>> > +Parameters: Pointer to struct kvm_get_htab_fd (in)
>>> > +Returns: file descriptor number (>= 0) on success, -1 on error
>>> > +
>>> > +This returns a file descriptor that can be used either to read out the
>>> > +entries in the guest's hashed page table (HPT), or to write entries to
>>> > +initialize the HPT.  The returned fd can only be written to if the
>>> > +KVM_GET_HTAB_WRITE bit is set in the flags field of the argument, and
>>> > +can only be read if that bit is clear.  The argument struct looks like
>>> > +this:
>>> > +
>>> > +/* For KVM_PPC_GET_HTAB_FD */
>>> > +struct kvm_get_htab_fd {
>>> > +	__u64	flags;
>>> > +	__u64	start_index;
>>> > +};
>>> > +
>>> > +/* Values for kvm_get_htab_fd.flags */
>>> > +#define KVM_GET_HTAB_BOLTED_ONLY	((__u64)0x1)
>>> > +#define KVM_GET_HTAB_WRITE		((__u64)0x2)
>>> > +
>>> > +The `start_index' field gives the index in the HPT of the entry at
>>> > +which to start reading.  It is ignored when writing.
>>> > +
>>> > +Reads on the fd will initially supply information about all
>>> > +"interesting" HPT entries.  Interesting entries are those with the
>>> > +bolted bit set, if the KVM_GET_HTAB_BOLTED_ONLY bit is set, otherwise
>>> > +all entries.  When the end of the HPT is reached, the read() will
>>> > +return.  
>>> 
>>> What happens if the read buffer is smaller than the HPT size?
>> 
>> That's fine; the read stops when it has filled the buffer and a
>> subsequent read will continue from where the previous one finished.
>> 
>>> What happens if the read buffer size is not a multiple of entry size?
>> 
>> Then we don't use the last few bytes of the buffer.  The read() call
>> returns the number of bytes that were filled in, of course.  In any
>> case, the header size is 8 bytes and the HPT entry size is 16 bytes,
>> so the number of bytes filled in won't necessarily be a multiple of 16
>> bytes.
>
> That's sane and expected, but it should be documented.
>
>> 
>>> Does/should the fd support O_NONBLOCK and poll? (=waiting for an entry
>>> to change).
>> 
>> No.
>
> This forces userspace to dedicate a thread for the HPT.

If no changes are available, does read return a size > 0?  I don't think
it's necessary to support polling.  The kernel should always be able to
respond to userspace here.  The only catch is whether read() should
return a non-zero size when there are no changes.

In any case, I can't see why a dedicated thread is needed.  QEMU is
going to poll the HPT based on how fast we can send data over the wire.
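
To make that concrete, here is roughly the loop I would expect on the
QEMU side.  This is only a sketch against the structs quoted above: the
buffer size, the helper name, the assumption that each read() returns
only whole header+entries records, and the assumption that read()
returns 0 once there is nothing left to report are mine, not part of
the proposed ABI.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>          /* kvm_get_htab_fd, kvm_get_htab_header */

#define HPTE_STREAM_SIZE 16     /* one HPT entry in the stream, per the doc */

/* One pass over the "interesting" entries; returns bytes consumed or -1. */
static long htab_save_pass(int vmfd)
{
        struct kvm_get_htab_fd ghf = {
                .flags = 0,          /* read mode: KVM_GET_HTAB_WRITE clear */
                .start_index = 0,
        };
        char buf[65536];
        long total = 0;
        ssize_t n;
        int htabfd;

        htabfd = ioctl(vmfd, KVM_PPC_GET_HTAB_FD, &ghf);
        if (htabfd < 0)
                return -1;

        /*
         * Assumes each read() returns whole records: an 8-byte header
         * followed by n_valid 16-byte entries; the n_invalid entries are
         * only counted in the header, never transmitted.  Also assumes
         * read() returns 0 when nothing is left to report, which is
         * exactly the open question above.
         */
        while ((n = read(htabfd, buf, sizeof(buf))) > 0) {
                char *p = buf, *end = buf + n;

                while (p + sizeof(struct kvm_get_htab_header) <= end) {
                        struct kvm_get_htab_header hdr;

                        memcpy(&hdr, p, sizeof(hdr));
                        p += sizeof(hdr) + (size_t)hdr.n_valid * HPTE_STREAM_SIZE;
                        /* ...queue hdr plus its valid entries for the wire... */
                        printf("index %u: %u valid, %u invalid\n",
                               hdr.index, hdr.n_valid, hdr.n_invalid);
                }
                total += n;
        }
        close(htabfd);
        return total;
}

If read() really does return 0 like that, the normal migration thread
can simply interleave these passes with the rest of the RAM stream; no
extra thread required.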

>>> > If read() is called again on the fd, it will start again from
>>> > +the beginning of the HPT, but will only return HPT entries that have
>>> > +changed since they were last read.
>>> > +
>>> > +Data read or written is structured as a header (8 bytes) followed by a
>>> > +series of valid HPT entries (16 bytes each).  The header indicates how
>>> > +many valid HPT entries there are and how many invalid entries follow
>>> > +the valid entries.  The invalid entries are not represented explicitly
>>> > +in the stream.  The header format is:
>>> > +
>>> > +struct kvm_get_htab_header {
>>> > +	__u32	index;
>>> > +	__u16	n_valid;
>>> > +	__u16	n_invalid;
>>> > +};
>>> 
>>> This structure forces the kernel to return entries sequentially.  Will
>>> this block changing the data structure in the future?  Or is the
>>> hardware spec sufficiently strict that such changes are not realistic?
>> 
>> By "data structure", do you mean the stream format on the file
>> descriptor, or the HPT structure?  If we want a new stream format,
>> then we would define a bit in the flags field of struct
>> kvm_get_htab_fd to mean "I want the new stream format".  The code
>> fails the ioctl if any unknown flag bits are set, so a new userspace
>> that wants to use the new format could then detect that it is running
>> on an old kernel and fall back to the old format.
>> 
>> The HPT entry format is very unlikely to change in size or basic
>> layout (though the architects do redefine some of the bits
>> occasionally).
>
> I meant the internal data structure that holds HPT entries.
>
> I guess I don't understand the index.  Do we expect changes to be in
> contiguous ranges?  And invalid entries to be contiguous as well?  That
> doesn't fit with how hash tables work.  Does the index represent the
> position of the entry within the table, or something else?
>
>
>> 
>>> > +
>>> > +Writes to the fd create HPT entries starting at the index given in the
>>> > +header; first `n_valid' valid entries with contents from the data
>>> > +written, then `n_invalid' invalid entries, invalidating any previously
>>> > +valid entries found.
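
Just to check my understanding of the write side: restoring on the
destination would presumably look something like the sketch below.
Again only a sketch, with the same caveats as the read loop earlier in
this mail; the helper name and the assumption that each write() carries
whole, pre-formed header+entries records (as read out on the source)
are mine.

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Push one chunk of header+entries records into the new guest's HPT. */
static int htab_restore_chunk(int vmfd, const void *recs, size_t len)
{
        struct kvm_get_htab_fd ghf = {
                .flags = KVM_GET_HTAB_WRITE,
                .start_index = 0,     /* ignored when writing, per the doc */
        };
        ssize_t n;
        int htabfd;

        htabfd = ioctl(vmfd, KVM_PPC_GET_HTAB_FD, &ghf);
        if (htabfd < 0)
                return -1;

        /* n_invalid entries are recreated from the count in each header. */
        n = write(htabfd, recs, len);
        close(htabfd);
        return n == (ssize_t)len ? 0 : -1;
}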
>>> 
>>> This scheme is a clever, original, and very interesting approach to live
>>> migration.  That doesn't necessarily mean a NAK, we should see if it
>>> makes sense for other migration APIs as well (we currently have
>>> difficulties migrating very large/wide guests).
>>> 
>>> What is the typical number of entries in the HPT?  Do you have estimates
>>> of the change rate?
>> 
>> Typically the HPT would have about a million entries, i.e. it would be
>> 16MiB in size.  The usual guideline is to make it about 1/64 of the
>> maximum amount of RAM the guest could ever have, rounded up to a power
>> of two, although we often run with less, say 1/128 or even 1/256.
>
> 16MiB is transferred in ~0.15 sec on GbE, much faster with 10GbE.  Does
> it warrant a live migration protocol?

0.15 sec == 150ms.  The typical downtime window is 30ms.  So yeah, I
think it does.
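
(Back of the envelope: 16 MiB is ~16.8 MB, and GbE delivers at best
~110-120 MB/s of payload, so even a one-shot cold transfer of the whole
HPT is ~140-150 ms -- several times the budget, before counting anything
else we have to send in that window.)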

>> Because it is a hash table, updates tend to be scattered throughout
>> the whole table, which is another reason why per-page dirty tracking
>> and updates would be pretty inefficient.
>
> This suggests a stream format that includes the index in every entry.
>
>> 
>> As for the change rate, it depends on the application of course, but
>> basically every time the guest changes a PTE in its Linux page tables
>> we do the corresponding change to the corresponding HPT entry, so the
>> rate can be quite high.  Workloads that do a lot of fork, exit, mmap,
>> exec, etc. have a high rate of HPT updates.
>
> If the rate is high enough, then there's no point in a live update.

Do we have practical data here?

Regards,

Anthony Liguori

>
>> 
>>> Suppose new hardware arrives that supports nesting HPTs, so that kvm is
>>> no longer synchronously aware of the guest HPT (similar to how NPT/EPT
>>> made kvm unaware of guest virtual->physical translations on x86).  How
>>> will we deal with that?  But I guess this will be a
>>> non-guest-transparent and non-userspace-transparent change, unlike
>>> NPT/EPT, so a userspace ABI addition will be needed anyway).
>> 
>> Nested HPTs or other changes to the MMU architecture would certainly
>> need new guest kernels and new support in KVM.  With a nested
>> approach, the guest-side MMU data structures (HPT or whatever) would
>> presumably be in guest memory and thus be handled along with all the
>> other guest memory, while the host-side MMU data structures would not
>> need to be saved, so from the migration point of view that would make
>> it all a lot simpler.
>
> Yeah.
>
>
> -- 
> error compiling committee.c: too many arguments to function