Re: [PATCH 01/12] Introduce virNodeHugeTLB


 



On 30.05.2014 14:46, Daniel P. Berrange wrote:
On Fri, May 30, 2014 at 01:41:06PM +0200, Michal Privoznik wrote:
On 30.05.2014 10:52, Daniel P. Berrange wrote:
On Thu, May 29, 2014 at 10:32:35AM +0200, Michal Privoznik wrote:
  /**
+ * virNodeHugeTLB:
+ * @conn: pointer to the hypervisor connection
+ * @type: type
+ * @params: pointer to memory parameter object
+ *          (return value, allocated by the caller)
+ * @nparams: pointer to number of memory parameters; input and output
+ * @flags: extra flags; not used yet, so callers should always pass 0
+ *
+ * Get information about host's huge pages. On input, @nparams
+ * gives the size of the @params array; on output, @nparams gives
+ * how many slots were filled with parameter information, which
+ * might be less but will not exceed the input value.
+ *
+ * As a special case, calling with @params as NULL and @nparams
+ * as 0 on input will cause @nparams on output to contain the
+ * number of parameters supported by the hypervisor. The caller
+ * should then allocate @params array, i.e.
+ * (sizeof(@virTypedParameter) * @nparams) bytes and call the API
+ * again.  See virDomainGetMemoryParameters() for an equivalent
+ * usage example.
+ *
+ * Returns 0 in case of success, and -1 in case of failure.
+ */
+int
+virNodeHugeTLB(virConnectPtr conn,
+               int type,
+               virTypedParameterPtr params,
+               int *nparams,
+               unsigned int flags)

What is the 'type' parameter doing?

Ah, it should be named numa_node rather than type. If type==-1, then overall
statistics are returned (number of {available,free} pages accumulated across
all NUMA nodes), if type >= 0, info on the specific NUMA node is returned.


I think in general this API needs a different design. I'd like to have
an API that can request info for all page sizes on all NUMA nods in a
single call. I also think the static unchanging data should be part of
the cpu + NUMA info in the capabilities XML. So the API only reports
info which is changing - ie the available pages.

The only problem is, the size of the huge page pool is not immutable. Currently it's
possible for 2M huge pages to be allocated dynamically:

# echo 8 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

and it may become possible for 1GB pages too in the future (what if the kernel
learns how to do it?). In general, the only thing we can take as unalterable for
now is the default huge page size. And I wouldn't bet on that either.

Yes, you can in theory change the number of huge pages at an arbitrary
time, but realistically people mostly only do it immediately at boot.
With 1 GB pages it will be impossible to do it at any time except immediately
at boot. If you wait a little while, then memory will be too fragmented
for you to be able to dynamically allocate more 1 GB pages. The same
is true of 2MB pages, to a lesser degree.

IMO no. Processes never ever see physical addresses (PA). All they see are virtual addresses (VA). So there is a possibility for the kernel to rearrange physical memory, without any effect on the processes, in order to gain bigger segments of free memory. The only problem is DMA transfers, which access PAs directly, but that can be resolved by reserving the first X {M,G}B of RAM for the transfers. And the kernel is already doing something similar - Kernel Samepage Merging (KSM).

On the other hand, I don't think the huge page pool size is going to be changed too often, especially when compared to huge page allocation in userspace processes. So whenever a mgmt application changes the pool size, it knows it has to refetch the capabilities. Well, if we document it :-)


In the <topology> element we should report the number of pages of each
size that are present in that node. We shouldn't treat huge pages
separately from small pages in this respect.

Well, if we do that you'll still need two API calls (and hence you'll lack
atomicity): virConnectGetCapabilities() followed by virNodeGetFreePages().
On the other hand, virNodeGetFreePages() itself is not atomic either
(from the system's POV - other processes can jump in and allocate huge pages
while libvirtd is executing the API).

I'm not worried about atomicity wrt free pages - no matter what
you do you will always race, because there's inherently a gap between
when you check the free pages and when the VM boots. I'm more interested
in the fact that the capabilities show you the overall NUMA topology
of the host, and the pages allocated per node are inherently part of
that model IMHO.

   <host>
     <cpu>
       <arch>x86_64</arch>
       <model>Westmere</model>
       <vendor>Intel</vendor>
       <topology sockets='1' cores='6' threads='2'/>
       <feature name='rdtscp'/>
       <feature name='pdpe1gb'/>
       <feature name='dca'/>
       <feature name='pdcm'/>
       <feature name='xtpr'/>
       <pages units="KiB" size="4"/>
       <pages units="MiB" size="2"/>
       <pages units="GiB" size="1"/>
     </cpu>

Right. This makes sense.


     <topology>
       <cells num='2'>
         <cell id='0'>
           <memory unit='KiB'>3221225472</memory>
           <pages unit="KiB"  size="4">262144</pages>
           <pages unit="MiB"  size="2">0</pages>
           <pages unit="GiB"  size="1">2</pages>
           <cpus num='4'>
             <cpu id='0'/>
             <cpu id='2'/>
             <cpu id='4'/>
             <cpu id='6'/>
           </cpus>
         </cell>
         <cell id='1'>
           <memory unit='KiB'>3221225472</memory>
           <pages unit="KiB"  size="4">0</pages>
           <pages unit="MiB"  size="2">1536</pages>
           <pages unit="GiB"  size="1">2</pages>
           <cpus num='4'>
             <cpu id='1'/>
             <cpu id='3'/>
             <cpu id='5'/>
             <cpu id='7'/>
           </cpus>
         </cell>
       </cells>
     </topology>


Well, and this to some extent too. We can report the pool size, but since it
may vary it's on the same level as the number of free pages. What about:

<pages unit='MiB' size='2' free='8' avail='8'/>

That way we don't even need a new API (not saying we shouldn't implement one
though - an API is still better than re-fetching the capabilities every single time).

I'm reluctant to include the free count in the capabilities because
that will encourage people to query the capabilities overly frequently
which will be pretty inefficient. This is why we only provide free
memory info there currently and have the separate API call.

If we report the used/free count in the capabilities we're forced into
refreshing this data frequently. If we only report the overall count
per node, even if there's the possibility to dynamically change that,
we have more potential to avoid refreshing it frequently.

So then an API call to request the available pages on all nodes
would look something like

    virNodeGetFreePages(virConnectPtr conn,
                        unsigned int *pages,
                        unsigned int npages,
                        unsigned int startcell,
                        unsigned int cellcount,
                        unsigned long long *counts);

In this API

  @pages - array whose elements are the page sizes to request info for
  @npages - number of elements in @pages
  @startcell - ID of first NUMA cell to request data for
  @cellcount - number of cells to request data for
  @counts - array which is @npages * @cellcount in length

I wanted to use virTypedParam to keep the API open for future additions (e.g.
if somebody needs the count of surplus or reserved huge pages).

I'm not sure what you mean by 'reserved huge pages' - all non-4k page
counts we are reporting are "reserved pages".

I mean any of /sys/kernel/mm/hugepages/hugepages-2048kB/*.
But we can invent virNodeGetResvHugepages() for that.

Michal

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list



