Re: [PATCH 01/12] Introduce virNodeHugeTLB

On 30.05.2014 10:52, Daniel P. Berrange wrote:
On Thu, May 29, 2014 at 10:32:35AM +0200, Michal Privoznik wrote:
  /**
+ * virNodeHugeTLB:
+ * @conn: pointer to the hypervisor connection
+ * @type: type
+ * @params: pointer to memory parameter object
+ *          (return value, allocated by the caller)
+ * @nparams: pointer to number of memory parameters; input and output
+ * @flags: extra flags; not used yet, so callers should always pass 0
+ *
+ * Get information about host's huge pages. On input, @nparams
+ * gives the size of the @params array; on output, @nparams gives
+ * how many slots were filled with parameter information, which
+ * might be less but will not exceed the input value.
+ *
+ * As a special case, calling with @params as NULL and @nparams
+ * as 0 on input will cause @nparams on output to contain the
+ * number of parameters supported by the hypervisor. The caller
+ * should then allocate @params array, i.e.
+ * (sizeof(@virTypedParameter) * @nparams) bytes and call the API
+ * again.  See virDomainGetMemoryParameters() for an equivalent
+ * usage example.
+ *
+ * Returns 0 in case of success, and -1 in case of failure.
+ */
+int
+virNodeHugeTLB(virConnectPtr conn,
+               int type,
+               virTypedParameterPtr params,
+               int *nparams,
+               unsigned int flags)

What is the 'type' parameter doing ?

Ah, it should be named numa_node rather than type. If type == -1, overall statistics are returned (the number of {available,free} pages accumulated across all NUMA nodes); if type >= 0, info on that specific NUMA node is returned.


I think in general this API needs a different design. I'd like to have
an API that can request info for all page sizes on all NUMA nodes in a
single call. I also think the static unchanging data should be part of
the cpu + NUMA info in the capabilities XML, so the API only reports
info which is changing - i.e. the available pages.

The only problem is that the size of the huge page pool is not immutable. It's already possible for 2M huge pages to be allocated dynamically:

# echo 8 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

and it may become possible for 1GB pages too in the future (what if the kernel learns how to do it?). In general, the only thing we can treat as immutable for now is the default huge page size, and I wouldn't bet on that either.


In the <cpu> element, we should report which page sizes are available
for the CPU model.

Okay.


In the <topology> element we should report the number of pages of each
size that are present in that node. We shouldn't treat huge pages
separately from small pages in this respect.

Well, if we do that you'll still need two API calls (and hence you'll lack atomicity): virConnectGetCapabilities() followed by virNodeGetFreePages(). On the other hand, the virNodeGetFreePages() itself is not atomic either (from system POV - other processes can jump in and allocate huge pages as libvirtd is executing the API).


So as an example

  - CPU supports 3 page sizes 4k, 2MB, 1GB
  - 2 numa nodes
  - 3 GB of memory per numa node
  - First node has
     - 262144  * 4k pages
     - 2 * 1 GB pages
  - Second node has
     - 1536 * 2 MB pages

This would look like this

   <host>
     <cpu>
       <arch>x86_64</arch>
       <model>Westmere</model>
       <vendor>Intel</vendor>
       <topology sockets='1' cores='6' threads='2'/>
       <feature name='rdtscp'/>
       <feature name='pdpe1gb'/>
       <feature name='dca'/>
       <feature name='pdcm'/>
       <feature name='xtpr'/>
       <pages units="KiB" size="4"/>
       <pages units="MiB" size="2"/>
       <pages units="GiB" size="1"/>
     </cpu>

Right. This makes sense.


     <topology>
       <cells num='2'>
         <cell id='0'>
           <memory unit='KiB'>3221225472</memory>
           <pages unit="KiB"  size="4">262144</pages>
           <pages unit="MiB"  size="2">0</pages>
           <pages unit="GiB"  size="1">2</pages>
           <cpus num='4'>
             <cpu id='0'/>
             <cpu id='2'/>
             <cpu id='4'/>
             <cpu id='6'/>
           </cpus>
         </cell>
         <cell id='1'>
           <memory unit='KiB'>3221225472</memory>
           <pages unit="KiB"  size="4">0</pages>
           <pages unit="MiB"  size="2">1536</pages>
           <pages unit="GiB"  size="1">0</pages>
           <cpus num='4'>
             <cpu id='1'/>
             <cpu id='3'/>
             <cpu id='5'/>
             <cpu id='7'/>
           </cpus>
         </cell>
       </cells>
     </topology>


Well, and this makes sense to some extent too. We can report the pool size, but since it may vary it's on the same level as the number of free pages. What about:

<pages unit='MiB' size='2' free='8' avail='8'/>

That way we don't even need a new API (not saying we shouldn't implement it though - API is still better than updating capabilities each single time).
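To illustrate the idea, cell 0 from the example above might then read as follows (the free/avail attribute names are the proposal from this message, purely illustrative; one of the two 1GB pages is assumed to be in use):

```xml
<cell id='0'>
  <memory unit='KiB'>3221225472</memory>
  <pages unit='KiB' size='4' avail='262144' free='262144'/>
  <pages unit='MiB' size='2' avail='0' free='0'/>
  <pages unit='GiB' size='1' avail='2' free='1'/>
</cell>
```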



So then an API call to request the available pages on all nodes
would look something like

    virNodeGetFreePages(virConnectPtr conn,
                        unsigned int *pages,
                        unsigned int npages,
                        unsigned int startcell,
                        unsigned int cellcount,
                        unsigned long long *counts);

In this API

  @pages - array whose elements are the page sizes to request info for
  @npages - number of elements in @pages
  @startcell - ID of first NUMA cell to request data for
  @cellcount - number of cells to request data for
  @counts - array which is @npages * @cellcount in length

I wanted to use virTypedParam to keep the API open to future additions (e.g. if somebody needs the count of surplus or reserved huge pages).


So if you want free count for all page sizes on all NUMA nodes
you might use this as

    unsigned int pages[] = { 4096, 2097152, 1073741824 };
    unsigned int npages = ARRAY_CARDINALITY(pages);
    unsigned int startcell = 0;
    unsigned int cellcount = 2;

    unsigned long long *counts = malloc(sizeof(*counts) * npages * cellcount);

    virNodeGetFreePages(conn, pages, npages,
                        startcell, cellcount, counts);

    for (i = 0 ; i < cellcount ; i++) {
        fprintf(stdout, "Cell %u\n", startcell + i);
        for (j = 0 ; j < npages ; j++) {
           fprintf(stdout, "  Page size=%u count=%llu bytes=%llu\n",
                   pages[j], counts[(i * npages) + j],
                   pages[j] * counts[(i * npages) + j]);
        }
        fprintf(stdout, "\n");
    }

  Cell 0
     Page size=4096 count=300 bytes=1228800
     Page size=2097152 count=0 bytes=0
     Page size=1073741824 count=1 bytes=1073741824
  Cell 1
     Page size=4096 count=0 bytes=0
     Page size=2097152 count=20 bytes=41943040
     Page size=1073741824 count=0 bytes=0


Or you could request free count for one specific node, or for one specific
page size.

This new API would basically obsolete the existing virNodeGetCellsFreeMemory
by providing something that gave you data on all pages at once, instead of
only data on the smallest page size.

Currently, the API I'm proposing returns the tuple (PAGE_SIZE, PAGE_AVAILABLE, PAGE_FREE) repeated for each huge page size. For instance:

PAGE_SIZE=2048, PAGE_AVAILABLE=8, PAGE_FREE=8, PAGE_SIZE=1048576, PAGE_AVAILABLE=4, PAGE_FREE=2

If we want to have atomic API, how about fitting in NUMA node number?

NUMA_NODE=0, PAGE_SIZE=2048, PAGE_AVAILABLE=..., PAGE_SIZE=1048576,...\
NUMA_NODE=1, PAGE_SIZE=2048, ...

So the API would then look like

virNodeHugeTLB(virConnectPtr conn,
               virTypedParameterPtr params,
               int *nparams,
               unsigned int flags);

Michal

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list



