On 13.10.2016 05:12, John Ferlan wrote:
> 
> 
> On 10/05/2016 03:30 AM, Michal Privoznik wrote:
>> In d18c7d7124 we have tried to implement virNodeGetFreePages API
>> to test driver. And for a very limited definition of success we
>> have succeeded. But, we can do better. Firstly, we can teach our
>> internal representation of a NUMA cell that there are different
>> page sizes and that they create a pool (from which apps alloc
>> some). For the demonstration purposes a half of the pool is taken
>> by default, therefore only the other half is free.
>>
>> This representation described above also makes it perfect for
>> implementing virNodeAllocPages API (yet to come in a subsequent
>> commit).
>>
>> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
>> ---
>>  src/test/test_driver.c | 100 +++++++++++++++++++++++++++++++++++++------------
>>  1 file changed, 76 insertions(+), 24 deletions(-)
>>
>> diff --git a/src/test/test_driver.c b/src/test/test_driver.c
>> index b760b4f..a3f74f8 100644
>> --- a/src/test/test_driver.c
>> +++ b/src/test/test_driver.c
>> @@ -72,9 +72,16 @@ VIR_LOG_INIT("test.test_driver");
>>
>>  #define MAX_CPUS 128
>>
>> +struct _testPage {
>> +    unsigned long pagesize; /* in KiB */
>> +    unsigned long pages; /* in total, pagesFree should never
>> +                            be greater than this. */
>> +    unsigned long long pagesFree; /* free pages */
>> +};
>> +
>>  struct _testCell {
>> -    unsigned long mem;
>> -    unsigned long freeMem;
>> +    size_t npages;
>> +    struct _testPage *pages;
>>      int numCpus;
>>      virCapsHostNUMACellCPU cpus[MAX_CPUS];
>>  };
>> @@ -319,16 +326,17 @@ testBuildCapabilities(virConnectPtr conn)
>>      if (virCapabilitiesAddHostFeature(caps, "nonpae") < 0)
>>          goto error;
>>
>> -    if (VIR_ALLOC_N(caps->host.pagesSize, 2) < 0)
>> +    if (VIR_ALLOC_N(caps->host.pagesSize, privconn->cells[0].npages) < 0)
>>          goto error;
> 
> I have very limited knowledge of NUMA/Cell, but I guess I don't
> understand why the host.pagesSize is only referencing the cells[0]
> values here and in the subsequent loop. It's all a mis

Well, so far I'm assuming that all the NUMA nodes support all the sizes
of huge pages. IOW, all the NUMA nodes are the same from this specific
POV. So if NUMA node #0 supports, say, 4K and 2M pages, the rest of the
nodes do as well. Therefore, I don't have to make this more complex
than it already is.

> 
> Shouldn't this be checking the various cells[n] for the page sizes
> supported and then allocating pagesSize based upon the different sizes?

Yes, if we were to have different sizes for different NUMA nodes. But we
don't. And frankly, I've never ever seen a machine out there in the wild
which does have separate page sizes on NUMA nodes. Maybe there is one.
But even if there is one, the question is whether we want the test
driver to reflect that. And the next question is whether we want to do
it in a separate patch (and merge this one meanwhile) or in this one.

> 
> Then when filling things in the nPagesSize[n] would be based on the
> cells page sizes found?

You mean pagesSize[n]? Yes.

> 
> It just doesn't look right with the usage of [0]. The config has 2
> cells that each have 2 pages. The host.pagesSize would then be a list of
> the page sizes found, right?

Yes. But then again - since all our virtual NUMA nodes for the test
driver have the same page sizes, we can do this kind of shortcut (a
small sketch of that invariant is below).

> 
> Future math could then find the number of pages and pagesFree for each
> specific size.
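To make the assumption explicit, here is a minimal sketch of the
invariant the cells[0] shortcut relies on - every cell exposes exactly
the same page sizes, so cells[0] can stand in for all of them. The
helper is hypothetical (it is not part of the patch) and only assumes
the struct _testCell / struct _testPage layout quoted above:

/* hypothetical helper, not in the patch; needs <stdbool.h> and the
 * struct _testCell / struct _testPage definitions from above */
static bool
testCellsHaveSamePageSizes(struct _testCell *cells, size_t ncells)
{
    size_t i, j;

    /* compare every cell's page-size list against cell #0 */
    for (i = 1; i < ncells; i++) {
        if (cells[i].npages != cells[0].npages)
            return false;

        for (j = 0; j < cells[0].npages; j++) {
            if (cells[i].pages[j].pagesize != cells[0].pages[j].pagesize)
                return false;
        }
    }

    return true;
}

If this ever returned false, the cells[0]-based capability code above
would misreport the page sizes of the other cells.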
> 
>>
>> -    caps->host.pagesSize[caps->host.nPagesSize++] = 4;
>> -    caps->host.pagesSize[caps->host.nPagesSize++] = 2048;
>> +    for (i = 0; i < privconn->cells[i].npages; i++)
>> +        caps->host.pagesSize[caps->host.nPagesSize++] = privconn->cells[0].pages[i].pagesize;
>>
>>      for (i = 0; i < privconn->numCells; i++) {
>>          virCapsHostNUMACellCPUPtr cpu_cells;
>>          virCapsHostNUMACellPageInfoPtr pages;
>>          size_t nPages;
>> +        unsigned long mem = 0;
>>
>>          if (VIR_ALLOC_N(cpu_cells, privconn->cells[i].numCpus) < 0 ||
>>              VIR_ALLOC_N(pages, caps->host.nPagesSize) < 0) {
>> @@ -341,12 +349,14 @@ testBuildCapabilities(virConnectPtr conn)
>>          memcpy(cpu_cells, privconn->cells[i].cpus,
>>                 sizeof(*cpu_cells) * privconn->cells[i].numCpus);
>>
> 
> I would remove the local nPages it's confusing...

Ah, I felt the opposite. But okay. Consider it gone.

> 
> Whether the rest of the following hunk would seem to rely on whether my
> theory above is true.
> 
> But essentially filling in the pages[N] would rely on finding the cells
> with the matching page size from host.pagesSize[N]. IOW: I would think
> there needs to be an if statement ensuring that cells[i].pagesize ==
> host.pagesSize[N]. Reading this infernal dual loops is always painful.

Right it is. This is where we have to do the extra step, because
usually - in other drivers - we just ask the kernel (via sysfs), and
that is where this double loop is then implemented.

> 
>> -        for (j = 0; j < nPages; j++)
>> -            pages[j].size = caps->host.pagesSize[j];
>> +        for (j = 0; j < nPages; j++) {
>> +            pages[j].size = privconn->cells[i].pages[j].pagesize;
>> +            pages[j].avail = privconn->cells[i].pages[j].pages;
>>
>> -        pages[0].avail = privconn->cells[i].mem / pages[0].size;
>> +            mem += pages[j].size * pages[j].avail;
>> +        }
>>
>> -        if (virCapabilitiesAddHostNUMACell(caps, i, privconn->cells[i].mem,
>> +        if (virCapabilitiesAddHostNUMACell(caps, i, mem,
>>                                             privconn->cells[i].numCpus,
>>                                             cpu_cells, 0, NULL, nPages, pages) < 0)
>>              goto error;
>> @@ -1285,8 +1295,20 @@ testOpenDefault(virConnectPtr conn)
>>      privconn->numCells = 2;
>>      for (i = 0; i < privconn->numCells; i++) {
>>          privconn->cells[i].numCpus = 8;
>> -        privconn->cells[i].mem = (i + 1) * 2048 * 1024;
>> -        privconn->cells[i].freeMem = (i + 1) * 1024 * 1024;
>> +
>> +        if (VIR_ALLOC_N(privconn->cells[i].pages, 2) < 0)
>> +            goto error;
>> +        privconn->cells[i].npages = 2;
>> +
>> +        /* Let the node have some 4k pages */
>> +        privconn->cells[i].pages[0].pagesize = 4;
>> +        privconn->cells[i].pages[0].pages = (i + 1) * 2048 * 1024;
>> +        privconn->cells[i].pages[0].pagesFree = (i + 1) * 1024 * 1024;
> 
> Not that it probably matters since we're not doing allocations, but I
> assume the " * 1024" was cut-n-paste from the old .mem and .freeMem
> which were I believe byte based while your new structure is kb based...
> 
>> +
>> +        /* And also some 2M pages */
>> +        privconn->cells[i].pages[1].pagesize = 2048;
>> +        privconn->cells[i].pages[1].pages = (i + 1) * 2048;
>> +        privconn->cells[i].pages[1].pagesFree = (i + 1) * 128;
>>      }
>>      for (i = 0; i < 16; i++) {
>>          virBitmapPtr siblings = virBitmapNew(16);
>> @@ -2719,7 +2741,9 @@ static int testNodeGetCellsFreeMemory(virConnectPtr conn,
>>      for (cell = startCell, i = 0;
>>           (cell < privconn->numCells && i < maxCells);
>>           ++cell, ++i) {
>> -        freemems[i] = privconn->cells[cell].mem;
>> +        struct _testCell *tmp = &privconn->cells[cell];
>> +
>> +        freemems[i] = tmp->pages[0].pagesize * tmp->pages[0].pagesFree * 1024;
> 
> Why aren't the 2m pages[1] included?
Because freecell only reports "free memory per NUMA node". If you run
'virsh -c qemu:///system freecell --all' you will not see 2M (or any
other huge) pages either, just the regular 4k pages. This is basically
a stupidity of the Linux kernel, in that it treats pages of different
sizes differently. There are regular system pages (4k) and these show
up in 'free -m', 'virsh freecell', 'top', ... You don't have to reserve
a pool of them if you want to use them, or allocate them via a special
syscall. Then there are the other sizes (usually 2M or 1G) and oh boy,
everything's different: they don't show up in any of the previous
tools, and basically they are treated quite the opposite of what I
described. This is a very stupid approach. That's why we have some
additional code in libvirt that treats all of the pages equally. For
instance, you can configure your domain to use "hugepages" of size 4K,
and 'virsh freepages --all' shows ALL the page sizes.

Michal
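To make the freecell vs. freepages distinction concrete, here is a tiny
standalone sketch (plain C, not libvirt code; the struct and numbers
merely mirror the defaults this patch gives to cell #0). The
freecell-style figure counts only the regular 4K pool, while a
freepages-style query reports a free count for every page size:

#include <stdio.h>

/* standalone mock of the per-cell page pools from the patch */
struct page_pool {
    unsigned long pagesize;       /* KiB */
    unsigned long long pages;     /* total pages in the pool */
    unsigned long long pagesFree; /* free pages in the pool */
};

int main(void)
{
    /* roughly the defaults the patch sets up for cell #0 */
    struct page_pool cell0[] = {
        { 4,    2048ULL * 1024, 1024ULL * 1024 }, /* regular 4K pages */
        { 2048, 2048,           128 },            /* 2M huge pages */
    };
    size_t npools = sizeof(cell0) / sizeof(cell0[0]);
    size_t i;

    /* freecell-style: only the regular 4K pool counts, reported in bytes */
    unsigned long long freemem = cell0[0].pagesize * cell0[0].pagesFree * 1024;
    printf("freecell-style free memory: %llu bytes\n", freemem);

    /* freepages-style: every page size gets its own free count */
    for (i = 0; i < npools; i++)
        printf("%lu KiB pages free: %llu\n",
               cell0[i].pagesize, cell0[i].pagesFree);

    return 0;
}

With these defaults the freecell-style number comes out as 4294967296
bytes (4 KiB * 1048576 free pages * 1024), while the freepages-style
output additionally shows the 128 free 2M pages that freecell never
mentions.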