On 10/12/2016 10:30 PM, Michal Privoznik wrote:
> On 13.10.2016 05:12, John Ferlan wrote:
>>
>>
>> On 10/05/2016 03:30 AM, Michal Privoznik wrote:
>>> In d18c7d7124 we have tried to implement virNodeGetFreePages API
>>> to test driver. And for a very limited definition of success we
>>> have succeeded. But, we can do better. Firstly, we can teach our
>>> internal representation of a NUMA cell that there are different
>>> page sizes and that they create a pool (from which apps alloc
>>> some). For the demonstration purposes a half of the pool is taken
>>> by default, therefore only the other half is free.
>>>
>>> This representation described above also makes it perfect for
>>> implementing virNodeAllocPages API (yet to come in a subsequent
>>> commit).
>>>
>>> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
>>> ---
>>>  src/test/test_driver.c | 100 +++++++++++++++++++++++++++++++++++++------------
>>>  1 file changed, 76 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/src/test/test_driver.c b/src/test/test_driver.c
>>> index b760b4f..a3f74f8 100644
>>> --- a/src/test/test_driver.c
>>> +++ b/src/test/test_driver.c
>>> @@ -72,9 +72,16 @@ VIR_LOG_INIT("test.test_driver");
>>>
>>>  #define MAX_CPUS 128
>>>
>>> +struct _testPage {
>>> +    unsigned long pagesize; /* in KiB */
>>> +    unsigned long pages; /* in total, pagesFree should never
>>> +                            be greater than this. */
>>> +    unsigned long long pagesFree; /* free pages */
>>> +};
>>> +
>>>  struct _testCell {
>>> -    unsigned long mem;
>>> -    unsigned long freeMem;
>>> +    size_t npages;
>>> +    struct _testPage *pages;
>>>      int numCpus;
>>>      virCapsHostNUMACellCPU cpus[MAX_CPUS];
>>>  };
>>> @@ -319,16 +326,17 @@ testBuildCapabilities(virConnectPtr conn)
>>>      if (virCapabilitiesAddHostFeature(caps, "nonpae") < 0)
>>>          goto error;
>>>
>>> -    if (VIR_ALLOC_N(caps->host.pagesSize, 2) < 0)
>>> +    if (VIR_ALLOC_N(caps->host.pagesSize, privconn->cells[0].npages) < 0)
>>>          goto error;
>>
>> I have very limited knowledge of NUMA/Cell, but I guess I don't
>> understand why the host.pagesSize is only referencing the cells[0]
>> values here and in the subsequent loop. It's all a mis
>
>
> Well, so far I'm assuming that all the NUMA nodes support all the sizes
> of huge pages. IOW all the NUMA nodes are the same from this specific
> POV. So if NUMA node #0 supports say 4K and 2M pages, the rest of the
> nodes do as well. Therefore, I don't have to make this more complex than
> it already is.
>
>>
>> Shouldn't this be checking the various cells[n] for the page sizes
>> supported and then allocating pagesSize based upon the different sizes?
>
> Yes, if we were to have different sizes for different NUMA nodes. But we
> don't. And frankly, I've never ever seen a machine out there in the wild
> which does have separate page sizes on NUMA nodes. Maybe there is one.
> But even if there is one, question is whether we want the test driver to
> reflect that. And the next question is whether we want to do it in a
> separate patch (and merging this one meanwhile) or in this one.
>

Guess I was thinking too hard ;-) in trying to understand the algorithm,
but you're right - this is a test driver for a simple case and we're
taking a shortcut... Something that you could just document - what you
were thinking - although perhaps someone else with a deeper understanding
of NUMA would say, well duh why document that...

>>
>> Then when filling things in the nPagesSize[n] would be based on the
>> cells page sizes found?
>
> You mean pageSize[n]? Yes.
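
Right - pageSize[n].  And documenting the "cells[0] stands in for every
cell" assumption right above that VIR_ALLOC_N would be enough for me.
Purely as an untested sketch of the idea (not something this patch has
to add), the comment could even be backed by a paranoia check along
these lines:

    /* The test driver assumes every virtual NUMA cell exposes the
     * same set of page sizes, so cells[0] is representative. */
    for (i = 1; i < privconn->numCells; i++) {
        if (privconn->cells[i].npages != privconn->cells[0].npages) {
            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                           _("test NUMA cells disagree on page sizes"));
            goto error;
        }
    }

(plus a pagesize-by-pagesize comparison if we wanted to be thorough),
but the comment alone would have saved me the head scratching.
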
>
>>
>> It just doesn't look right with the usage of [0]. The config has 2
>> cells that each have 2 pages. The host.pagesSize would then be a list of
>> the page sizes found, right?
>
> Yes. But then again - since all our virtual NUMA nodes for the test
> driver have the same page sizes, we can do this kind of shortcut.
>
>>
>> Future math could then find the number of pages and pagesFree for each
>> specific size.
>>
>>>
>>> -    caps->host.pagesSize[caps->host.nPagesSize++] = 4;
>>> -    caps->host.pagesSize[caps->host.nPagesSize++] = 2048;
>>> +    for (i = 0; i < privconn->cells[i].npages; i++)
>>> +        caps->host.pagesSize[caps->host.nPagesSize++] = privconn->cells[0].pages[i].pagesize;
>>>
>>>      for (i = 0; i < privconn->numCells; i++) {
>>>          virCapsHostNUMACellCPUPtr cpu_cells;
>>>          virCapsHostNUMACellPageInfoPtr pages;
>>>          size_t nPages;
>>> +        unsigned long mem = 0;
>>>
>>>          if (VIR_ALLOC_N(cpu_cells, privconn->cells[i].numCpus) < 0 ||
>>>              VIR_ALLOC_N(pages, caps->host.nPagesSize) < 0) {
>>> @@ -341,12 +349,14 @@ testBuildCapabilities(virConnectPtr conn)
>>>          memcpy(cpu_cells, privconn->cells[i].cpus,
>>>                 sizeof(*cpu_cells) * privconn->cells[i].numCpus);
>>>
>>
>> I would remove the local nPages it's confusing...
>
> Ah, I felt the opposite. But okay. Consider it gone.
>
>>
>> Whether the rest of the following hunk would seem to rely on whether my
>> theory above is true.
>>
>> But essentially filling in the pages[N] would rely on finding the cells
>> with the matching page size from host.pagesSize[N]. IOW: I would think
>> there needs to be an if statement ensuring that cells[i].pagesize ==
>> host.pagesSize[N]. Reading this infernal dual loops is always painful.
>
> Right it is. This is where we have to do the extra step, because usually
> - in other drivers - we just ask kernel (via sysfs). And this double
> loop is implemented in it then.
>
>>
>>> -        for (j = 0; j < nPages; j++)
>>> -            pages[j].size = caps->host.pagesSize[j];
>>> +        for (j = 0; j < nPages; j++) {
>>> +            pages[j].size = privconn->cells[i].pages[j].pagesize;
>>> +            pages[j].avail = privconn->cells[i].pages[j].pages;
>>>
>>> -        pages[0].avail = privconn->cells[i].mem / pages[0].size;
>>> +            mem += pages[j].size * pages[j].avail;
>>> +        }
>>>
>>> -        if (virCapabilitiesAddHostNUMACell(caps, i, privconn->cells[i].mem,
>>> +        if (virCapabilitiesAddHostNUMACell(caps, i, mem,
>>>                                             privconn->cells[i].numCpus,
>>>                                             cpu_cells, 0, NULL, nPages, pages) < 0)
>>>              goto error;
>>> @@ -1285,8 +1295,20 @@ testOpenDefault(virConnectPtr conn)
>>>      privconn->numCells = 2;
>>>      for (i = 0; i < privconn->numCells; i++) {
>>>          privconn->cells[i].numCpus = 8;
>>> -        privconn->cells[i].mem = (i + 1) * 2048 * 1024;
>>> -        privconn->cells[i].freeMem = (i + 1) * 1024 * 1024;
>>> +
>>> +        if (VIR_ALLOC_N(privconn->cells[i].pages, 2) < 0)
>>> +            goto error;
>>> +        privconn->cells[i].npages = 2;
>>> +
>>> +        /* Let the node have some 4k pages */
>>> +        privconn->cells[i].pages[0].pagesize = 4;
>>> +        privconn->cells[i].pages[0].pages = (i + 1) * 2048 * 1024;
>>> +        privconn->cells[i].pages[0].pagesFree = (i + 1) * 1024 * 1024;
>>
>> Not that it probably matters since we're not doing allocations, but I
>> assume the " * 1024" was cut-n-paste from the old .mem and .freeMem
>> which were I believe byte based while your new structure is kb based...
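
To put numbers on that (assuming my arithmetic is right), with the
defaults above cell 0 ends up with

    pages[0].pages     = 2048 * 1024 = 2,097,152 pages of 4 KiB  -> 8 GiB
    pages[0].pagesFree = 1024 * 1024 = 1,048,576 pages of 4 KiB  -> 4 GiB free

and cell 1 doubles that, which does match the "half of the pool is taken"
wording in the commit message.  Harmless for a test driver, just a fair
bit bigger than whatever the old .mem/.freeMem values were meant to be.
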
>>
>>> +
>>> +        /* And also some 2M pages */
>>> +        privconn->cells[i].pages[1].pagesize = 2048;
>>> +        privconn->cells[i].pages[1].pages = (i + 1) * 2048;
>>> +        privconn->cells[i].pages[1].pagesFree = (i + 1) * 128;
>>>      }
>>>      for (i = 0; i < 16; i++) {
>>>          virBitmapPtr siblings = virBitmapNew(16);
>>> @@ -2719,7 +2741,9 @@ static int testNodeGetCellsFreeMemory(virConnectPtr conn,
>>>      for (cell = startCell, i = 0;
>>>           (cell < privconn->numCells && i < maxCells);
>>>           ++cell, ++i) {
>>> -        freemems[i] = privconn->cells[cell].mem;
>>> +        struct _testCell *tmp = &privconn->cells[cell];
>>> +
>>> +        freemems[i] = tmp->pages[0].pagesize * tmp->pages[0].pagesFree * 1024;
>>
>> Why aren't the 2m pages[1] included?
>
> Because freecell only contains 'free memory per each NUMA node'. If
> you'd run 'virsh -c qemu:///system freecell --all' you will not see 2M
> (or any other huge pages) either. Just the regular 4k pages.
> This is basically stupidity of Linux kernel while it threats pages of
> different sizes differently. There are regular system pages (4k) and
> these show up in 'free -m', 'virsh freecell', 'top', ... You don't have
> reserve a pool of them if you want to use them, or allocate them via
> special syscall. Then there are other sizes (usually 2M or 1G) and oh
> boy, everything's different. They don't show up in any of the previous
> tools, and basically - they are treated quite the opposite to what I
> described.
> This is very stupid approach. That's why we have some additional code in
> libvirt that treats all of the pages equally. For instance, you can
> configure your domain to use "hugepages" of size 4K, or 'virsh freepages
> --all' shows ALL the page sizes.

Ahhh... I see...  It's all very confusing and unless you know the rules
and history, it's easy to get lost.

John
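
P.S. Working through the units helped this click for me, so for the
archives (my reading, happy to be corrected): pagesize is in KiB and
pagesFree is a page count, so for cell 0 the new code reports

    freemems[0] = 4 KiB * 1,048,576 free 4K pages * 1024
                = 4,294,967,296 bytes = 4 GiB

i.e. only the ordinary 4K pool, just like 'virsh freecell' on a real
host, while the 2M pool only ever shows up on the freepages side.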