On 10/12/2016 10:30 PM, Michal Privoznik wrote:
> On 13.10.2016 05:12, John Ferlan wrote:
>>
>>
>> On 10/05/2016 03:30 AM, Michal Privoznik wrote:
>>> In d18c7d7124 we have tried to implement virNodeGetFreePages API
>>> to test driver. And for a very limited definition of success we
>>> have succeeded. But, we can do better. Firstly, we can teach our
>>> internal representation of a NUMA cell that there are different
>>> page sizes and that they create a pool (from which apps alloc
>>> some). For the demonstration purposes a half of the pool is taken
>>> by default, therefore only the other half is free.
>>>
>>> This representation described above also makes it perfect for
>>> implementing virNodeAllocPages API (yet to come in a subsequent
>>> commit).
>>>
>>> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
>>> ---
>>>  src/test/test_driver.c | 100 +++++++++++++++++++++++++++++++++++++------------
>>>  1 file changed, 76 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/src/test/test_driver.c b/src/test/test_driver.c
>>> index b760b4f..a3f74f8 100644
>>> --- a/src/test/test_driver.c
>>> +++ b/src/test/test_driver.c
>>> @@ -72,9 +72,16 @@ VIR_LOG_INIT("test.test_driver");
>>>
>>>  #define MAX_CPUS 128
>>>
>>> +struct _testPage {
>>> +    unsigned long pagesize; /* in KiB */
>>> +    unsigned long pages; /* in total, pagesFree should never
>>> +                            be greater than this. */
>>> +    unsigned long long pagesFree; /* free pages */
>>> +};
>>> +
>>>  struct _testCell {
>>> -    unsigned long mem;
>>> -    unsigned long freeMem;
>>> +    size_t npages;
>>> +    struct _testPage *pages;
>>>      int numCpus;
>>>      virCapsHostNUMACellCPU cpus[MAX_CPUS];
>>>  };
>>> @@ -319,16 +326,17 @@ testBuildCapabilities(virConnectPtr conn)
>>>      if (virCapabilitiesAddHostFeature(caps, "nonpae") < 0)
>>>          goto error;
>>>
>>> -    if (VIR_ALLOC_N(caps->host.pagesSize, 2) < 0)
>>> +    if (VIR_ALLOC_N(caps->host.pagesSize, privconn->cells[0].npages) < 0)
>>>          goto error;
>>
>> I have very limited knowledge of NUMA/Cell, but I guess I don't
>> understand why the host.pagesSize is only referencing the cells[0]
>> values here and in the subsequent loop. It's all a mis
>
>
> Well, so far I'm assuming that all the NUMA nodes support all the sizes
> of huge pages. IOW all the NUMA nodes are the same from this specific
> POV. So if NUMA node #0 supports say 4K and 2M pages, the rest of the
> nodes do as well. Therefore, I don't have to make this more complex than
> it already is.
>
>>
>> Shouldn't this be checking the various cells[n] for the page sizes
>> supported and then allocating pagesSize based upon the different sizes?
>
> Yes, if we were to have different sizes for different NUMA nodes. But we
> don't. And frankly, I've never ever seen a machine out there in the wild
> which does have separate page sizes on NUMA nodes. Maybe there is one.
> But even if there is one, question is whether we want the test driver to
> reflect that. And the next question is whether we want to do it in a
> separate patch (and merging this one meanwhile) or in this one.
>

Guess I was thinking too hard ;-) in trying to understand the algorithm,
but you're right - this is a test driver for a simple case and we're
taking a shortcut... Something that you could just document - what you
were thinking - although perhaps someone else with a deeper understanding
of NUMA would say, well duh why document that...

>>
>> Then when filling things in the nPagesSize[n] would be based on the
>> cells page sizes found?
>
> You mean pageSize[n]? Yes.
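
Right - pageSize[n].  And documenting the "cells[0] stands in for every
cell" assumption right above that VIR_ALLOC_N would be enough for me.
Purely as an untested sketch of the idea (not something this patch has
to add), the comment could even be backed by a paranoia check along
these lines:

    /* The test driver assumes every virtual NUMA cell exposes the
     * same set of page sizes, so cells[0] is representative. */
    for (i = 1; i < privconn->numCells; i++) {
        if (privconn->cells[i].npages != privconn->cells[0].npages) {
            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                           _("test NUMA cells disagree on page sizes"));
            goto error;
        }
    }

(plus a pagesize-by-pagesize comparison if we wanted to be thorough),
but the comment alone would have saved me the head scratching.
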
>
>>
>> It just doesn't look right with the usage of [0]. The config has 2
>> cells that each have 2 pages. The host.pagesSize would then be a list of
>> the page sizes found, right?
>
> Yes. But then again - since all our virtual NUMA nodes for the test
> driver have the same page sizes, we can do this kind of shortcut.
>
>>
>> Future math could then find the number of pages and pagesFree for each
>> specific size.
>>
>>>
>>> -    caps->host.pagesSize[caps->host.nPagesSize++] = 4;
>>> -    caps->host.pagesSize[caps->host.nPagesSize++] = 2048;
>>> +    for (i = 0; i < privconn->cells[i].npages; i++)
>>> +        caps->host.pagesSize[caps->host.nPagesSize++] = privconn->cells[0].pages[i].pagesize;
>>>
>>>      for (i = 0; i < privconn->numCells; i++) {
>>>          virCapsHostNUMACellCPUPtr cpu_cells;
>>>          virCapsHostNUMACellPageInfoPtr pages;
>>>          size_t nPages;
>>> +        unsigned long mem = 0;
>>>
>>>          if (VIR_ALLOC_N(cpu_cells, privconn->cells[i].numCpus) < 0 ||
>>>              VIR_ALLOC_N(pages, caps->host.nPagesSize) < 0) {
>>> @@ -341,12 +349,14 @@ testBuildCapabilities(virConnectPtr conn)
>>>          memcpy(cpu_cells, privconn->cells[i].cpus,
>>>                 sizeof(*cpu_cells) * privconn->cells[i].numCpus);
>>>
>>
>> I would remove the local nPages it's confusing...
>
> Ah, I felt the opposite. But okay. Consider it gone.
>
>>
>> Whether the rest of the following hunk would seem to rely on whether my
>> theory above is true.
>>
>> But essentially filling in the pages[N] would rely on finding the cells
>> with the matching page size from host.pagesSize[N]. IOW: I would think
>> there needs to be an if statement ensuring that cells[i].pagesize ==
>> host.pagesSize[N]. Reading this infernal dual loops is always painful.
>
> Right it is. This is where we have to do the extra step, because usually
> - in other drivers - we just ask kernel (via sysfs). And this double
> loop is implemented in it then.
>
>>
>>> -        for (j = 0; j < nPages; j++)
>>> -            pages[j].size = caps->host.pagesSize[j];
>>> +        for (j = 0; j < nPages; j++) {
>>> +            pages[j].size = privconn->cells[i].pages[j].pagesize;
>>> +            pages[j].avail = privconn->cells[i].pages[j].pages;
>>>
>>> -        pages[0].avail = privconn->cells[i].mem / pages[0].size;
>>> +            mem += pages[j].size * pages[j].avail;
>>> +        }
>>>
>>> -        if (virCapabilitiesAddHostNUMACell(caps, i, privconn->cells[i].mem,
>>> +        if (virCapabilitiesAddHostNUMACell(caps, i, mem,
>>>                                             privconn->cells[i].numCpus,
>>>                                             cpu_cells, 0, NULL, nPages, pages) < 0)
>>>              goto error;
>>> @@ -1285,8 +1295,20 @@ testOpenDefault(virConnectPtr conn)
>>>      privconn->numCells = 2;
>>>      for (i = 0; i < privconn->numCells; i++) {
>>>          privconn->cells[i].numCpus = 8;
>>> -        privconn->cells[i].mem = (i + 1) * 2048 * 1024;
>>> -        privconn->cells[i].freeMem = (i + 1) * 1024 * 1024;
>>> +
>>> +        if (VIR_ALLOC_N(privconn->cells[i].pages, 2) < 0)
>>> +            goto error;
>>> +        privconn->cells[i].npages = 2;
>>> +
>>> +        /* Let the node have some 4k pages */
>>> +        privconn->cells[i].pages[0].pagesize = 4;
>>> +        privconn->cells[i].pages[0].pages = (i + 1) * 2048 * 1024;
>>> +        privconn->cells[i].pages[0].pagesFree = (i + 1) * 1024 * 1024;
>>
>> Not that it probably matters since we're not doing allocations, but I
>> assume the " * 1024" was cut-n-paste from the old .mem and .freeMem
>> which were I believe byte based while your new structure is kb based...
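
To put numbers on that (assuming my arithmetic is right), with the
defaults above cell 0 ends up with

    pages[0].pages     = 2048 * 1024 = 2,097,152 pages of 4 KiB  -> 8 GiB
    pages[0].pagesFree = 1024 * 1024 = 1,048,576 pages of 4 KiB  -> 4 GiB free

and cell 1 doubles that, which does match the "half of the pool is taken"
wording in the commit message.  Harmless for a test driver, just a fair
bit bigger than whatever the old .mem/.freeMem values were meant to be.
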
>>
>>> +
>>> +        /* And also some 2M pages */
>>> +        privconn->cells[i].pages[1].pagesize = 2048;
>>> +        privconn->cells[i].pages[1].pages = (i + 1) * 2048;
>>> +        privconn->cells[i].pages[1].pagesFree = (i + 1) * 128;
>>>      }
>>>      for (i = 0; i < 16; i++) {
>>>          virBitmapPtr siblings = virBitmapNew(16);
>>> @@ -2719,7 +2741,9 @@ static int testNodeGetCellsFreeMemory(virConnectPtr conn,
>>>      for (cell = startCell, i = 0;
>>>           (cell < privconn->numCells && i < maxCells);
>>>           ++cell, ++i) {
>>> -        freemems[i] = privconn->cells[cell].mem;
>>> +        struct _testCell *tmp = &privconn->cells[cell];
>>> +
>>> +        freemems[i] = tmp->pages[0].pagesize * tmp->pages[0].pagesFree * 1024;
>>
>> Why aren't the 2m pages[1] included?
>
> Because freecell only contains 'free memory per each NUMA node'. If
> you'd run 'virsh -c qemu:///system freecell --all' you will not see 2M
> (or any other huge pages) either. Just the regular 4k pages.
> This is basically stupidity of Linux kernel while it threats pages of
> different sizes differently. There are regular system pages (4k) and
> these show up in 'free -m', 'virsh freecell', 'top', ... You don't have
> reserve a pool of them if you want to use them, or allocate them via
> special syscall. Then there are other sizes (usually 2M or 1G) and oh
> boy, everything's different. They don't show up in any of the previous
> tools, and basically - they are treated quite the opposite to what I
> described.
> This is very stupid approach. That's why we have some additional code in
> libvirt that treats all of the pages equally. For instance, you can
> configure your domain to use "hugepages" of size 4K, or 'virsh freepages
> --all' shows ALL the page sizes.

Ahhh... I see...  It's all very confusing and unless you know the rules
and history, it's easy to get lost.

John
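
P.S. Working through the units helped this click for me, so for the
archives (my reading, happy to be corrected): pagesize is in KiB and
pagesFree is a page count, so for cell 0 the new code reports

    freemems[0] = 4 KiB * 1,048,576 free 4K pages * 1024
                = 4,294,967,296 bytes = 4 GiB

i.e. only the ordinary 4K pool, just like 'virsh freecell' on a real
host, while the 2M pool only ever shows up on the freepages side.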