On 23.09.2014 21:49, Eric Blake wrote:
On 09/18/2014 02:24 AM, Michal Privoznik wrote:
A long time ago in a galaxy far, far away it was decided
that libvirt would manage not only domains but the host as
well. And with my latest work on the qemu driver supporting
huge pages, we are missing just the cherry on top: an API to
allocate huge pages at runtime. Currently, users are forced
to log into the host and adjust the huge pages pool
themselves. With this API the problem is gone: they can size
the pool both up and down.
Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx>
---
+/**
+ * virNodeAllocPages:
+ * @conn: pointer to the hypervisor connection
+ * @npages: number of items in the @pageSizes array
I'd mention '@pageSizes and @pageCounts arrays'
+ * @pageSizes: which huge page sizes to allocate
+ * @pageCounts: how many pages should be allocated
+ * @startCell: index of first cell to allocate pages on
+ * @cellCount: number of consecutive cells to allocate pages on
+ * @flags: extra flags; binary-OR of virNodeAllocPagesFlags
+ *
+ * Sometimes, when trying to start a new domain, it may be
+ * necessary to reserve some huge pages in the system pool which
+ * can be then allocated by the domain. This API serves that
+ * purpose. On input, @pageSizes and @pageCounts are arrays
+ * of the same cardinality as @npages. @pageSizes contains the
+ * page sizes which are to be allocated in the system (the size
+ * unit is kibibytes), and @pageCounts then contains the number
+ * of pages to reserve. Depending on @flags specified, the number
+ * of pages is either added into the pool and the pool is sized
+ * up (@flags have VIR_NODE_ALLOC_PAGES_ADD set) or the number of
Technically, VIR_NODE_ALLOC_PAGES_ADD is 0, so it can't be set :) Do we
even need this named constant, or can we just drop it?
Well, I'd like to keep it to explicitly state that the pool is added to.
You know:
unsigned int flags = 0;

if (<condition>)
    flags |= VIR_NODE_ALLOC_PAGES_SET;
else
    flags |= VIR_NODE_ALLOC_PAGES_ADD;

virNodeAllocPages(..., flags);
This is just a hint for users, the named value is certainly not needed.
I thought of following already pre-existing VIR_DOMAIN_AFFECT_CURRENT.
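The enum would then read roughly like this (a sketch following the
VIR_DOMAIN_AFFECT_CURRENT pattern; the comments are mine):

typedef enum {
    VIR_NODE_ALLOC_PAGES_ADD = 0,        /* default: add @pageCounts to the pool */
    VIR_NODE_ALLOC_PAGES_SET = (1 << 0), /* treat @pageCounts as the new pool size */
} virNodeAllocPagesFlags;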
+ * pages is looked at the new size of pool and the pool can be
+ * both sized up or down (@flags have VIR_NODE_ALLOC_PAGES_SET
The wording is a bit awkward; maybe:
If @flags is 0 (VIR_NODE_ALLOC_PAGES_ADD), each pool corresponding to
@pageSizes grows by the number of pages specified in the corresponding
@pageCounts. If @flags contains VIR_NODE_ALLOC_PAGES_SET, each pool
mentioned is resized to the given number of pages.
+ * set). The pages pool can be allocated over several NUMA nodes
+ * at once, just point at @startCell and tell how many subsequent
+ * NUMA nodes should be taken in.
Is there shorthand such as @startCell==-1 for allocating across all NUMA
nodes?
Yes there is. I've updated the documentation.
Is there checking that @npages cannot exceed the number of NUMA
nodes available starting at @startCell through @cellCount?
Well, that's a question for 3/4 but, er, how to put it: no, not
currently. I'll update 3/4.
If I understand correctly, usage is something like (pseudo-code):
virNodeAllocPages(conn, 2, (int[]){ [0]=2M, [1]=1G },
                  (int[]){ [0]=1024, [1]=1 },
                  1, 2, 0)
which proceeds to add 1024 pages of size 2M, and 1 page of size 1G, to
the pools associated both with NUMA node 1 and NUMA node 2 (that is, I
just allocated 6G across two nodes).
Exactly!
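Spelled out in actual C, that call might look as follows (a sketch;
the wrapper function and variable names are mine, and page sizes are
given in kibibytes per the docs):

#include <libvirt/libvirt.h>

static int
reserve_pages(virConnectPtr conn)
{
    /* 2048 KiB == 2M pages, 1048576 KiB == 1G pages. */
    unsigned int pageSizes[] = { 2048, 1048576 };
    unsigned long long pageCounts[] = { 1024, 1 };

    /* Grow the 2M pool by 1024 pages and the 1G pool by 1 page on
     * NUMA nodes 1 and 2 (startCell=1, cellCount=2): 3G per node,
     * 6G in total. Passing startCell=-1 would instead spread the
     * allocation over all NUMA nodes, per the updated docs. */
    return virNodeAllocPages(conn, 2, pageSizes, pageCounts,
                             1, 2, VIR_NODE_ALLOC_PAGES_ADD);
}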
+ *
+ * Returns: the number of nodes successfully adjusted or -1 in
+ * case of an error.
Can this function partially succeed? That is, if I pass @npages==2 and
@cellCount==2, will I ever get a non-negative result less than 4? If
the result is 2, which allocations failed (are all allocations on the
first cell attempted before any on the second, or are all allocations to
the first pool size attempted across multiple cells before revisiting
cells to allocate in the second pool size)?
Since there's no rollback on partially successful returns (in fact, the
majority of our APIs have no rollback at all) I find it okay to leave
it like that. Moreover, runtime huge page allocation is still a bit
of magic: the first call may fail, while a later attempt may succeed.
Then again, we already have an API to fetch the pool sizes
(virNodeGetFreePages), so if the Alloc() partially fails, users can
check which pools succeeded and which need to be adjusted again.
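For illustration, checking the pools after a partial failure could
look roughly like this (a sketch; I'm assuming the @counts array is
laid out cell-major, with @npages entries per cell):

#include <stdio.h>
#include <libvirt/libvirt.h>

static int
check_pools(virConnectPtr conn)
{
    unsigned int pageSizes[] = { 2048, 1048576 };
    unsigned long long counts[2 * 2]; /* npages * cellCount entries */
    size_t cell, i;

    /* Query free pages of both sizes on NUMA cells 1 and 2. */
    if (virNodeGetFreePages(conn, 2, pageSizes, 1, 2, counts, 0) < 0)
        return -1;

    for (cell = 0; cell < 2; cell++)
        for (i = 0; i < 2; i++)
            printf("cell %zu: %llu free pages of size %u KiB\n",
                   cell + 1, counts[cell * 2 + i], pageSizes[i]);
    return 0;
}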
+ */
+int
+virNodeAllocPages(virConnectPtr conn,
+                  unsigned int npages,
+                  unsigned int *pageSizes,
+                  unsigned long long *pageCounts,
+                  int startCell,
+                  unsigned int cellCount,
+                  unsigned int flags)
+{
+    VIR_DEBUG("conn=%p npages=%u pageSizes=%p pageCounts=%p "
+              "startCell=%d cellCount=%u flags=%x",
+              conn, npages, pageSizes, pageCounts, startCell,
+              cellCount, flags);
+
+    virResetLastError();
+
+    virCheckConnectReturn(conn, -1);
+    virCheckNonZeroArgGoto(npages, error);
+    virCheckNonNullArgGoto(pageSizes, error);
+    virCheckNonNullArgGoto(pageCounts, error);
+
Where is the validation that startCell and cellCount make sense? I'd
assume we want to ensure cellCount is positive? Or is there a special
case of a count of 0 for visiting all NUMA cells without knowing in
advance how many there are?
Oh right. I'll add the check. If we ever want to go with your idea, we
can just drop the check.
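Something as simple as the following should do (a sketch of the check
being discussed; the exact error message is made up):

    virCheckNonZeroArgGoto(cellCount, error);
    if (startCell < -1) {
        virReportInvalidArg(startCell, "%s",
                            _("startCell must be >= -1"));
        goto error;
    }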
+++ b/src/libvirt_public.syms
@@ -677,6 +677,7 @@ LIBVIRT_1.2.8 {
virDomainListGetStats;
virDomainOpenGraphicsFD;
virDomainStatsRecordListFree;
+ virNodeAllocPages;
} LIBVIRT_1.2.7;
Need a new section for 1.2.9.
I'm still not fully used to our new release style, apparently :) Thanks
for catching that.
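For the record, the new section would read along these lines (modulo
whatever other symbols land in 1.2.9):

LIBVIRT_1.2.9 {
    global:
        virNodeAllocPages;
} LIBVIRT_1.2.8;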
Michal