On 22.01.21 19:16, Daniel Henrique Barboza wrote: > > > On 1/22/21 9:50 AM, Michal Privoznik wrote: >> The virtio-mem is paravirtualized mechanism of adding/removing >> memory to/from a VM. A virtio-mem-pci device is split into blocks >> of equal size which are then exposed (all or only a requested >> portion of them) to the guest kernel to use as regular memory. >> Therefore, the device has two important attributes: >> >> 1) block-size, which defines the size of a block >> 2) requested-size, which defines how much memory (in bytes) >> is the device requested to expose to the guest. >> >> The 'block-size' is configured on command line and immutable >> throughout device's lifetime. The 'requested-size' can be set on >> the command line too, but also is adjustable via monitor. In >> fact, that is how management software places its requests to >> change the memory allocation. If it wants to give more memory to >> the guest it changes 'requested-size' to a bigger value, and if it >> wants to shrink guest memory it changes the 'requested-size' to a >> smaller value. Note, value of zero means that guest should >> release all memory offered by the device. Of course, guest has to >> cooperate. Therefore, there is a third attribute 'size' which is >> read only and reflects how much memory the guest still has. This >> can be different to 'requested-size', obviously. Because of name >> clash, I've named it 'actualsize' and it is dealt with in future >> commits (it is a runtime information anyway). >> >> In the backend, memory for virtio-mem is backed by usual objects: >> memory-backend-{ram,file,memfd} and their size puts the cap on >> the amount of memory that a virtio-mem device can offer to a >> guest. But we are already able to express this info using <size/> >> under <target/>. >> >> Therefore, we need only two more elements to cover 'block-size' >> and 'requested-size' attributes. This is the XML I've came up >> with: >> >> <memory model='virtio-mem'> >> <source> >> <nodemask>1-3</nodemask> >> <pagesize unit='KiB'>2048</pagesize> >> </source> >> <target> >> <size unit='KiB'>2097152</size> >> <node>0</node> >> <block unit='KiB'>2048</block> >> <requested unit='KiB'>1048576</requested> >> </target> >> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> >> </memory> >> >> I hope by now it is obvious that: >> >> 1) 'requested-size' must be an integer multiple of >> 'block-size', and >> 2) virtio-mem-pci device goes onto PCI bus and thus needs PCI >> address. >> >> Then there is a limitation that the minimal 'block-size' is >> transparent huge page size (I'll leave this without explanation). >> >> Signed-off-by: Michal Privoznik <mprivozn@xxxxxxxxxx> >> --- >> docs/formatdomain.rst | 35 ++++++++-- >> docs/schemas/domaincommon.rng | 11 ++++ >> src/conf/domain_conf.c | 53 ++++++++++++++- >> src/conf/domain_conf.h | 3 + >> src/conf/domain_validate.c | 39 +++++++++++ >> src/qemu/qemu_alias.c | 1 + >> src/qemu/qemu_command.c | 1 + >> src/qemu/qemu_domain.c | 10 +++ >> src/qemu/qemu_domain_address.c | 37 ++++++++--- >> src/qemu/qemu_validate.c | 8 +++ >> src/security/security_apparmor.c | 1 + >> src/security/security_dac.c | 2 + >> src/security/security_selinux.c | 2 + >> tests/domaincapsmock.c | 9 +++ >> .../memory-hotplug-virtio-mem.xml | 66 +++++++++++++++++++ >> ...emory-hotplug-virtio-mem.x86_64-latest.xml | 1 + >> tests/qemuxml2xmltest.c | 1 + >> 17 files changed, 264 insertions(+), 16 deletions(-) >> create mode 100644 tests/qemuxml2argvdata/memory-hotplug-virtio-mem.xml >> create mode 120000 tests/qemuxml2xmloutdata/memory-hotplug-virtio-mem.x86_64-latest.xml >> >> diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst >> index af540391db..2938758ec2 100644 >> --- a/docs/formatdomain.rst >> +++ b/docs/formatdomain.rst >> @@ -7267,6 +7267,18 @@ Example: usage of the memory devices >> <size unit='KiB'>524288</size> >> </target> >> </memory> >> + <memory model='virtio-mem'> >> + <source> >> + <nodemask>1-3</nodemask> >> + <pagesize unit='KiB'>2048</pagesize> >> + </source> >> + <target> >> + <size unit='KiB'>2097152</size> >> + <node>0</node> >> + <block unit='KiB'>2048</block> >> + <requested unit='KiB'>1048576</requested> >> + </target> >> + </memory> >> </devices> >> ... >> >> @@ -7274,7 +7286,8 @@ Example: usage of the memory devices >> Provide ``dimm`` to add a virtual DIMM module to the guest. :since:`Since >> 1.2.14` Provide ``nvdimm`` model that adds a Non-Volatile DIMM module. >> :since:`Since 3.2.0` Provide ``virtio-pmem`` model to add a paravirtualized >> - persistent memory device. :since:`Since 7.1.0` >> + persistent memory device. :since:`Since 7.1.0` Provide ``virtio-mem`` model >> + to add paravirtualized memory device. :since: `Since 7.1.0` >> >> ``access`` >> An optional attribute ``access`` ( :since:`since 3.2.0` ) that provides >> @@ -7297,10 +7310,11 @@ Example: usage of the memory devices >> allowed only for ``model='nvdimm'`` for pSeries guests. :since:`Since 6.2.0` >> >> ``source`` >> - For model ``dimm`` this element is optional and allows to fine tune the >> - source of the memory used for the given memory device. If the element is not >> - provided defaults configured via ``numatune`` are used. If ``dimm`` is >> - provided, then the following optional elements can be provided as well: >> + For model ``dimm`` and model ``virtio-mem`` this element is optional and >> + allows to fine tune the source of the memory used for the given memory >> + device. If the element is not provided defaults configured via ``numatune`` >> + are used. If the element is provided, then the following optional elements >> + can be provided: >> >> ``pagesize`` >> This element can be used to override the default host page size used for >> @@ -7366,6 +7380,17 @@ Example: usage of the memory devices >> so other backend types should use the ``readonly`` element. :since:`Since >> 5.0.0` >> >> + ``block`` >> + For ``virtio-mem`` only. >> + The size of an individual block, granularity of division of memory module. >> + Must be power of two and at least equal to size of a transparent hugepage >> + (2MiB on x84_64). The default is hypervisor dependant. > > I don't think that 'dependant' is wrong in this context but 'dependent' is more > common. > >> + >> + ``requested`` >> + For ``virtio-mem`` only. >> + The total size of blocks exposed to the guest. Must respect ``block`` >> + granularity. >> + >> :anchor:`<a id="elementsIommu"/>` >> >> IOMMU devices >> diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng >> index a4bddcf132..5bc120073e 100644 >> --- a/docs/schemas/domaincommon.rng >> +++ b/docs/schemas/domaincommon.rng >> @@ -6020,6 +6020,7 @@ >> <value>dimm</value> >> <value>nvdimm</value> >> <value>virtio-pmem</value> >> + <value>virtio-mem</value> >> </choice> >> </attribute> >> <optional> >> @@ -6104,6 +6105,16 @@ >> <ref name="unsignedInt"/> >> </element> >> </optional> >> + <optional> >> + <element name="block"> >> + <ref name="scaledInteger"/> >> + </element> >> + </optional> >> + <optional> >> + <element name="requested"> >> + <ref name="scaledInteger"/> >> + </element> >> + </optional> >> <optional> >> <element name="label"> >> <element name="size"> >> diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c >> index dab4f10326..f8c5a40b24 100644 >> --- a/src/conf/domain_conf.c >> +++ b/src/conf/domain_conf.c >> @@ -1310,6 +1310,7 @@ VIR_ENUM_IMPL(virDomainMemoryModel, >> "dimm", >> "nvdimm", >> "virtio-pmem", >> + "virtio-mem", >> ); >> >> VIR_ENUM_IMPL(virDomainShmemModel, >> @@ -5359,6 +5360,7 @@ virDomainMemoryDefPostParse(virDomainMemoryDefPtr mem, >> } >> break; >> >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM: >> case VIR_DOMAIN_MEMORY_MODEL_DIMM: >> case VIR_DOMAIN_MEMORY_MODEL_NONE: >> case VIR_DOMAIN_MEMORY_MODEL_LAST: >> @@ -15322,6 +15324,7 @@ virDomainMemorySourceDefParseXML(xmlNodePtr node, >> >> switch (def->model) { >> case VIR_DOMAIN_MEMORY_MODEL_DIMM: >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM: >> if (virDomainParseMemory("./pagesize", "./pagesize/@unit", ctxt, >> &def->pagesize, false, false) < 0) >> return -1; >> @@ -15388,7 +15391,8 @@ virDomainMemoryTargetDefParseXML(xmlNodePtr node, >> &def->size, true, false) < 0) >> return -1; >> >> - if (def->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM) { >> + switch (def->model) { >> + case VIR_DOMAIN_MEMORY_MODEL_NVDIMM: >> if (virDomainParseMemory("./label/size", "./label/size/@unit", ctxt, >> &def->labelsize, false, false) < 0) >> return -1; >> @@ -15407,6 +15411,23 @@ virDomainMemoryTargetDefParseXML(xmlNodePtr node, >> >> if (virXPathBoolean("boolean(./readonly)", ctxt)) >> def->readonly = true; >> + break; >> + >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM: >> + if (virDomainParseMemory("./block", "./block/@unit", ctxt, >> + &def->blocksize, false, false) < 0) >> + return -1; >> + >> + if (virDomainParseMemory("./requested", "./requested/@unit", ctxt, >> + &def->requestedsize, false, false) < 0) >> + return -1; >> + break; >> + >> + case VIR_DOMAIN_MEMORY_MODEL_NONE: >> + case VIR_DOMAIN_MEMORY_MODEL_DIMM: >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_PMEM: >> + case VIR_DOMAIN_MEMORY_MODEL_LAST: >> + break; >> } >> >> return 0; >> @@ -17214,11 +17235,14 @@ virDomainMemoryFindByDefInternal(virDomainDefPtr def, >> /* target info -> always present */ >> if (tmp->model != mem->model || >> tmp->targetNode != mem->targetNode || >> - tmp->size != mem->size) >> + tmp->size != mem->size || >> + tmp->blocksize != mem->blocksize || >> + tmp->requestedsize != mem->requestedsize) >> continue; >> >> switch (mem->model) { >> case VIR_DOMAIN_MEMORY_MODEL_DIMM: >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM: >> /* source stuff -> match with device */ >> if (tmp->pagesize != mem->pagesize) >> continue; >> @@ -22784,6 +22808,22 @@ virDomainMemoryDefCheckABIStability(virDomainMemoryDefPtr src, >> return false; >> } >> >> + if (src->blocksize != dst->blocksize) { >> + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, >> + _("Target memory device block size '%llu' doesn't match " >> + "source memory device block size '%llu'"), >> + dst->blocksize, src->blocksize); >> + return false; >> + } >> + >> + if (src->requestedsize != dst->requestedsize) { >> + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, >> + _("Target memory device requested size '%llu' doesn't match " >> + "source memory device requested size '%llu'"), >> + dst->requestedsize, src->requestedsize); >> + return false; >> + } >> + >> if (src->model == VIR_DOMAIN_MEMORY_MODEL_NVDIMM) { >> if (src->labelsize != dst->labelsize) { >> virReportError(VIR_ERR_CONFIG_UNSUPPORTED, >> @@ -26507,6 +26547,7 @@ virDomainMemorySourceDefFormat(virBufferPtr buf, >> >> switch (def->model) { >> case VIR_DOMAIN_MEMORY_MODEL_DIMM: >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM: >> if (def->sourceNodes) { >> if (!(bitmap = virBitmapFormat(def->sourceNodes))) >> return -1; >> @@ -26563,6 +26604,14 @@ virDomainMemoryTargetDefFormat(virBufferPtr buf, >> if (def->readonly) >> virBufferAddLit(&childBuf, "<readonly/>\n"); >> >> + if (def->blocksize) { >> + virBufferAsprintf(&childBuf, "<block unit='KiB'>%llu</block>\n", >> + def->blocksize); >> + >> + virBufferAsprintf(&childBuf, "<requested unit='KiB'>%llu</requested>\n", >> + def->requestedsize); >> + } >> + >> virXMLFormatElement(buf, "target", NULL, &childBuf); >> } >> >> diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h >> index 95ad052891..5d89ecfe9d 100644 >> --- a/src/conf/domain_conf.h >> +++ b/src/conf/domain_conf.h >> @@ -2308,6 +2308,7 @@ typedef enum { >> VIR_DOMAIN_MEMORY_MODEL_DIMM, /* dimm hotpluggable memory device */ >> VIR_DOMAIN_MEMORY_MODEL_NVDIMM, /* nvdimm memory device */ >> VIR_DOMAIN_MEMORY_MODEL_VIRTIO_PMEM, /* virtio-pmem memory device */ >> + VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM, /* virtio-mem memory device */ >> >> VIR_DOMAIN_MEMORY_MODEL_LAST >> } virDomainMemoryModel; >> @@ -2328,6 +2329,8 @@ struct _virDomainMemoryDef { >> int targetNode; >> unsigned long long size; /* kibibytes */ >> unsigned long long labelsize; /* kibibytes; valid only for NVDIMM */ >> + unsigned long long blocksize; /* kibibytes; valid only for VIRTIO_MEM */ >> + unsigned long long requestedsize; /* kibibytes; valid only for VIRTIO_MEM */ >> bool readonly; /* valid only for NVDIMM */ >> >> /* required for QEMU NVDIMM ppc64 support */ >> diff --git a/src/conf/domain_validate.c b/src/conf/domain_validate.c >> index 649fc335ac..b5a0c09468 100644 >> --- a/src/conf/domain_validate.c >> +++ b/src/conf/domain_validate.c >> @@ -25,6 +25,7 @@ >> #include "virconftypes.h" >> #include "virlog.h" >> #include "virutil.h" >> +#include "virhostmem.h" >> >> #define VIR_FROM_THIS VIR_FROM_DOMAIN >> >> @@ -1389,6 +1390,8 @@ static int >> virDomainMemoryDefValidate(const virDomainMemoryDef *mem, >> const virDomainDef *def) >> { >> + unsigned long long thpSize; >> + >> switch (mem->model) { >> case VIR_DOMAIN_MEMORY_MODEL_NVDIMM: >> if (!mem->nvdimmPath) { >> @@ -1442,6 +1445,42 @@ virDomainMemoryDefValidate(const virDomainMemoryDef *mem, >> _("virtio-pmem does not support NUMA nodes")); >> return -1; >> } >> + break; >> + >> + case VIR_DOMAIN_MEMORY_MODEL_VIRTIO_MEM: >> + if (mem->requestedsize > mem->size) { >> + virReportError(VIR_ERR_XML_DETAIL, "%s", >> + _("requested size must be smaller than @size")); >> + return -1; >> + } >> + >> + if (!VIR_IS_POW2(mem->blocksize)) { >> + virReportError(VIR_ERR_XML_DETAIL, "%s", >> + _("block size must be a power of two")); >> + return -1; >> + } >> + >> + if (virHostMemGetTHPSize(&thpSize) < 0) { >> + /* We failed to get THP size, fall back to a sane default. On >> + * almost every architecture the size will be 2MiB, except for some >> + * funky arches like sparc and m68k. Use 2MiB and refine later if >> + * somebody complains. */ >> + thpSize = 2048; > > FWIW, a Power 9 server uses 2MiB too: > > $ cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size > 2097152 > > > I don't think you should worry about too much since x86 is the only arch that is > supporting virtio-mem (for now). So, in QEMU we use the following logic right now: get_block_size() { block_size = backend_pagesize(); if (block_size == NATIVE_PAGE_SIZE) return MAX(block_size, native_thp_size()); return MAX(block_size, 1 * MiB); } detect_thp_size() { thp_size = read_from_file(); if (!thp_size || thp_size > 16 * MiB) { if (s390x) return 1 * MiB; return 2 * MiB; } return thp_size; } Especially, we also cap big block sizes (esp. arm64 with currently 512 MiB THP), as we prefer flexibility at this point. So yes, on a x86-4 *host* we'll usually end up 2 MiB in QEMU. On arm64 it can be quite different. > > > Reviewed-by: Daniel Henrique Barboza <danielhb413@xxxxxxxxx> -- Thanks, David / dhildenb