3.4-stable review patch. If anyone has any objections, please let me know. ------------------ From: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> commit 2e2fb75475c2fc74c98100f1468c8195fee49f3b upstream. When the Xen hypervisor boots a PV kernel it hands it two pieces of information: nr_pages and a made up E820 entry. The nr_pages value defines the range from zero to nr_pages of PFNs which have a valid Machine Frame Number (MFN) underneath it. The E820 mirrors that (with the VGA hole): BIOS-provided physical RAM map: Xen: 0000000000000000 - 00000000000a0000 (usable) Xen: 00000000000a0000 - 0000000000100000 (reserved) Xen: 0000000000100000 - 0000000080800000 (usable) The fun comes when a PV guest that is run with a machine E820 - that can either be the initial domain or a PCI PV guest, where the E820 looks like the normal thing: BIOS-provided physical RAM map: Xen: 0000000000000000 - 000000000009e000 (usable) Xen: 000000000009ec00 - 0000000000100000 (reserved) Xen: 0000000000100000 - 0000000020000000 (usable) Xen: 0000000020000000 - 0000000020200000 (reserved) Xen: 0000000020200000 - 0000000040000000 (usable) Xen: 0000000040000000 - 0000000040200000 (reserved) Xen: 0000000040200000 - 00000000bad80000 (usable) Xen: 00000000bad80000 - 00000000badc9000 (ACPI NVS) .. With that overlaying the nr_pages directly on the E820 does not work as there are gaps and non-RAM regions that won't be used by the memory allocator. The 'xen_release_chunk' helps with that by punching holes in the P2M (PFN to MFN lookup tree) for those regions and tells us that: Freeing 20000-20200 pfn range: 512 pages freed Freeing 40000-40200 pfn range: 512 pages freed Freeing bad80-badf4 pfn range: 116 pages freed Freeing badf6-bae7f pfn range: 137 pages freed Freeing bb000-100000 pfn range: 282624 pages freed Released 283999 pages of unused memory Those 283999 pages are subtracted from the nr_pages and are returned to the hypervisor. The end result is that the initial domain boots with 1GB less memory as the nr_pages has been subtracted by the amount of pages residing within the PCI hole. It can balloon up to that if desired using 'xl mem-set 0 8092', but the balloon driver is not always compiled in for the initial domain. This patch, implements the populate hypercall (XENMEM_populate_physmap) which increases the the domain with the same amount of pages that were released. The other solution (that did not work) was to transplant the MFN in the P2M tree - the ones that were going to be freed were put in the E820_RAM regions past the nr_pages. But the modifications to the M2P array (the other side of creating PTEs) were not carried away. As the hypervisor is the only one capable of modifying that and the only two hypercalls that would do this are: the update_va_mapping (which won't work, as during initial bootup only PFNs up to nr_pages are mapped in the guest) or via the populate hypercall. The end result is that the kernel can now boot with the nr_pages without having to subtract the 283999 pages. On a 8GB machine, with various dom0_mem= parameters this is what we get: no dom0_mem -Memory: 6485264k/9435136k available (5817k kernel code, 1136060k absent, 1813812k reserved, 2899k data, 696k init) +Memory: 7619036k/9435136k available (5817k kernel code, 1136060k absent, 680040k reserved, 2899k data, 696k init) dom0_mem=3G -Memory: 2616536k/9435136k available (5817k kernel code, 1136060k absent, 5682540k reserved, 2899k data, 696k init) +Memory: 2703776k/9435136k available (5817k kernel code, 1136060k absent, 5595300k reserved, 2899k data, 696k init) dom0_mem=max:3G -Memory: 2696732k/4281724k available (5817k kernel code, 1136060k absent, 448932k reserved, 2899k data, 696k init) +Memory: 2702204k/4281724k available (5817k kernel code, 1136060k absent, 443460k reserved, 2899k data, 696k init) And the 'xm list' or 'xl list' now reflect what the dom0_mem= argument is. Acked-by: David Vrabel <david.vrabel@xxxxxxxxxx> [v2: Use populate hypercall] [v3: Remove debug printks] [v4: Simplify code] Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> Signed-off-by: Daniel Kiper <daniel.kiper@xxxxxxxxxx> Tested-by: Daniel Kiper <daniel.kiper@xxxxxxxxxx> Tested-by: Konrad Rzeszutek Wilk <konrad.wilk@xxxxxxxxxx> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> --- arch/x86/xen/setup.c | 116 +++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 112 insertions(+), 4 deletions(-) --- a/arch/x86/xen/setup.c +++ b/arch/x86/xen/setup.c @@ -27,7 +27,6 @@ #include <xen/interface/memory.h> #include <xen/interface/physdev.h> #include <xen/features.h> - #include "xen-ops.h" #include "vdso.h" @@ -127,7 +126,105 @@ static unsigned long __init xen_release_ return len; } +static unsigned long __init xen_populate_physmap(unsigned long start, + unsigned long end) +{ + struct xen_memory_reservation reservation = { + .address_bits = 0, + .extent_order = 0, + .domid = DOMID_SELF + }; + unsigned long len = 0; + int ret; + + for (pfn = start; pfn < end; pfn++) { + unsigned long frame; + + /* Make sure pfn does not exists to start with */ + if (pfn_to_mfn(pfn) != INVALID_P2M_ENTRY) + continue; + + frame = pfn; + set_xen_guest_handle(reservation.extent_start, &frame); + reservation.nr_extents = 1; + + ret = HYPERVISOR_memory_op(XENMEM_populate_physmap, &reservation); + WARN(ret != 1, "Failed to populate pfn %lx err=%d\n", pfn, ret); + if (ret == 1) { + if (!early_set_phys_to_machine(pfn, frame)) { + set_xen_guest_handle(reservation.extent_start, &frame); + reservation.nr_extents = 1; + ret = HYPERVISOR_memory_op(XENMEM_decrease_reservation, + &reservation); + break; + } + len++; + } else + break; + } + if (len) + printk(KERN_INFO "Populated %lx-%lx pfn range: %lu pages added\n", + start, end, len); + return len; +} +static unsigned long __init xen_populate_chunk( + const struct e820entry *list, size_t map_size, + unsigned long max_pfn, unsigned long *last_pfn, + unsigned long credits_left) +{ + const struct e820entry *entry; + unsigned int i; + unsigned long done = 0; + unsigned long dest_pfn; + + for (i = 0, entry = list; i < map_size; i++, entry++) { + unsigned long credits = credits_left; + unsigned long s_pfn; + unsigned long e_pfn; + unsigned long pfns; + long capacity; + + if (credits <= 0) + break; + + if (entry->type != E820_RAM) + continue; + + e_pfn = PFN_UP(entry->addr + entry->size); + + /* We only care about E820 after the xen_start_info->nr_pages */ + if (e_pfn <= max_pfn) + continue; + + s_pfn = PFN_DOWN(entry->addr); + /* If the E820 falls within the nr_pages, we want to start + * at the nr_pages PFN. + * If that would mean going past the E820 entry, skip it + */ + if (s_pfn <= max_pfn) { + capacity = e_pfn - max_pfn; + dest_pfn = max_pfn; + } else { + /* last_pfn MUST be within E820_RAM regions */ + if (*last_pfn && e_pfn >= *last_pfn) + s_pfn = *last_pfn; + capacity = e_pfn - s_pfn; + dest_pfn = s_pfn; + } + /* If we had filled this E820_RAM entry, go to the next one. */ + if (capacity <= 0) + continue; + + if (credits > capacity) + credits = capacity; + pfns = xen_populate_physmap(dest_pfn, dest_pfn + credits); + done += pfns; + credits_left -= pfns; + *last_pfn = (dest_pfn + pfns); + } + return done; +} static unsigned long __init xen_set_identity_and_release( const struct e820entry *list, size_t map_size, unsigned long nr_pages) { @@ -150,7 +247,6 @@ static unsigned long __init xen_set_iden */ for (i = 0, entry = list; i < map_size; i++, entry++) { phys_addr_t end = entry->addr + entry->size; - if (entry->type == E820_RAM || i == map_size - 1) { unsigned long start_pfn = PFN_DOWN(start); unsigned long end_pfn = PFN_UP(end); @@ -236,7 +332,9 @@ char * __init xen_memory_setup(void) int rc; struct xen_memory_map memmap; unsigned long max_pages; + unsigned long last_pfn = 0; unsigned long extra_pages = 0; + unsigned long populated; int i; int op; @@ -287,9 +385,20 @@ char * __init xen_memory_setup(void) */ xen_released_pages = xen_set_identity_and_release( map, memmap.nr_entries, max_pfn); - extra_pages += xen_released_pages; /* + * Populate back the non-RAM pages and E820 gaps that had been + * released. */ + populated = xen_populate_chunk(map, memmap.nr_entries, + max_pfn, &last_pfn, xen_released_pages); + + extra_pages += (xen_released_pages - populated); + + if (last_pfn > max_pfn) { + max_pfn = min(MAX_DOMAIN_PAGES, last_pfn); + mem_end = PFN_PHYS(max_pfn); + } + /* * Clamp the amount of extra memory to a EXTRA_MEM_RATIO * factor the base size. On non-highmem systems, the base * size is the full initial memory allocation; on highmem it @@ -302,7 +411,6 @@ char * __init xen_memory_setup(void) */ extra_pages = min(EXTRA_MEM_RATIO * min(max_pfn, PFN_DOWN(MAXMEM)), extra_pages); - i = 0; while (i < memmap.nr_entries) { u64 addr = map[i].addr; -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html