Ard, Bhupesh, Thank you for your comments. On Tue, Mar 20, 2018 at 01:18:34AM +0530, Bhupesh Sharma wrote: > On 03/14/2018 01:59 PM, AKASHI Takahiro wrote: > >In the last couples of months, there were some problems reported [1],[2] > >around arm64 kexec/kdump. Where those phenomenon look different, > >the root cause would be that kexec/kdump doesn't take into account > >crucial "reserved" regions of system memory and unintentionally corrupts > >them. > > > >Given that kexec-tools looks for all the information by seeking the file, > >/proc/iomem, the first step to address said problems is to expand this file's > >format so that it will have enough information about system memory and > >its usage. > > > >Attached is my experimental code: With this patch applied, /proc/iomem sees > >something like the below: > > > >(format A) > >40000000-5871ffff : System RAM > > 40080000-40f1ffff : Kernel code > > 41040000-411e8fff : Kernel data > > 54400000-583fffff : Crash kernel > > 58590000-585effff : EFI Resources > > 58700000-5871ffff : EFI Resources > >58720000-58b5ffff : System RAM > > 58720000-58b5ffff : EFI Resources > >58b60000-5be3ffff : System RAM > > 58b61018-58b61947 : EFI Memory Map > > 59a7b118-59a7b667 : EFI Configuration Tables > >5be40000-5becffff : System RAM <== (A-1) > > 5be40000-5becffff : EFI Resources > >5bed0000-5bedffff : System RAM > >5bee0000-5bffffff : System RAM > > 5bee0000-5bffffff : EFI Resources > >5c000000-5fffffff : System RAM > >8000000000-ffffffffff : PCI Bus 0000:00 > > > >Meanwhile, the workaround I suggested in [3] gave us a simpler view: > > > >(format B) > >40000000-5871ffff : System RAM > > 40080000-40f1ffff : Kernel code > > 41040000-411e9fff : Kernel data > > 54400000-583fffff : Crash kernel > > 58590000-585effff : reserved > > 58700000-5871ffff : reserved > >58720000-58b5ffff : reserved > >58b60000-5be3ffff : System RAM > > 58b61000-58b61fff : reserved > > 59a7b318-59a7b867 : reserved > >5be40000-5becffff : reserved <== (B-1) > >5bed0000-5bedffff : System RAM > >5bee0000-5bffffff : reserved > >5c000000-5fffffff : System RAM > > 5ec00000-5edfffff : reserved > >8000000000-ffffffffff : PCI Bus 0000:00 > > > >Here all the regions to be protected are named just "reserved" whether > >they are NOMAP regions or simply-memblock_reserve'd. > > Personally, I like this format over the other two proposed. > > However, I would suggest adding "reserved" regions as reserved (NOMAP) > regions and reserved (MAP'ed) regions (or a similar meaning wording for the > same). Okay. > >They are not very > >useful for anything but kexec/kdump which knows what they mean. > > I disagree. I have found the naming does help in debugging issues > in the crashkernel itself which cause an early panic in the crashkernel. > > Knowing the type of entry in '/proc/iomem' really helps in understanding > what the kexec-tools might have picked up and sent as a part of the > "linux,usable-memory" range property to the crashkernel. You're still talking about kexec/kdump. My point was that "reserved" doesn't convey lots of meanings to other features/applications. Anyway, nobody seems to agree to giving specific names to those regions. > >Alternatively, we may want to give them more specific names, based on > >related efi memory map descriptors and else, that will characterize > >their contents: > > > >(format C) > >40000000-5871ffff : System RAM > > 40080000-40f1ffff : Kernel code > > 41040000-411e9fff : Kernel data > > 54400000-583fffff : Crash kernel > > 58590000-585effff : ACPI Reclaim Memory > > 58700000-5871ffff : ACPI Reclaim Memory > >58720000-58b5ffff : System RAM > > 58720000-5878ffff : Runtime Data > > 58790000-587dffff : Runtime Code > > 587e0000-5882ffff : Runtime Data > > 58830000-5887ffff : Runtime Code > > 58880000-588cffff : Runtime Data > > 588d0000-5891ffff : Runtime Code > > 58920000-5896ffff : Runtime Data > > 58970000-589bffff : Runtime Code > > 589c0000-58a5ffff : Runtime Data > > 58a60000-58abffff : Runtime Code > > 58ac0000-58b0ffff : Runtime Data > > 58b10000-58b5ffff : Runtime Code > >58b60000-5be3ffff : System RAM > > 58b61000-58b61fff : EFI Memory Map > > 59a7b118-59a7b667 : EFI Memory Attributes Table > >5be40000-5becffff : System RAM > > 5be40000-5becffff : Runtime Code > >5bed0000-5bedffff : System RAM > >5bee0000-5bffffff : System RAM > > 5bee0000-5bffffff : Runtime Data > >5c000000-5fffffff : System RAM > >8000000000-ffffffffff : PCI Bus 0000:00 > > > >I once created a patch for this format, but it looks quite noisy and > >names are a sort of mixture of memory attributes( ACPI Reclaim memory, > >Conventional Memory, Persistent Memory etc.) vs. > >function/usages ([Loader|Boot Service|Runtime] Code/Data). > >(As a matter of fact, (C-1) consists of various ACPI tables.) > >Anyhow, they seem not so useful for most of other applications. > > > >Those observations lead to format A, where some entries with the same > >attributes are squeezed into a single entry under a simple name if they > >are neighbouring. > > > > > >So my questions here are: > > > >1. Which format, A, B, or C, is the most appropriate for the moment? > > or any other suggestions? > > > >Currently, there is a inconsistent view between (A) and the mainline's: > >see (A-1) and (B-1). If this is really a matter, I can fix it. > >Kexec-tools can be easily modified to accept both formats, though. > > > > > >2. How should we determine which regions be exported in /proc/iomem? > > > > a. Trust all the memblock_reserve'd regions as my previous patch [3] does. > > > > As I said, it's a kind of "overkill." Some of regions, say fdt, are > > not required to be preserved across kexec. > > > I think we should preserve all the memblock_reserve'd regions. So +1 on this > approach from my side. I believe it might help avoid issues we have seen in > the past with 'kexec-tools' _incorrectly_ determining which regions to pick > from the '/proc/iomem'. As I said in my reply to Ard's comment, I now know *overkill* is not a big issue and I will go for this approach. > If every memblock_reserve'd region is exported in /proc/iomem', its easier > to debug issues in the 'kexec-tools' which might have cause the early > crashkernel to panic and we can exclude primary kernel as a potential > suspect for causing the same. After thinking twice, I've come up with yet another format of /proc/iomem: (format D) 40000000-5fffffff : System RAM 40080000-40f1ffff : Kernel code 41040000-411e9fff : Kernel data 54400000-583fffff : Crash kernel 58590000-585effff : reserved 58700000-5871ffff : reserved 58720000-58b5ffff : reserved (no map) 58b61000-58b61fff : reserved 59a7b118-59a7b667 : reserved 5be40000-5becffff : reserved (no map) 5bee0000-5bffffff : reserved (no map) 5ec00000-5edfffff : reserved 8000000000-ffffffffff : PCI Bus 0000:00 I think that this gives us a simpler & more intuitive view of system ram as all (firmware-)reserved regions as well as NOMAP regions are listed under *one* continuous memory resource of "System RAM" alike. (Please note that there is no change in memblock status.) In addition, I'd like to modify crash dump kernel's memory attributes as well: (format D/kdump) 40000000-5fffffff : System RAM 40000000-543fffff : reserved (no map) 54480000-5531ffff : Kernel code ;; 0x54400000 55440000-555e9fff : Kernel data ;; | "Crash kernel" above 555ea000-555ea274 : reserved ;; 0x583fffff 58400000-5858ffff : reserved (no map) 58590000-585effff : reserved 585f0000-586fffff : reserved (no map) 58700000-5871ffff : reserved 58720000-58b60fff : reserved (no map) 58b61000-58b61fff : reserved 58b62000-59a7afff : reserved (no map) 59a7b118-59a7b667 : reserved 59a7c000-5fffffff : reserved (no map) 8000000000-ffffffffff : PCI Bus 0000:00 Here all the memory regions which belong to primary kernel are actually marked NOMAP instead of being removed from memblock.memory. This view of /proc/iomem looks quite similar to format D and, I hope, it also helps us understand system ram usage on kdump. My only concern is that this format(D,D/kdump) is a bit incompatible with the current implementation, which was introduced by my commit e7cd190385d1 ("arm64: mark reserved memblock regions explicitly in iomem"), but we need some changes, anyway, in order to take into account reserved memory regions. Unless anybody has a strong objection, I will post a kernel patch, as well as kexec-tools', based on this format/approach. Thanks, -Takahiro AKASHI > Regards, > Bhupesh > > > > > b. List regions separately and exhaustively later on at a single point > > as my patch attached does. > > > > I'm afraid of any possibility that some regions may be doubly counted, > > one from efi memory map search and another from other efi/acpi code > > (various type of "tables" in most cases). > > > > c. Expand efi_mem_reserve() with an argument of "resource descriptor" and > > replace memblock_reserve() in efi code wherever necessary so as to > > maintain an export list. > > > > efi_mem_reserve() was first introduced for specific needs at kexec > > on x86, but I believe that its coverage over efi code is far from perfect. > > > >[1] http://lists.infradead.org/pipermail/linux-arm-kernel/2017-November/541831.html > >[2] http://lkml.iu.edu//hypermail/linux/kernel/1802.2/06745.html > >[3] http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04658.html > > http://lkml.iu.edu//hypermail/linux/kernel/1803.0/04659.html > > > >-Takahiro AKASHI > > > >===8<=== > >diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c > >index 30ad2f085d1f..feda5cbdc6bf 100644 > >--- a/arch/arm64/kernel/setup.c > >+++ b/arch/arm64/kernel/setup.c > >@@ -214,13 +214,8 @@ static void __init request_standard_resources(void) > > for_each_memblock(memory, region) { > > res = alloc_bootmem_low(sizeof(*res)); > >- if (memblock_is_nomap(region)) { > >- res->name = "reserved"; > >- res->flags = IORESOURCE_MEM; > >- } else { > >- res->name = "System RAM"; > >- res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; > >- } > >+ res->name = "System RAM"; > >+ res->flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY; > > res->start = __pfn_to_phys(memblock_region_memory_base_pfn(region)); > > res->end = __pfn_to_phys(memblock_region_memory_end_pfn(region)) - 1; > >@@ -239,6 +234,9 @@ static void __init request_standard_resources(void) > > request_resource(res, &crashk_res); > > #endif > > } > >+ > >+ /* Add firmware-reserved memory */ > >+ efi_arch_request_resources(); > > } > > u64 __cpu_logical_map[NR_CPUS] = { [0 ... NR_CPUS-1] = INVALID_HWID }; > >diff --git a/drivers/firmware/efi/arm-init.c b/drivers/firmware/efi/arm-init.c > >index 80d1a885def5..308143e69db4 100644 > >--- a/drivers/firmware/efi/arm-init.c > >+++ b/drivers/firmware/efi/arm-init.c > >@@ -13,8 +13,10 @@ > > #define pr_fmt(fmt) "efi: " fmt > >+#include <linux/bootmem.h> > > #include <linux/efi.h> > > #include <linux/init.h> > >+#include <linux/ioport.h> > > #include <linux/memblock.h> > > #include <linux/mm_types.h> > > #include <linux/of.h> > >@@ -280,3 +282,97 @@ static int __init register_gop_device(void) > > return PTR_ERR_OR_ZERO(pd); > > } > > subsys_initcall(register_gop_device); > >+ > >+static unsigned long __init efi_memattr_to_iores_type(u64 addr) > >+{ > >+ if (efi_mem_attributes(addr) & > >+ (EFI_MEMORY_WB|EFI_MEMORY_WT|EFI_MEMORY_WC)) > >+ return IORESOURCE_SYSTEM_RAM; > >+ else > >+ return IORESOURCE_MEM; > >+} > >+ > >+static unsigned long __init efi_type_to_iores_desc(u64 addr) > >+{ > >+ /* TODO */ > >+ return IORES_DESC_NONE; > >+} > >+ > >+static struct resource * __init _efi_arch_request_resource(u64 start, u64 end, > >+ char *name, struct resource *prev) > >+{ > >+ struct resource *conflict, *res; > >+ > >+ res = alloc_bootmem_low(sizeof(*res)); > >+ > >+ res->start = start; > >+ res->end = end; > >+ res->flags = efi_memattr_to_iores_type(res->start); > >+ res->desc = efi_type_to_iores_desc(res->start); > >+ res->name = name; > >+ > >+ conflict = request_resource_conflict(&iomem_resource, res); > >+ if (conflict) { > >+ if (prev && (prev->parent == conflict) && > >+ ((prev->end + 1) == start)) { > >+ /* merge consecutive regions */ > >+ adjust_resource(prev, prev->start, > >+ end - prev->start + 1); > >+ free_bootmem((unsigned long)res, sizeof(*res)); > >+ res = prev; > >+ } else > >+ insert_resource(conflict, res); > >+ } > >+ > >+ return res; > >+} > >+ > >+/* Kexec expects those resources to be preserved */ > >+void __init efi_arch_request_resources(void) > >+{ > >+ struct resource *res = NULL; > >+ efi_memory_desc_t *md; > >+ u64 paddr, npages, size; > >+ > >+ /* EFI Memory Map */ > >+ /* FIXME */ > >+ _efi_arch_request_resource(efi.memmap.phys_map, > >+ efi.memmap.phys_map > >+ + efi.memmap.desc_size * efi.memmap.nr_map - 1, > >+ "EFI Memory Map", res); > >+ > >+ /* generic EFI Configuration Tables */ > >+ efi_request_config_table_resources(_efi_arch_request_resource); > >+ > >+ /* architecture-specifc Configuration Tables */ > >+ if (screen_info.lfb_size) > >+ _efi_arch_request_resource(screen_info.lfb_base, > >+ screen_info.lfb_base + screen_info.lfb_size - 1, > >+ "EFI Screen Info Table", res); > >+ > >+ /* architecture-specific EFI resources */ > >+ /* FIXME */ > >+ efi_memmap_install(efi.memmap.phys_map, efi.memmap.nr_map); > >+ > >+ res = NULL; > >+ for_each_efi_memory_desc(md) { > >+ paddr = md->phys_addr; > >+ npages = md->num_pages; > >+ > >+ memrange_efi_to_native(&paddr, &npages); > >+ size = npages << PAGE_SHIFT; > >+ > >+ if (is_memory(md)) { > >+ if (!is_usable_memory(md)) > >+ res = _efi_arch_request_resource(paddr, > >+ paddr + size - 1, > >+ "EFI Resources", res); > >+ > >+ if (md->type == EFI_ACPI_RECLAIM_MEMORY) > >+ res = _efi_arch_request_resource(paddr, > >+ paddr + size - 1, > >+ "EFI Resources", res); > >+ } > >+ } > >+ efi_memmap_unmap(); > >+} > >diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > >index cd42f66a7c85..b13c9461278b 100644 > >--- a/drivers/firmware/efi/efi.c > >+++ b/drivers/firmware/efi/efi.c > >@@ -603,6 +603,33 @@ int __init efi_config_init(efi_config_table_type_t *arch_tables) > > return ret; > > } > >+void __init efi_request_config_table_resources(struct resource *(*fn)(u64 start, > >+ u64 end, char *name, struct resource *prev)) > >+{ > >+ struct resource *prev = NULL; > >+ char *name = "EFI Configuration Tables"; > >+ > >+ if (efi.config_table_size) > >+ prev = fn(efi.config_table, > >+ efi.config_table + efi.config_table_size - 1, name, > >+ prev); > >+ > >+ if (efi.mem_attr_table_size) > >+ prev = fn(efi.mem_attr_table, > >+ efi.mem_attr_table + efi.mem_attr_table_size - 1, name, > >+ prev); > >+ > >+ if (efi.esrt_size) > >+ prev = fn(efi.esrt, efi.esrt + efi.esrt_size - 1, name, prev); > >+ > >+ if (efi.tpm_log_size) > >+ prev = fn(efi.tpm_log, efi.tpm_log + efi.tpm_log_size - 1, name, > >+ prev); > >+ > >+ > >+ /* TODO: BGRT */ > >+} > >+ > > #ifdef CONFIG_EFI_VARS_MODULE > > static int __init efi_load_efivars(void) > > { > >diff --git a/drivers/firmware/efi/esrt.c b/drivers/firmware/efi/esrt.c > >index c47e0c6ec00f..61f66c139afb 100644 > >--- a/drivers/firmware/efi/esrt.c > >+++ b/drivers/firmware/efi/esrt.c > >@@ -330,6 +330,7 @@ void __init efi_esrt_init(void) > > esrt_data = (phys_addr_t)efi.esrt; > > esrt_data_size = size; > >+ efi.esrt_size = size; > > end = esrt_data + size; > > pr_info("Reserving ESRT space from %pa to %pa.\n", &esrt_data, &end); > >diff --git a/drivers/firmware/efi/memattr.c b/drivers/firmware/efi/memattr.c > >index 8986757eafaf..dc2c7608793a 100644 > >--- a/drivers/firmware/efi/memattr.c > >+++ b/drivers/firmware/efi/memattr.c > >@@ -42,6 +42,7 @@ int __init efi_memattr_init(void) > > } > > tbl_size = sizeof(*tbl) + tbl->num_entries * tbl->desc_size; > >+ efi.mem_attr_table_size = tbl_size; > > memblock_reserve(efi.mem_attr_table, tbl_size); > > set_bit(EFI_MEM_ATTR, &efi.flags); > >diff --git a/drivers/firmware/efi/tpm.c b/drivers/firmware/efi/tpm.c > >index 0cbeb3d46b18..53cfb12513fa 100644 > >--- a/drivers/firmware/efi/tpm.c > >+++ b/drivers/firmware/efi/tpm.c > >@@ -33,6 +33,7 @@ int __init efi_tpm_eventlog_init(void) > > } > > tbl_size = sizeof(*log_tbl) + log_tbl->size; > >+ efi.tpm_log_size = tbl_size; > > memblock_reserve(efi.tpm_log, tbl_size); > > early_memunmap(log_tbl, sizeof(*log_tbl)); > > return 0; > >diff --git a/include/linux/efi.h b/include/linux/efi.h > >index f5083aa72eae..9c3f8d284b36 100644 > >--- a/include/linux/efi.h > >+++ b/include/linux/efi.h > >@@ -942,11 +942,15 @@ extern struct efi { > > unsigned long fw_vendor; /* fw_vendor */ > > unsigned long runtime; /* runtime table */ > > unsigned long config_table; /* config tables */ > >+ unsigned long config_table_size; > > unsigned long esrt; /* ESRT table */ > >+ unsigned long esrt_size; > > unsigned long properties_table; /* properties table */ > > unsigned long mem_attr_table; /* memory attributes table */ > >+ unsigned long mem_attr_table_size; > > unsigned long rng_seed; /* UEFI firmware random seed */ > > unsigned long tpm_log; /* TPM2 Event Log table */ > >+ unsigned long tpm_log_size; > > efi_get_time_t *get_time; > > efi_set_time_t *set_time; > > efi_get_wakeup_time_t *get_wakeup_time; > >@@ -980,6 +984,9 @@ efi_guid_to_str(efi_guid_t *guid, char *out) > > } > > extern void efi_init (void); > >+extern void efi_arch_request_resources(void); > >+extern void efi_request_config_table_resources(struct resource * > >+ (*fn)(u64 start, u64 end, char *name, struct resource *prev)); > > extern void *efi_get_pal_addr (void); > > extern void efi_map_pal_code (void); > > extern void efi_memmap_walk (efi_freemem_callback_t callback, void *arg); > > > _______________________________________________ kexec mailing list kexec@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/kexec