On Wed, Jun 02, 2021 at 04:51:41PM +0100, Russell King (Oracle) wrote:
> On Wed, Jun 02, 2021 at 04:54:17PM +0300, Mike Rapoport wrote:
> > On Wed, Jun 02, 2021 at 11:15:21AM +0100, Russell King (Oracle) wrote:
> > > On Wed, Jun 02, 2021 at 11:33:10AM +0300, Mike Rapoport wrote:
> > > > On Tue, Jun 01, 2021 at 02:54:15PM +0100, Russell King (Oracle) wrote:
> > > > > If I look at one of my kernels:
> > > > >
> > > > > c0008000 T _text
> > > > > c0b5b000 R __end_rodata
> > > > > ... exception and unwind tables live here ...
> > > > > c0c00000 T __init_begin
> > > > > c0e00000 D _sdata
> > > > > c0e68870 D _edata
> > > > > c0e68870 B __bss_start
> > > > > c0e995d4 B __bss_stop
> > > > > c0e995d4 B _end
> > > > >
> > > > > So the original covers _text..__init_begin-1 which includes the
> > > > > exception and unwind tables. Your version above omits these, which
> > > > > leaves them exposed.
> > > >
> > > > Right, this needs to be fixed. Is there any reason the exception and unwind
> > > > tables cannot be placed between _sdata and _edata?
> > > >
> > > > It seems to me that they were left outside for purely historical reasons.
> > > > Commit ee951c630c5c ("ARM: 7568/1: Sort exception table at compile time")
> > > > moved the exception tables out of .data section before _sdata existed.
> > > > Commit 14c4a533e099 ("ARM: 8583/1: mm: fix location of _etext") moved
> > > > _etext before the unwind tables and didn't bother to put them into data or
> > > > rodata areas.
> > >
> > > You can not assume that all sections will be between these symbols. This
> > > isn't specific to 32-bit ARM. If you look at x86's vmlinux.lds.in, you
> > > will see that BUG_TABLE and ORC_UNWIND_TABLE are after _edata, along
> > > with many other undiscarded sections before __bss_start.
> >
> > But if you look at x86's setup_arch() all these never make it to the
> > resource tree. So there are holes in /proc/iomem between the kernel
> > resources.
>
> Also true.
> However, my point was to counter your claim that these
> sections should be part of the .text/.data/.rodata etc sections in the
> output vmlinux.
>
> There is, however, a more important point. The __ex_table section
> must exist and be separate from the .text/.data/.rodata sections in
> the output ELF file, as sorttable (the exception table sorter) relies
> on this to be able to find the table and sort it.
>
> So, it isn't entirely "for historical reasons" as you said two messages
> ago.

Back then when __ex_table was moved from .data section, _sdata and _edata
were part of the .data section. Today they are not. So something like the
patch below will ensure for instance that __ex_table would be a part of
"Kernel data" in /proc/iomem without moving it to the .data section:

diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
index f7f4620d59c3..2991feceab31 100644
--- a/arch/arm/kernel/vmlinux.lds.S
+++ b/arch/arm/kernel/vmlinux.lds.S
@@ -72,13 +72,6 @@ SECTIONS
 
 	RO_DATA(PAGE_SIZE)
 
-	. = ALIGN(4);
-	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
-		__start___ex_table = .;
-		ARM_MMU_KEEP(*(__ex_table))
-		__stop___ex_table = .;
-	}
-
 #ifdef CONFIG_ARM_UNWIND
 	ARM_UNWIND_SECTIONS
 #endif
@@ -143,6 +136,14 @@ SECTIONS
 	__init_end = .;
 
 	_sdata = .;
+
+	. = ALIGN(4);
+	__ex_table : AT(ADDR(__ex_table) - LOAD_OFFSET) {
+		__start___ex_table = .;
+		ARM_MMU_KEEP(*(__ex_table))
+		__stop___ex_table = .;
+	}
+
+
 	RW_DATA(L1_CACHE_BYTES, PAGE_SIZE, THREAD_SIZE)
 	_edata = .;
 
> Now, bear in mind that /proc/iomem is a user API, one which userspace
> depends on. If we start going around making /proc/iomem report stuff
> like kernel boot time reservations as "reserved" memory, we will end up
> breaking the kexec tooling on some platforms. For example, kexec
> tooling for 32-bit ARM parses /proc/iomem, looking for "System RAM",
> "System RAM (boot alias)" and "reserved" regions.
>
> So, I think changes to make this "more consistent" come with high
> risk.
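
For illustration only, the kind of /proc/iomem parsing described in the
quoted paragraph can be sketched roughly as below. This is not the
kexec-tools code (which is written in C); the function name parse_iomem,
the case-insensitive name matching and the sample text are assumptions
made for this sketch, and the sample addresses are hypothetical.

```python
import re

# One /proc/iomem line: optional indentation, then "start-end : name".
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")

# Region names mentioned in the quoted text. Matching is done
# case-insensitively purely as a simplification of this sketch.
WANTED = {"system ram", "system ram (boot alias)", "reserved"}


def parse_iomem(text):
    """Return (depth, start, end, name) tuples for the wanted regions.

    depth is the nesting level, assuming two spaces of indentation
    per level in the input text.
    """
    regions = []
    for line in text.splitlines():
        m = IOMEM_LINE.match(line)
        if not m:
            continue  # skip anything that is not a resource line
        indent, start, end, name = m.groups()
        name = name.strip()
        if name.lower() in WANTED:
            regions.append(
                (len(indent) // 2, int(start, 16), int(end, 16), name)
            )
    return regions


# Hypothetical sample, not taken from a real machine.
SAMPLE = """\
60000000-7fffffff : System RAM
  60008000-60bfffff : Kernel code
  60e00000-60e995d3 : Kernel data
  7f000000-7fffffff : reserved
80000000-8000ffff : serial
"""

if __name__ == "__main__":
    for depth, start, end, name in parse_iomem(SAMPLE):
        print(f"{'  ' * depth}{start:08x}-{end:08x} {name}")
```

A consumer written this way depends on the exact region names and on
which ranges are listed, so any change in what is reported under those
names is visible to it.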

I agree there is a risk, but I don't think it's high. It does not look like
the minor changes in "reserved" reporting in /proc/iomem will break kexec
tooling. In any case, the amount of reserved and free memory depends on the
particular system, kernel version, configuration and command line.

I have no intention of reporting kernel boot time reservations in
/proc/iomem on architectures that do not report them there today, although
this also does not seem like a significant factor.

On the other hand, making /proc/iomem reporting consistent among
architectures will make it possible to reduce the complexity of both the
kernel and kexec tools in the long run.

-- 
Sincerely yours,
Mike.