James, On Sat, Jan 28, 2017 at 02:15:16AM +0900, AKASHI Takahiro wrote: > James, > > On Fri, Jan 27, 2017 at 11:19:32AM +0000, James Morse wrote: > > Hi Akashi, > > > > On 26/01/17 11:28, AKASHI Takahiro wrote: > > > On Wed, Jan 25, 2017 at 05:37:38PM +0000, James Morse wrote: > > >> On 24/01/17 08:49, AKASHI Takahiro wrote: > > >>> To protect the memory reserved for crash dump kernel once after loaded, > > >>> arch_kexec_protect_crashres/unprotect_crashres() are meant to deal with > > >>> permissions of the corresponding kernel mappings. > > >>> > > >>> We also have to > > >>> - put the region in an isolated mapping, and > > >>> - move copying kexec's control_code_page to machine_kexec_prepare() > > >>> so that the region will be completely read-only after loading. > > >> > > >> > > >>> Note that the region must reside in linear mapping and have corresponding > > >>> page structures in order to be potentially freed by shrinking it through > > >>> /sys/kernel/kexec_crash_size. > > >> > > >> Nasty! Presumably you have to build the crash region out of individual page > > >> mappings, > > > > > > This might be an alternative, but > > > > > >> so that they can be returned to the slab-allocator one page at a time, > > >> and still be able to set/clear the valid bits on the remaining chunk. > > >> (I don't see how that happens in this patch) > > > > > > As far as shrinking feature is concerned, I believe, crash_shrink_memory(), > > > which eventually calls free_reserved_page(), will take care of all the things > > > to do. I can see increased number of "MemFree" in /proc/meminfo. > > > > Except for arch specific stuff like reformatting the page tables. Maybe I've > > overlooked the way out this. What happens with this scenario: > > > > We boot with crashkernel=1G on the commandline. > > Memblock_reserve allocates a naturally aligned 1GB block of memory for the crash > > region. > > Your code in __map_memblock() calls __create_pgd_mapping() -> > > alloc_init_pud() which decides use_1G_block() looks like a good idea. > > > > Some time later, the user decides to free half of this region, > > free_reserved_page() does its thing and half of those struct page's now belong > > to the memory allocator. > > > > Now we load a kdump kernel, which causes arch_kexec_protect_crashkres() to be > > called for the 512MB region that was left. > > > > create_mapping_late() needs to split the 1GB mapping it originally made into a > > smaller table, with the first half using PAGE_KERNEL_INVALID, and the second > > half using PAGE_KERNEL. It can't do break-before-make because these pages may be > > in-use by another CPU because we gave them back to the memory allocator. (in the > > worst-possible world, that second half contains our stack!) > > Yeah, this is a horrible case. > Now I understand why we should stick with page_mapping_only option. > > > > > Making this behave more like debug_pagealloc where the region is only built of > > page-size mappings should avoid this. The smallest change to what you have is to > > always pass page_mappings_only for the kdump region. > > > > Ideally we just disable this resize feature for ARM64 and support it with some > > later kernel version, but I can't see a way of doing this without adding Kconfig > > symbols to other architectures. > > > > > > > (Please note that the region is memblock_reserve()'d at boot time.) > > > > And free_reserved_page() does nothing to update memblock, so > > memblock_is_reserved() says these pages are reserved, but in reality they > > are in use by the memory allocator. This doesn't feel right. > > Just FYI, no other architectures take care of this issue. > > (and I don't know whether the memblock is reserved or not may have > any impact after booting.) > > > (Fortunately we can override crash_free_reserved_phys_range() so this can > > probably be fixed) > > > > >> This secretly-unmapped is the sort of thing that breaks hibernate, it blindly > > >> assumes pfn_valid() means it can access the page if it wants to. Setting > > >> PG_Reserved is a quick way to trick it out of doing this, but that would leave > > >> the crash kernel region un-initialised after resume, while kexec_crash_image > > >> still has a value. > > > > > > Ouch, I didn't notice this issue. > > > > > >> I think the best fix for this is to forbid hibernate if kexec_crash_loaded() > > >> arguing these are mutually-exclusive features, and the protect crash-dump > > >> feature exists to prevent things like hibernate corrupting the crash region. > > > > > > This restriction is really painful. > > > Is there any hibernation hook that will be invoked before suspending and > > > after resuming? If so, arch_kexec_unprotect_crashkres()/protect_crashkres() > > > will be able to be called. > > > > Those calls could go in swsusp_arch_suspend() in /arch/arm64/kernel/hibernate.c, > > I will give it a try next week. I took the following test scenario: - load the crash dump kernel - echo shutdown > /sys/power/disk - echo disk > /sys/power/state - power off and on the board - reboot with resume=... - after hibernate done, echo c > /proc/sysrq-trigger - after reboot, check /proc/vmcore and everything looks to work fine. If you think I'd better try more test cases, please let me know. Thanks, -Takahiro AKASHI ===8<=== diff --git a/arch/arm64/kernel/hibernate.c b/arch/arm64/kernel/hibernate.c index fe301cbcb442..111a849333ee 100644 --- a/arch/arm64/kernel/hibernate.c +++ b/arch/arm64/kernel/hibernate.c @@ -16,6 +16,7 @@ */ #define pr_fmt(x) "hibernate: " x #include <linux/cpu.h> +#include <linux/kexec.h> #include <linux/kvm_host.h> #include <linux/mm.h> #include <linux/pm.h> @@ -289,6 +290,12 @@ int swsusp_arch_suspend(void) local_dbg_save(flags); if (__cpu_suspend_enter(&state)) { +#ifdef CONFIG_KEXEC_CORE + /* make the crash dump kernel region mapped */ + if (kexec_crash_image) + arch_kexec_unprotect_crashkres(); +#endif + sleep_cpu = smp_processor_id(); ret = swsusp_save(); } else { @@ -300,6 +307,12 @@ int swsusp_arch_suspend(void) if (el2_reset_needed()) dcache_clean_range(__hyp_idmap_text_start, __hyp_idmap_text_end); +#ifdef CONFIG_KEXEC_CORE + /* make the crash dump kernel region unmapped */ + if (kexec_crash_image) + arch_kexec_protect_crashkres(); +#endif + /* * Tell the hibernation core that we've just restored * the memory