On 2022/7/6 23:18, guanghui.fgh wrote:
Thanks.
On 2022/7/6 21:54, Mike Rapoport wrote:
On Wed, Jul 06, 2022 at 11:04:24AM +0100, Catalin Marinas wrote:
On Tue, Jul 05, 2022 at 11:45:40PM +0300, Mike Rapoport wrote:
On Tue, Jul 05, 2022 at 06:05:01PM +0100, Catalin Marinas wrote:
On Tue, Jul 05, 2022 at 06:57:53PM +0300, Mike Rapoport wrote:
On Tue, Jul 05, 2022 at 04:34:09PM +0100, Catalin Marinas wrote:
On Tue, Jul 05, 2022 at 06:02:02PM +0300, Mike Rapoport wrote:
+void __init remap_crashkernel(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+ phys_addr_t start, end, size;
+ phys_addr_t aligned_start, aligned_end;
+
+ if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
+ return;
+
+ if (!crashk_res.end)
+ return;
+
+ start = crashk_res.start & PAGE_MASK;
+ end = PAGE_ALIGN(crashk_res.end);
+
+ aligned_start = ALIGN_DOWN(crashk_res.start, PUD_SIZE);
+ aligned_end = ALIGN(end, PUD_SIZE);
+
+ /* Clear PUDs containing crash kernel memory */
+ unmap_hotplug_range(__phys_to_virt(aligned_start),
+ __phys_to_virt(aligned_end), false, NULL);
What I don't understand is what happens if there's valid kernel data
between aligned_start and crashk_res.start (or the other end of the
range).
Data shouldn't go anywhere :)
There is
+ /* map area from PUD start to start of crash kernel with large pages */
+ size = start - aligned_start;
+ __create_pgd_mapping(swapper_pg_dir, aligned_start,
+ __phys_to_virt(aligned_start),
+ size, PAGE_KERNEL, early_pgtable_alloc, 0);
and
+ /* map area from end of crash kernel to PUD end with large pages */
+ size = aligned_end - end;
+ __create_pgd_mapping(swapper_pg_dir, end, __phys_to_virt(end),
+ size, PAGE_KERNEL, early_pgtable_alloc, 0);
after the unmap, so after we tear down a part of a linear map we
immediately recreate it, just with a different page size.
This all happens before SMP, so there is no concurrency at that
point.
That brief period of unmap worries me. The kernel text, data and stack
are all in the vmalloc space, but any other (memblock) allocation made up
to this point may be in the unmapped range before and after the
crashkernel reservation. Interrupts are off, so I think the only
allocation, and potential access, that may go into this range is the page
table itself. But it looks fragile to me.
I agree there is a chance there will be an allocation from the unmapped
range.
We can make sure this won't happen, though. We can cap the memblock
allocations with memblock_set_current_limit(aligned_end) or
memblock_reserve(aligned_start, aligned_end) until the mappings are
restored.
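
For illustration only, the ordering could look roughly like this inside
the remap_crashkernel() quoted above (a sketch, not part of the patch:
memblock_reserve()/memblock_phys_free() take a base and a size, and since
the crash kernel range itself is already reserved, only the head and tail
margins need pinning):

	/*
	 * Sketch only: pin the head and tail margins of the PUD-aligned
	 * span in memblock before the unmap, so that early_pgtable_alloc()
	 * cannot hand out pages from the temporarily unmapped range, then
	 * release them once the new mappings are in place.
	 */
	memblock_reserve(aligned_start, start - aligned_start);
	memblock_reserve(end, aligned_end - end);

	unmap_hotplug_range(__phys_to_virt(aligned_start),
			    __phys_to_virt(aligned_end), false, NULL);

	/* ... the __create_pgd_mapping() calls from the patch above ... */

	memblock_phys_free(aligned_start, start - aligned_start);
	memblock_phys_free(end, aligned_end - end);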
We can reserve the region just before unmapping to avoid new allocations
for the page tables but we can't do much about pages already allocated
prior to calling remap_crashkernel().
Right, this was bothering me too after I re-read your previous email.
One thing I can think of is to only remap the crash kernel memory if it is
a part of an allocation that exactly fits into one or more PUDs.
Say, in reserve_crashkernel() we try the memblock_phys_alloc() with
PUD_SIZE as alignment and size rounded up to PUD_SIZE. If this allocation
succeeds, we remap the entire area, which now contains only memory
allocated in reserve_crashkernel(), and free the extra memory after
remapping is done.
If the large allocation fails, we fall back to the original size and
alignment and don't allow unmapping crash kernel memory in
arch_kexec_protect_crashkres().
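
For illustration, that allocation path in reserve_crashkernel() might look
roughly like the following (a sketch under assumptions: crash_size and
crash_base mirror the existing locals, SZ_2M stands in for the current
alignment, and crash_rounded_size / crash_mem_unmappable are made-up
names; the real code would also keep its placement/limit constraints):

	phys_addr_t crash_rounded_size = ALIGN(crash_size, PUD_SIZE);

	/* Try a PUD-aligned, PUD-sized allocation first. */
	crash_base = memblock_phys_alloc(crash_rounded_size, PUD_SIZE);
	if (crash_base) {
		/*
		 * The covering PUD span belongs entirely to this
		 * allocation, so it may later be unmapped and remapped
		 * with base pages; the excess tail
		 * [crash_base + crash_size, crash_base + crash_rounded_size)
		 * is freed back to memblock once the remapping is done.
		 */
		crash_mem_unmappable = true;
	} else {
		/*
		 * Fall back to the original size and alignment; keep the
		 * block mapping and never unmap it in
		 * arch_kexec_protect_crashkres().
		 */
		crash_base = memblock_phys_alloc(crash_size, SZ_2M);
		crash_mem_unmappable = false;
	}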
--
Catalin
Thanks.
There is a new method. I think we should use the v3 patch (it is similar,
but needs some changes):
1. We can walk the crashkernel block/section page table while keeping the
original block/section mapping valid: rebuild a pte-level mapping for the
crashkernel memory, and rebuild the left and right margin memory (which
shares the same block/section mapping but lies outside the crashkernel
memory) with block/section mappings.
2. Then 'replace' the original block/section mapping with the newly built
mapping, entry by entry (a sketch follows below).
With this method, all memory mappings stay valid the whole time.
3. The v3 patch link (it needs some changes):
https://lore.kernel.org/linux-mm/6dc308db-3685-4df5-506a-71f9e3794ec8@xxxxxxxxxxxxxxxxx/T/
Namely, while the new page mapping for the crashkernel memory is being
rebuilt, the original mapping is not changed. Only when the new mapping is
ready do we replace the old mapping with it, so all memory mappings remain
valid at all times.
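
For illustration, the per-entry replacement in point 2 might look roughly
like this for a single PMD-level block (a sketch, not the v3 patch:
split_pmd_block() and its prot argument are made-up names, and PUD-level
blocks plus the exact TLB maintenance policy would need the same treatment
one level up):

	/*
	 * Build a complete pte-level table for the range covered by one
	 * PMD block mapping while the block entry is still live, then
	 * install the new table with a single pmd write and invalidate
	 * the old entry's TLB entries.
	 */
	static void __init split_pmd_block(pmd_t *pmdp, unsigned long addr,
					   pgprot_t prot)
	{
		phys_addr_t pa = __pmd_to_phys(READ_ONCE(*pmdp));
		phys_addr_t pte_phys = early_pgtable_alloc(PAGE_SHIFT);
		pte_t *ptep = (pte_t *)__va(pte_phys);
		int i;

		/* Populate the new table; the old block mapping stays valid. */
		for (i = 0; i < PTRS_PER_PTE; i++)
			set_pte(ptep + i,
				pfn_pte(__phys_to_pfn(pa) + i, prot));

		/* Replace the block entry with the new table entry. */
		__pmd_populate(pmdp, pte_phys, PMD_TYPE_TABLE);
		flush_tlb_kernel_range(addr, addr + PMD_SIZE);
	}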