Hi Baoquan, On Tue, Oct 30, 2018 at 7:37 AM Baoquan He <bhe@xxxxxxxxxx> wrote: > > On 10/29/18 at 04:07pm, Bhupesh Sharma wrote: > > I am sorry, I understand that the commit log is a bit long and > > Yes, it's too long. Please summarize well so that it can save reviewers' > time. I have tried to include all the points in the log which might be relevant to folks (there are user-space utility maintainers in Cc as well) who might not be aware of complete background on this (especially the kernel side of the things), while I am discussing the user-space changes with them :) > > probably this part > > is not easy to infer. Currently, I see that the 'makedumpfile' utility > > is broken with newer kernels > > (I tested on 4.19-rc8+) as we KCORE_REMAP was added to recent kernels > > thus leading to an additional section in kcore. > > [see <http://lists.infradead.org/pipermail/kexec/2018-October/021769.html> > > for details]. > > Why it's broken? Have you investigated and figured out why it's broken? > If fix, what patch will it look like? Does the patch prove it's not > worth using the current way? > > Have you thought about this in advance? Or still like before, you said > on arm64 you found different boards have different behaviour, then > makedumpfile maintainer Kazu said he investigated and found it may be > caused by KALSR. This time, for this KCORE_REMAP adding, can you help to > investigate further and give an answer to the issue you found and > raised? Ofcourse, the patchset which added vmcoreinfo into kcore was discussed and it was agreed that this was a better approach to move forward and hence accepted in mainline. Regarding the makedumpfile issue, I have already provided a detailed reply to Kazu (you are Cc'ed on the thread) and also proposed a makedumpfile approach which reads the 'page_offset_base' value from kcore (using the kernel interface provided by this patch), [on which you are Cc'ed as well]: [1]. https://www.spinics.net/lists/kexec/msg21717.html [2]. https://www.spinics.net/lists/kexec/msg21722.html Again I think we are discussing on a wrong tangent here. The idea is not limited to only makedumpfile - it affects other user-space utilities (like crash) which are used for live debugging a running kernel. > > The details of the makedumpfile utility can be seen via the man page > > [MAKEDUMPFILE(8)], > > but in short it tries to make a small DUMPFILE by compressing dump > > data or by excluding > > unnecessary pages for analysis, or both. > > > > However the bigger problem is how we export machine specific details > > from kernel-space > > to user-land in a standardized way. As I mentioned in brief in the git > > log, I was seeing > > issues when I upgrade kernels or try to bring up user-space utilities > > on newer hardware, > > as currently we use different (and often flaky approaches) to > > calculate machine specific details in > > user-space code as there used to be lack of a clear ABI between the > > kernel and user-space on how > > machine specific details would be shared. > > > > Later on, kernel commit 23c85094fe1895caefdd came, which adds > > vmcoreinfo to 'kcore', > > as an arch agnostic approach to unify the differences existing in > > exporting kernel space information > > to the user-space code and James suggested that I use the same for > > user-space purposes to fix > > the issues I was observing. > > > > > Sorry I didn't get what problem this patch is trying to fix from the > > > patch log. > > > > So, here since the 'page_offset_base' variable (which holds the start > > of direct mapping of all physical > > memory) is not exported by the x86_64 kernel to the user-space via a > > standard interface, we resort > > to calculating the same via reading PT_LOADs in user-space (as an > > example from the makedumpfile > > implementation ). Now this implementation is usually different across > > user-space utilities. > > > > Also, if the PT_LOAD ordering changes (as we saw with the newer > > kernels), this approach will need > > fixing to calculate the addresses. In addition, we normally need > > 'page_offset_base' value in user-space (and retrieve it via > > vmlinux file in another user case from the same makedumpfile code) for > > calculating the start of direct mapping of all physical > > memory specifically for KASLR boot cases. > > > > Instead, if we can export 'page_offset_base' via vmcoreinfo, we can > > easily use the same > > for live-debugging a running kernel via user-space utilities, which > > can benefit by reading this value > > from the vmcoreinfo note inside kcore directly without relying on other methods. > > We have got a method, what's wrong with that? Only KCORE_REMAP adding, > again? if fix, what is the defect? Where's patch, analysis, only one > sentence to say KCORE_REMAP caused that? Please see above, this is not limited to one use-case of makedumpfile tool. Also, it extends to other user-space utilities (like crash) as well. See, why would we want to have every user-space utility implement a different mechanism for solving the same problem, right? If a standard interface is available from the kernel side we better use the same across all user-land. Plus, if this interface (vmcoreinfo in kcore) is available from the kernel side for all archs (like it is currently), it solves another set of problem in the user-space tools, i.e. we don't need to keep different arch specific implementations across different user-space tool (see [1] above for e.g., both x86_64 and arm64 makedumpfile code bases use an almost similar approach now). I understand that such changes are always obstructive, but hopefully it saves us the effort to manage differences across user-land in the longer run. So, via this patch I propose that the kernel export 'page_offset_base' variable via vmcoreinfo and user-space uses its uniformly to determine the start of direct mapping of all physical memory. Hope this clarifies the background. Regads, Bhupesh > > > > The x86_64 kernel code ('arch/x86/kernel/head64.c'), already sets the same as: > > unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4; > > > > and also uses the same to indicate the base of KASLR regions on x86_64: > > static __initdata struct kaslr_memory_region { > > unsigned long *base; > > unsigned long size_tb; > > } kaslr_regions[] = { > > { &page_offset_base, 0 }, > > > > so it can be used for both the above purposes across user-space utilities. > > > > Hope this explains the intention behind this patch. > > > > Thanks, > > Bhupesh > > > > > About this, I have replied to you in > > > lkml.kernel.org/r/20181025063446.GD2120@MiWiFi-R3L-srv > > > You might miss it. > > > > > > About this exporting, I ever posted patch to upstream and we have had > > > discussion, please check > > > https://lore.kernel.org/patchwork/patch/723472/ > > > > > > In makedumpfile and crash, we have had a clear method to analyze and > > > deduce it from kcore or vmcore. > > > > > > Thanks > > > Baoquan > > > > > > On 10/27/18 at 04:13am, Bhupesh Sharma wrote: > > > > Since commit 23c85094fe1895caefdd > > > > ["proc/kcore: add vmcoreinfo note to /proc/kcore"]), '/proc/kcore' > > > > contains a new PT_NOTE which carries the VMCOREINFO information. > > > > > > > > If the same is available, one can use it in user-land to > > > > retrieve machine specific symbols or strings being appended to the > > > > vmcoreinfo even for live-debugging of the primary kernel as a > > > > standard interface exposed by kernel for sharing machine specific > > > > details with the user-land. > > > > > > > > In the past I had a discussion with James, where he suggested this > > > > approach (please see [0]) and I really liked the idea. Since then I > > > > have been working on unifying the implementations of > > > > (atleast) the commonly used user-space utilities that provide > > > > live-debugging capabilities (tools like 'makedumpfile' and > > > > 'crash-utility', see [1] for details of these tools). > > > > > > > > For the same, when live debugging on x86_64 machines, user-space > > > > tools currently rely on different mechanisms to determine > > > > the 'page_offset_base' value (i.e. start of direct mapping of all > > > > physical memory). One of the approach used by 'makedumpfile' > > > > user-space tool for e.g. is to calculate the same from the last > > > > PT_LOAD available in '/proc/kcore', which can be flaky as and when > > > > new sections (for e.g. KCORE_REMAP which was added > > > > to recent kernels) are added to kcore. > > > > > > > > For other architectures like arm64, I have already proposed using > > > > the vmcoreinfo note (in '/proc/kcore') in the user-space utilities to > > > > determine machine specific details like VA_BITS, PAGE_OFFSET, > > > > kasrl_offset() (see [2] for details), for which different user-space > > > > tools earlier used different (and at times flaky) approaches like: > > > > > > > > - Reading kernel CONFIGs from user-space and determining CONFIG values > > > > like VA_BITS from there. > > > > - Reading symbols from '/proc/kallsyms' and determining their values > > > > via '/dev/mem' interface. > > > > - Reading symbols from 'vmlinux' and determing their values from > > > > reading memory. > > > > > > > > This patch allows appending 'page_offset_base' for x86_64 platforms > > > > to vmcoreinfo, so that user-space tools can use the same as a standard > > > > interface to determine the start of direct mapping of all physical > > > > memory. > > > > > > > > Testing: > > > > ------- > > > > - I tested this patch (rebased on 'linux-next') on a x86_64 machine > > > > using the modified 'makedumpfile' user-space code (see [3] for my > > > > github tree which contains the same) for determining how many pages > > > > are dumpable when different dump_level is specified (which is > > > > one use-case of live-debugging via 'makedumpfile'). > > > > - I tested both the KASLR and non-KASLR boot cases with this patch. > > > > - Here is one sample log (for KASLR boot case) on my x86_64 machine: > > > > > > > > < snip..> > > > > The kernel doesn't support mmap(),read() will be used instead. > > > > > > > > TYPE PAGES EXCLUDABLE DESCRIPTION > > > > ---------------------------------------------------------------------- > > > > ZERO 21299 yes Pages filled > > > > with zero > > > > NON_PRI_CACHE 91785 yes Cache > > > > pages without private flag > > > > PRI_CACHE 1 yes Cache pages with > > > > private flag > > > > USER 14057 yes User process > > > > pages > > > > FREE 740346 yes Free pages > > > > KERN_DATA 58152 no Dumpable kernel > > > > data > > > > > > > > page size: 4096 > > > > Total pages on system: 925640 > > > > Total size on system: 3791421440 Byte > > > > > > > > I understand that there might be some reservations about exporting > > > > such machine-specific details in the vmcoreinfo, but to unify > > > > the implementations across user-land and archs, perhaps this would be > > > > good starting point to start a discussion. > > > > > > > > [0]. https://www.mail-archive.com/kexec@xxxxxxxxxxxxxxxxxxx/msg20300.html > > > > [1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8) > > > > [2]. https://www.spinics.net/lists/kexec/msg21608.html > > > > http://lists.infradead.org/pipermail/kexec/2018-October/021725.html > > > > [3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1 > > > > > > > > Cc: Boris Petkov <bp@xxxxxxxxx> > > > > Cc: Baoquan He <bhe@xxxxxxxxxx> > > > > Cc: Ingo Molnar <mingo@xxxxxxxxxx> > > > > Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx> > > > > Cc: Kazuhito Hagio <k-hagio@xxxxxxxxxxxxx> > > > > Cc: Dave Anderson <anderson@xxxxxxxxxx> > > > > Cc: James Morse <james.morse@xxxxxxx> > > > > Cc: Omar Sandoval <osandov@xxxxxx> > > > > Cc: x86@xxxxxxxxxx > > > > Cc: kexec@xxxxxxxxxxxxxxxxxxx > > > > Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx > > > > Signed-off-by: Bhupesh Sharma <bhsharma@xxxxxxxxxx> > > > > --- > > > > arch/x86/kernel/machine_kexec_64.c | 1 + > > > > 1 file changed, 1 insertion(+) > > > > > > > > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c > > > > index 4c8acdfdc5a7..834ccefef867 100644 > > > > --- a/arch/x86/kernel/machine_kexec_64.c > > > > +++ b/arch/x86/kernel/machine_kexec_64.c > > > > @@ -356,6 +356,7 @@ void arch_crash_save_vmcoreinfo(void) > > > > VMCOREINFO_SYMBOL(init_top_pgt); > > > > vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n", > > > > pgtable_l5_enabled()); > > > > + VMCOREINFO_NUMBER(page_offset_base); > > > > > > > > #ifdef CONFIG_NUMA > > > > VMCOREINFO_SYMBOL(node_data); > > > > -- > > > > 2.7.4 > > > > _______________________________________________ kexec mailing list kexec@xxxxxxxxxxxxxxxxxxx http://lists.infradead.org/mailman/listinfo/kexec