On Mon, Oct 07, 2019 at 08:13:07PM +0000, Kazuhito Hagio wrote:

> > [  518.819690] Original pages              : 0x0000000000000000
> > [  518.828894]   Excluded pages            : 0x0000000003decd15
> > [  518.838635]     Pages filled with zero  : 0x00000000000210ee
> > [  518.849920]     Non-private cache pages : 0x000000000000271a
> > [  518.861218]     Private cache pages     : 0x000000000000da47
> > [  518.872502]     User process data pages : 0x0000000003d6bdc8
> > [  518.883786]     Free pages              : 0x000000000004fcfe
> > [  518.895070]     Hwpoison pages          : 0x0000000000000000
> > [  518.906356]     Offline pages           : 0x0000000000000000
> > [  518.917659]   Remaining pages           : 0xfffffffffc2132eb
> > [  518.927398] Memory Hole                 : 0x0000000004080000
>
> This is the known issue that I wrote above and am looking for a safe fix.
> How does this patch work?

I'll give this a try, and see how it goes for a few days.

> If it looks good, I'll look into its side effects further,
> but might take some time..
>
> And the crashdump seems corrupt:
> >
> Could you show me the output of "readelf -a vmcore"?

See below.

> Does this issue always reproduce?

Not 100% of the time. Sometimes we do get valid dumps from these hosts.
My guess so far is that it has something to do with how much memory
makedumpfile was able to discard with -d31.

Common case seems to be:

ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              CORE (Core file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x0
  Start of program headers:          64 (bytes into file)
  Start of section headers:          0 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         23881
  Size of section headers:           0 (bytes)
  Number of section headers:         0
  Section header string table index: 0

There are no sections in this file.

There are no sections to group in this file.
Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
  ...
  <repeats for thousands of lines>
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0
  NULL           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         0

There is no dynamic section in this file.

There are no relocations in this file.

The decoding of unwind sections for machine type Advanced Micro Devices X86-64 is not currently supported.

Dynamic symbol information is not available for displaying symbols.

No version information found in this file.

There are some other failure cases with non-null data, so maybe there's
>1 bug here. I've not seen an obvious pattern to this.
e.g. https://pastebin.com/2uM4sBCF

I'll put your patch on some of the affected hosts and see if this changes
behaviour in any way.
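For sorting dumps into the "all NULL" case and the "non-null data" cases without scrolling through readelf output, something like the following rough Python sketch might help. It is not part of makedumpfile or any existing tool, and the function name count_phdr_types is made up for illustration. It assumes a little-endian ELF64 file, as the header above reports, and it does not handle the extended numbering (PN_XNUM) used when there are 65535 or more program headers.

```python
import struct
from collections import Counter

# Layouts follow the ELF64 format; "<" selects little-endian.
EHDR = struct.Struct("<16sHHIQQQIHHHHHH")  # ELF64 file header, 64 bytes
PHDR = struct.Struct("<IIQQQQQQ")          # ELF64 program header, 56 bytes
PT_NAMES = {0: "NULL", 1: "LOAD", 4: "NOTE"}


def count_phdr_types(f):
    """Count program header types in a seekable binary ELF64 file object.

    Hypothetical helper: reads only the file header and the program
    header table, so it stays cheap even on a very large vmcore.
    """
    f.seek(0)
    fields = EHDR.unpack(f.read(EHDR.size))
    ident = fields[0]
    e_phoff, e_phentsize, e_phnum = fields[5], fields[9], fields[10]

    # e_ident[4] == 2 means ELFCLASS64, e_ident[5] == 1 means little endian.
    if ident[:4] != b"\x7fELF" or ident[4] != 2 or ident[5] != 1:
        raise ValueError("not a little-endian ELF64 file")
    if e_phentsize != PHDR.size:
        raise ValueError("unexpected program header entry size")

    f.seek(e_phoff)
    table = f.read(e_phnum * e_phentsize)

    counts = Counter()
    # iter_unpack raises struct.error if the table is truncated.
    for p_type, *_ in PHDR.iter_unpack(table):
        counts[PT_NAMES.get(p_type, hex(p_type))] += 1
    return counts
```

Calling count_phdr_types(open("vmcore", "rb")) returns a Counter keyed by type name. A dump like the one above would be expected to show NULL for most or all of its entries, whereas a healthy vmcore normally has a NOTE segment and a number of LOAD segments.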
thanks,

Dave

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec