Hi Dave,

I think I've fixed the ELF issues which I could reproduce:
 - wrong statistics
 - e_phnum overflow

If you still see any problems with the latest makedumpfile,
please let me know.

Thanks,
Kazu

> -----Original Message-----
> > -----Original Message-----
> > > > > There are some other failure cases with non-null data, so maybe there's >1 bug here.
> > > > > I've not seen an obvious pattern to this. eg...
> > > > >
> > > > > https://pastebin.com/2uM4sBCF
> > > > >
> > > >
> > > > As for this case, I suspect that Elf64_Ehdr.e_phnum overflows
> > > > (i.e. num_loads_dumpfile > 65535):
> > >
> > > Oh, good catch. These are 256GB machines, so after discarding
> > > everything, that explains why we end up with so many sections.
> > > This also explains why it sometimes works I think, when the discarding
> > > manages to get the total nr headers <64k.
>
> I also could reproduce this issue on a system with 192GB memory.
> The note was actually overwritten by the following program headers.
> -----
> num_loads_dumpfile=76318 # more than 64k
> ehdr64.e_phnum=10783 # overflowed
> note.p_offset=0x93708 .p_filesz=0x2958 # The note data is at 0x93708
> note cd_header->offset=0x40
> ...
> head->off= 90040 load.p_addr= 44552e000 .p_off= ed270060 ...
>            ^^^^^ # these headers overwrote the note data.
> head->off= a0040 load.p_addr= 445630000 .p_off= ed272060 ...
> ...
> The dumpfile is saved to dump.Ed25.devel.
>
> makedumpfile Completed.
>
> # readelf -a dump.Ed25.devel
> ...
> Number of program headers: 10783
> ...
> Displaying notes found at file offset 0x00093708 with length 0x00002958:
> Owner Data size Description
> 0x00000007 Unknown note type: (0xdbce6060)
> description data: 00 00 7a 39 fffffff2 ffffff8a ffffffff
> # ../crash vmlinux dump.Ed25.devel
>
> WARNING: possibly corrupt Elf64_Nhdr: n_namesz: 4185522176 n_descsz: 3 n_type: f4000
> ...
> WARNING: cannot read linux_banner string
> crash: vmlinux and dump.Ed25.devel do not match!
> -----
>
> > I think this will be the one of the causes, and had a look at how
> > we can fix it. If you get a vmcore where this pattern occurs,
> > you can try this tree:
> > https://github.com/k-hagio/makedumpfile/tree/support-extended-elf
> >
> > Then, the crash utility also needs a patch to support a dumpfile
> > that has more than 64k program headers:
> > https://github.com/k-hagio/crash/tree/support-extended-elf
>
> These trees look to work well, though need more tests and tweaks.
> -----
> # readelf -a dump.Ed25.test
> ...
> Number of program headers: 65535 (76319) <<-- note + loads
> ...
> Displaying notes found at file offset 0x00413748 with length 0x00002958:
> Owner Data size Description
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> CORE 0x00000150 NT_PRSTATUS (prstatus structure)
> ...
> # ../crash-test vmlinux dump.Ed25.test
>
> crash-test> help -D
> vmcore_data:
> flags: c0 (KDUMP_LOCAL|KDUMP_ELF64)
> ndfd: 3
> ofp: 3141560
> header_size: 4284576
> num_pt_load_segments: 76318 <<-- loads
> pt_load_segment[0]:
> -----
>
> It is possible that the issue occurs on general systems if they have
> large memory, so I'm going to proceed with those patches.
>
> Thanks,
> Kazu
>

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec
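
For reference, the e_phnum wrap in the quoted logs can be reproduced with a
few lines of arithmetic. The sketch below is illustrative only and is not
taken from the makedumpfile or crash sources; the function names are
hypothetical. It assumes the standard Elf64 sizes (64-byte Elf64_Ehdr,
56-byte Elf64_Phdr) and the ELF extended-numbering convention, where
e_phnum == PN_XNUM (0xffff) means the real count is stored in sh_info of
section header 0. It also assumes the note data is placed right after the
program header table, which is consistent with the offsets in the log.

```python
# Illustrative sketch only; helper names are hypothetical.

EHDR64_SIZE = 64   # sizeof(Elf64_Ehdr)
PHDR64_SIZE = 56   # sizeof(Elf64_Phdr)
PN_XNUM = 0xFFFF   # reserved e_phnum value: real count lives in shdr[0].sh_info


def naive_e_phnum(num_phdrs):
    """Value left in the 16-bit e_phnum field if the count is simply truncated."""
    return num_phdrs & 0xFFFF


def phdr_table_end(num_phdrs):
    """File offset just past a program header table that starts after the ELF header."""
    return EHDR64_SIZE + num_phdrs * PHDR64_SIZE


def encode_phnum(num_phdrs):
    """Return (e_phnum, shdr[0].sh_info) using extended numbering when needed."""
    if num_phdrs >= PN_XNUM:
        return PN_XNUM, num_phdrs
    return num_phdrs, 0


def decode_phnum(e_phnum, shdr0_sh_info):
    """Recover the real program header count from the two fields."""
    return shdr0_sh_info if e_phnum == PN_XNUM else e_phnum


total = 76318 + 1                      # PT_LOAD headers plus one PT_NOTE

wrapped = naive_e_phnum(total)
print(wrapped)                         # 10783, as in ehdr64.e_phnum
print(hex(phdr_table_end(wrapped)))    # 0x93708, where the note was placed
print(hex(phdr_table_end(total)))      # 0x413708, where the headers really end
print(encode_phnum(total))             # (65535, 76319), as readelf reports
```

The second and third printed offsets show why the note was clobbered: it was
written at the offset implied by the wrapped count, while the 76319 real
headers extend well past that point.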