On Wed, Oct 09, 2019 at 08:03:51PM +0000, Kazuhito Hagio wrote:

> >   0x0000000000000000 0x0000000000000000 0
> >   NULL 0x0000000000000000 0x0000000000000000 0x0000000000000000
> >   0x0000000000000000 0x0000000000000000 0
> >
>
> In this case, was the "makedumpfile Completed." message emitted?
> It looks like the buffer of program headers was not written to the file..

Our logging infra didn't capture the makedumpfile output.
I've fixed that up, so hopefully we'll have it next time.

> Anyway, a debugging patch attached below.
>
> > There are some other failure cases with non-null data, so maybe there's >1 bug here.
> > I've not seen an obvious pattern to this. eg...
> >
> > https://pastebin.com/2uM4sBCF
> >
>
> As for this case, I suspect that Elf64_Ehdr.e_phnum overflows
> (i.e. num_loads_dumpfile > 65535):

Oh, good catch. These are 256GB machines, so after discarding
everything, that explains why we end up with so many sections.
I think this also explains why it sometimes works: when the discarding
manages to get the total nr headers <64k.

> > I'll put your patch on some of the affected hosts and see if this
> > changes behaviour in any way.
>
> If you can try the patch below, which includes the previous patch,
> please show me:
>   - the debugging output of makedumpfile
>   - readelf -a vmcore
>   - ls -ls vmcore

Will take me a few days (travelling right now), but hopefully by the
time I get back we'll have some data.

Thanks for looking into this.

Dave
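
For reference, a minimal sketch of the e_phnum limit discussed above. e_phnum is a 16-bit field, so a count above 65535 wraps if it is stored directly. The ELF format reserves 0xffff (PN_XNUM) to mean "the real count is in sh_info of section header 0". The helper names below are hypothetical and only illustrate the arithmetic; this is not makedumpfile code, and I'm not claiming this is how the fix should look.

```c
#include <stdint.h>

/* Same value as PN_XNUM in <elf.h>, redefined here so the sketch
 * builds without Linux headers. */
#define SKETCH_PN_XNUM 0xffffu

/* Hypothetical helper: what a plain 16-bit store of the count yields.
 * Counts above 65535 silently wrap around. */
static uint16_t sketch_truncated_phnum(uint64_t num_phdrs)
{
    return (uint16_t)num_phdrs;
}

/* Hypothetical helper: value for Elf64_Ehdr.e_phnum under ELF
 * extended numbering. Counts >= PN_XNUM are replaced by the marker. */
static uint16_t sketch_e_phnum(uint64_t num_phdrs)
{
    if (num_phdrs >= SKETCH_PN_XNUM)
        return (uint16_t)SKETCH_PN_XNUM;
    return (uint16_t)num_phdrs;
}

/* Hypothetical helper: value for sh_info of section header 0.
 * It carries the real count only when the marker is in use. */
static uint32_t sketch_shdr0_info(uint64_t num_phdrs)
{
    if (num_phdrs >= SKETCH_PN_XNUM)
        return (uint32_t)num_phdrs;
    return 0;
}
```

With an assumed count of 70000 headers, the plain store gives 70000 - 65536 = 4464, which is the kind of mismatch that would make readelf show a header table that doesn't line up with the file.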