Hi,

On Mon, Mar 2, 2020 at 1:39 PM Dave Young <dyoung@xxxxxxxxxx> wrote:
>
> On 03/02/20 at 12:20am, Bhupesh Sharma wrote:
> > Hi Guilherme,
> >
> > On Sat, Feb 29, 2020 at 10:37 PM Guilherme G. Piccoli
> > <gpiccoli@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi Bhupesh and Dave (and everybody CC'ed here), I'm Guilherme Piccoli
> > > and I'm working on the same issue observed in RH bugzilla 1758323 [0] -
> > > or at least, it seems to be the same heh
> >
> > Ok.
> >
> > > The reported issue in my case was that the 2nd kexec fails on Nitro
> > > instances, and indeed it's reproducible. More than this, it shows as an
> > > initrd corruption. I've found 2 workarounds: using the "new" kexec
> > > syscall (by doing kexec -s -l) and keeping the initrd memory "un-freed",
> > > using the kernel parameter "retain_initrd".
> >
> > I have a couple of questions:
> > - How do you conclude that you see an initrd corruption across kexec?
> >   Do you print the initial hex contents of the initrd across kexec?
>
> I'm also interested if any of you can dump the initrd memory in the
> kernel printk log, and then save it somewhere to compare with the
> original initrd content.

I did several overnight tests on the aws machine and can confirm that the
kexec reboot failure issue (across multiple tries) can be seen even with
'retain_initrd' in the kernel bootargs, or by using kexec_file_load
('kexec -s -l') instead of plain kexec_load ('kexec -l').

Here are my observations:

1. Adding 'retain_initrd' to the bootargs helps delay the kexec reboot
   failure (when successive kexec reboots are executed), but the
   (possible?) initrd corruption is still seen (as per the panic logs
   from the kexec kernel).
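As an aside, the comparison Dave asks for above (dump the initrd memory
and compare it against the original image) can be scripted once the dumped
bytes have been reassembled into a file. A minimal sketch follows; the file
names are placeholders rather than anything from this thread, and for
illustration it fabricates two identical 64 KiB "images" instead of a real
initrd and a real dump:

```shell
# Sketch: check whether an initrd image survived a kexec cycle intact by
# comparing a fixed-size prefix of the original against the bytes dumped
# from memory. File names/contents are placeholders: in practice $orig is
# the initrd passed to kexec and $dump is reassembled from the printk log.
orig=$(mktemp) && dump=$(mktemp)
head -c 65536 /dev/urandom > "$orig"
cp "$orig" "$dump"

limit=65536   # compare only the dumped prefix (e.g. the first 4M)
if cmp -s -n "$limit" "$orig" "$dump"; then
    echo "prefix identical"
else
    echo "contents differ"
fi
rm -f "$orig" "$dump"
```

When the dump lives on a different machine, checksumming the prefix on
each side (`head -c "$limit" file | md5sum`) avoids copying the files
around.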
2. I printed the first 4M of the initrd file via kernel code (both in the
   primary and the kexec kernel, see
   <https://bugzilla.redhat.com/attachment.cgi?id=1667523> and
   <https://bugzilla.redhat.com/attachment.cgi?id=1667521>), and
   interestingly the first 4M contents are _exactly_ the same for the
   primary and the kexec kernel, even though we see a (possible?) initrd
   corruption. See the logs below from the kexec kernel in case of panic:

   [    4.229170] Call Trace:
   [    4.234379]  dump_stack+0x5c/0x80
   [    4.239840]  panic+0xe7/0x2a9
   [    4.245291]  do_exit.cold.22+0x59/0x81
   [    4.251025]  do_group_exit+0x3a/0xa0
   [    4.256784]  __x64_sys_exit_group+0x14/0x20
   [    4.262905]  do_syscall_64+0x5b/0x1a0
   [    4.268537]  entry_SYSCALL_64_after_hwframe+0x65/0xca
   [    4.275784] RIP: 0033:0x7ff749106e2e
   [    4.281469] Code: Bad RIP value.
   [    4.286981] RSP: 002b:00007fffb6d707f8 EFLAGS: 00000206 ORIG_RAX: 00000000000000e7
   [    4.298381] RAX: ffffffffffffffda RBX: 00007ff74910f528 RCX: 00007ff749106e2e
   [    4.305616] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
   [    4.313064] RBP: 00007ff749306000 R08: 00000000000000e7 R09: 00007fffb6d70708
   [    4.320369] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
   [    4.327671] R13: 0000000000000022 R14: 00007ff749306148 R15: 00007ff749306030
   [    4.335396] Kernel Offset: 0x2a400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
   [    4.348002] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
   [    4.348002] ]---

3. So the root cause seems to be something else. I will do some more
   debugging to evaluate the same.

4. I added two scripts (via
   <https://bugzilla.redhat.com/attachment.cgi?id=1667561> and
   <https://bugzilla.redhat.com/attachment.cgi?id=1667560>) which provide
   an automated reproducer. This reproducer can be run on the host machine
   and launches repeated kexec reboots on the aws machine. Normally
   approx. 5-12 runs of the master script (i.e.
kexec reboots) can lead to a panic in the kexec kernel, which indicates a
(possible?) initrd corruption.

@Guilherme: Can you please help verify the observations on your setup
(both the amazon and the upstream kernel) using the automated test
script? Thanks.

Regards,
Bhupesh

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec
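[Editorial note appended to the archived message: the automated reproducer
described in observation 4 boils down to a driver loop that repeatedly
triggers a kexec reboot on the target instance. The sketch below is not
the attached scripts; the ssh transport, host name, and kernel/initrd
paths are all assumptions, and the remote invocation is stubbed out with
`echo` so the loop itself is side-effect free.]

```shell
# Sketch of a repeated-kexec driver loop, run from the host machine.
# In real use RUN would be something like: RUN="ssh root@ec2-test-host";
# it defaults to echo here so running this sketch reboots nothing.
RUN=${RUN:-echo}
max=5

i=1
while [ "$i" -le "$max" ]; do
    echo "kexec reboot #$i"
    # Load via kexec_file_load ('kexec -s -l') and jump into the new
    # kernel; the kernel and initrd paths are placeholders.
    $RUN "kexec -s -l /boot/vmlinuz --initrd=/boot/initramfs.img --reuse-cmdline && kexec -e"
    # The real scripts would now poll until ssh answers again; if the
    # target never comes back, the kexec kernel presumably panicked,
    # matching the failure mode discussed in this thread.
    i=$((i + 1))
done
echo "completed $max kexec cycles"
```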