Re: About kexec issues in AWS nitro instances (RH bz 1758323)

[I'm responding on top of my last message, fully quoting it below,
because it was moderated and didn't get published on the mailing list
for some reason. If any moderator can make it public, I'd appreciate
it!]

Hi Bhupesh, I re-tested using 5.6-rc4 with "retain_initrd" and
"swiotlb=noforce" and got quite an interesting discrepancy. The first
run gave me 99 kexecs with no issue (the public IP of my AWS instance
was 3.215.x.y). After this, I powered the instance off and, some
minutes later, restarted it (the new IP was 34.239.x.y) - guess what?
It failed on the 6th kexec iteration with an oops, which I was able to
collect [0] using pstore.
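
[Side note, in case anyone wants to reproduce the collection: pstore
here is just ramoops backed by a chunk of reserved RAM, something
along these lines on the kernel command-line (the address/sizes are
purely illustrative - pick a range that is valid RAM on your instance,
and note the '$' may need escaping in the bootloader config):

    memmap=1M$0x90000000 ramoops.mem_address=0x90000000
    ramoops.mem_size=0x100000 ramoops.record_size=0x20000
    ramoops.console_size=0x20000

After the crash and reboot, the records show up under /sys/fs/pstore.]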

So, I'm inclined to think that when I restarted the instance (and it
got a different IP, in a different range), it likely got deployed on a
different host, which would explain some of the differences we are
observing across tests. I collected DMI data on both, but it didn't
show any difference - it is feasible, though, to hide host details
from the guest (almost?) completely, so this should be a question for
AWS.

Finally, I forgot to mention this in the previous email: you asked me
about testing kexec-tools with commit [1], and I tried that as well,
but it didn't help, especially because it affects the "/proc/iomem"
memory read path, while kexec-tools uses get_memory_ranges_sysfs() by
default, which reads from the firmware memmap.
In the past I tried to force kexec-tools to read from /proc/iomem, but
it didn't help with the issue. Now I tried again, forcing the usage of
get_memory_ranges_proc_iomem() with patch [1] merged, but the same
issue reproduces (failure on the 2nd kexec with initrd corruption).
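
[For reference, "forcing" it is just a local hack in kexec-tools'
get_memory_ranges() (kexec/arch/i386/kexec-x86-common.c for x86) -
roughly the sketch below, quoting from memory, so the exact function
signatures and surrounding context may differ in your tree:

    int get_memory_ranges(struct memory_range **range, int *ranges,
                          unsigned long kexec_flags)
    {
            /*
             * Testing hack: skip the sysfs firmware memmap path
             * (get_memory_ranges_sysfs()) entirely and always parse
             * /proc/iomem instead.
             */
            return get_memory_ranges_proc_iomem(range, ranges);
    }
]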

Cheers,


Guilherme



[0] https://pastebin.ubuntu.com/p/fS6c3sPMgk/
[1] http://lists.infradead.org/pipermail/kexec/2020-February/024531.html


On 04/03/2020 16:22, Guilherme G. Piccoli wrote:
> On 04/03/2020 15:39, Bhupesh Sharma wrote:
>> Hi,
> 
> Hi Bhupesh, thanks for your prompt and thorough response!
> I managed to do some tests myself, based on your last email, and will
> share my results inline below:
> 
> 
>>
>> On Mon, Mar 2, 2020 at 1:39 PM Dave Young <dyoung@xxxxxxxxxx> wrote:
>>>>
>>>> I have a couple of questions:
>>>> - How do you conclude that you see initrd corruption across kexec?
>>>> Do you print the initial hex contents of the initrd across kexec?
>>>
>>> I'm also interested in whether any of you can dump the initrd memory
>>> into the kernel printk log and save it somewhere, to compare with
>>> the original initrd content.
> 
> I didn't print it yet, Dave, but it seems Bhupesh did, and the first
> 4M are the same, right? The way the issue shows up for me is an oops
> on the 2nd kexec (in other words, the 1st kexec from an already
> kexec'ed kernel!), with the following message:
> 
> "Initramfs unpacking failed: junk in compressed archive"
> 
> Also, I've added debug code to the kernel initramfs routines to
> trace_printk file by file as they get decompressed; then, by setting
> "ftrace_dump_on_oops", I could check the list of files, and it's
> indeed partial (the biggest part of the files is not decompressed).
> It usually fails in this if, in flush_buffer() [init/initramfs.c]:
> 
> if (c == '0')
> [...]
> else if (c == 0)
> [...]
> else
> [junk]
> 
> A print of the 'c' variable at this point shows its value as 6.
> I'm attaching here a dmesg (collected through pstore/ramoops) so you
> can take a look.
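> 
> (FWIW, the debug code itself is nothing fancy - essentially a
> trace_printk in do_name() [init/initramfs.c]; quoting from memory, so
> the exact spot may differ:
> 
>     static int __init do_name(void)
>     {
>             state = SkipIt;
>             next_state = Reset;
>             /* debug: log every file name as it gets extracted */
>             trace_printk("initramfs: extracting '%s'\n", collected);
>             ...
>     }
> 
> plus "ftrace_dump_on_oops" on the command-line, as mentioned above.)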
> 
> 
>>
>> I did several overnight tests on the aws machine and can confirm the
>> kexec reboot failure issue (multiple tries) can be seen even with
>> 'retain_initrd' in the kernel bootargs or by using kexec_file_load
>> ('kexec -s -l') instead of plain kexec_load ('kexec -l').
>>
> 
> I managed to test multiple kexecs in an automated way (using a crontab
> entry plus a script with a counter on my AWS instance) and you are
> right: after some kexecs it fails. My test survived 70 kexecs and
> failed in the end by not jumping into the new kernel - failing really
> early and getting stuck on "kexec_core: Starting new kernel", as you
> said.
> 
> This seems to be a different manifestation of the issue; we seem to
> prevent the usual effect of initrd "corruption" by using the
> "retain_initrd" parameter.
> 
> Also, when I added both "retain_initrd" and "swiotlb=noforce" to the
> command-line, the test failed after 10 iterations in a different way -
> it crashed and rebooted into the regular kernel (as I have
> "oops=panic" and "panic=1" in my cmdline), but pstore wasn't enabled
> in that test, so I didn't collect that information (I plan to
> re-test).
> 
> 
>> - Here are my observations:
>>
>> 1. Adding 'retain_initrd' to the bootargs helps delay the kexec
>> reboot failure (when successive kexec reboots are executed), but the
>> (possible?) initrd corruption is still seen (as per the panic logs
>> from the kexec kernel).
>>
>> 2. I printed the first 4M of the initrd file via kernel code (both in
>> the primary and the kexec kernel, see
>> <https://bugzilla.redhat.com/attachment.cgi?id=1667523> and
>> <https://bugzilla.redhat.com/attachment.cgi?id=1667521>) and,
>> interestingly, the first 4M contents are _exactly_ the same for the
>> primary and kexec kernels, even though we see a (possible?) initrd
>> corruption. See the logs below from the kexec kernel in case of panic:
>>
>> [    4.229170] Call Trace:
>> [    4.234379]  dump_stack+0x5c/0x80
>> [    4.239840]  panic+0xe7/0x2a9
>> [    4.245291]  do_exit.cold.22+0x59/0x81
>> [    4.251025]  do_group_exit+0x3a/0xa0
>> [    4.256784]  __x64_sys_exit_group+0x14/0x20
>> [    4.262905]  do_syscall_64+0x5b/0x1a0
>> [    4.268537]  entry_SYSCALL_64_after_hwframe+0x65/0xca
>> [    4.275784] RIP: 0033:0x7ff749106e2e
>> [    4.281469] Code: Bad RIP value.
>> [    4.286981] RSP: 002b:00007fffb6d707f8 EFLAGS: 00000206 ORIG_RAX:
>> 00000000000000e7
>> [    4.298381] RAX: ffffffffffffffda RBX: 00007ff74910f528 RCX: 00007ff749106e2e
>> [    4.305616] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
>> [    4.313064] RBP: 00007ff749306000 R08: 00000000000000e7 R09: 00007fffb6d70708
>> [    4.320369] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000000
>> [    4.327671] R13: 0000000000000022 R14: 00007ff749306148 R15: 00007ff749306030
>> [    4.335396] Kernel Offset: 0x2a400000 from 0xffffffff81000000
>> (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> [    4.348002] ---[ end Kernel panic - not syncing: Attempted to kill
>> init! exitcode=0x00007f00
>> [    4.348002]  ]---
>>         2020-03-03T09:01:27+00:00
>>
> 
> This is really interesting! If you could share the code you used to
> dump the initrd, I can try it in my mainline build with the Ubuntu
> config and dump the whole initrd, to check whether it's the same on
> the regular and kexec'ed kernels. I was planning to work on something
> like this after Dave's suggestion...
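> 
> (Something along these lines in populate_rootfs() [init/initramfs.c],
> right before unpack_to_rootfs() runs on the initrd, would probably do
> it - an untested sketch, and a checksum is likely saner than
> hex-dumping the whole thing:
> 
>     #include <linux/crc32.h>    /* crc32(); needs CONFIG_CRC32 */
> 
>     pr_info("initrd: start=0x%lx len=%lu crc32=0x%08x\n",
>             initrd_start, initrd_end - initrd_start,
>             crc32(0, (void *)initrd_start, initrd_end - initrd_start));
>     /* hex-dump just the first 64 bytes for a quick eyeball check */
>     print_hex_dump(KERN_INFO, "initrd: ", DUMP_PREFIX_OFFSET, 16, 1,
>                    (void *)initrd_start, 64, true);
> 
> If the crc32 differs between the regular and the kexec'ed boot, the
> initrd memory really changed under us.)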
> 
> Also, my oops splat is different from yours (as you can see in the
> attached dmesg); it really seems the initrd "corruption" is just one
> potential side effect of this issue, and you're observing a different
> failure mode.
> 
> 
>> 3. So the root cause seems to be something else. I will do some more
>> debugging to investigate further.
> 
> Agreed! I'll keep debugging from here too. I'm considering
> instrumenting the shutdown path and adding "retain_initrd" to see if
> I can reproduce that hang (on "Starting new kernel") and collect more
> information - the difficult part is that when that issue occurs, I
> can't access the console via the AWS interface, and pstore won't work
> on this shutdown hang, since it's not an oops event heheh
> 
> 
>>
>> 4. I added two scripts (via
>> <https://bugzilla.redhat.com/attachment.cgi?id=1667561> and
>> <https://bugzilla.redhat.com/attachment.cgi?id=1667560>) which provide
>> an automated reproducer.
>>
>> This reproducer can be run on the Host machine and launches repeated
>> kexec reboots on the aws machine.
>>
>> Normally approx. 5-12 runs of the master script (i.e. kexec reboots)
>> can lead to a panic in the kexec kernel which indicates a (possible ?)
>> initrd corruption.
>>
>> @Guilherme: Can you please help verify the observations on your setup
>> (both amazon and upstream kernel) using the automated test script?
> 
> Thanks for sharing the script! I guess my approach with crontab
> already allowed me to verify your observations, right?
> 
> Now, what about "swiotlb=noforce" - does it still work for you as a
> workaround for this issue? Do you mind sharing your .config with me,
> so I can try with your exact config and see whether, instead of
> initrd "corruption", I hit the same exact signature you got?
> 
> Thanks again, I appreciate your collaboration a lot =)
> Cheers,
> 
> 
> Guilherme
> 
> 
> 
>> Thanks.
>>
>> Regards,
>> Bhupesh
>>

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec


