Hi Guilherme,

On Sat, Feb 29, 2020 at 10:37 PM Guilherme G. Piccoli <gpiccoli@xxxxxxxxxxxxx> wrote:
>
> Hi Bhupesh and Dave (and everybody CC'ed here), I'm Guilherme Piccoli
> and I'm working on the same issue observed in RH bugzilla 1758323 [0] -
> or at least, it seems to be the same heh

Ok.

> The reported issue in my case was that the 2nd kexec fails on Nitro
> instances, and indeed it's reproducible. More than this, it shows as
> an initrd corruption. I've found 2 workarounds: using the "new" kexec
> syscall (by doing kexec -s -l) and keeping the initrd memory
> "un-freed" via the kernel parameter "retain_initrd".

I have a couple of questions:

- How do you conclude that you are seeing initrd corruption across
  kexec? Do you print the initial hex contents of the initrd across
  kexec?

- Also, do you try repeated/nested kexec and see initrd corruption
  after several kexec reboot attempts?

I have the following observations on my Nitro instance:

- With an upstream kernel (5.6.0-rc3), I am seeing that repeated kexec
  attempts, even with 'kexec -s -l' and with 'retain_initrd' in the
  kernel bootargs, can lead to kexec reboot failures, although the
  frequency of the failures goes down drastically with these, as
  compared to a vanilla 'kexec -s' invocation.

  Here are the aws console logs on the nitro console with kernel
  5.6.0-rc3+ on an x86_64 instance when 'kexec -s -l' or 'kexec -l'
  with 'retain_initrd' fails:

  login: [   80.077578] Unregister pv shared memory for cpu 1
  [   80.081755] Unregister pv shared memory for cpu 0
  [   80.209953] kexec_core: Starting new kernel
  2020-02-29T19:20:16+00:00
  <.. no console logs after this (even after adding earlycon) ..>

- Note that there are no further console logs from the kexec'd kernel
  in the failure case, so I am not sure whether this was caused by
  some other issue or by the initrd corruption alone.

- With the above, one needs to execute kexec reboot repeatedly;
  normally around the 11th-15th kexec reboot run you can see a kexec
  reboot failure.

> I've noticed that your interesting investigation in the BZ led to
> SWIOTLB as a potential culprit, but trying with "swiotlb=noforce" or
> even "iommu=off" didn't help me.
> Also, worth noticing a weird behavior: it seems Amazon Linux 2 (based
> on kernel 4.14) sometimes works, or better said, it works on some
> instances. I have 2x t3.large instances, and on one of them I can
> make Amazon Linux work (and to rule out potential out-of-tree
> patches, I've used the Amazon Linux 2 config file and built a
> mainline 4.14, which also works on that particular instance).

That's good news. I am not sure about Amazon Linux (I don't know
whether its source is available without buying a license).

I can share that "swiotlb=noforce" worked for me on one instance, but
this was not reproducible on other nitro instances, so I think the
underlying issue is initrd corruption; I just haven't been able to
pin-point the root cause of the corruption yet.

BTW, have you been able to try the following kexec-tools fix as well
(see [1]) and check whether it fixes the initrd corruption with
'kexec -s -l' and 'kexec -l' (i.e. without using the 'retain_initrd'
bootarg)?

[1]. http://lists.infradead.org/pipermail/kexec/2020-February/024531.html
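In case it helps to compare notes on the reproducer, below is roughly
the kind of script I use to drive the repeated kexec reboots I
mentioned above, invoked once per boot (e.g. from a one-shot systemd
unit or rc.local) so that the attempt counter survives each kexec.
Please treat it only as a sketch: the vmlinuz/initramfs naming, the
counter file, the 20-attempt cap and the 5 second delay are just
choices from my test setup, not anything mandated by kexec-tools.

#!/bin/sh
# Count how many kexec reboot attempts we have done so far; keep the
# counter on disk so it survives the kexec reboots.
COUNT_FILE=/var/tmp/kexec-count
COUNT=$(( $(cat "$COUNT_FILE" 2>/dev/null || echo 0) + 1 ))
echo "$COUNT" > "$COUNT_FILE"
sync
echo "kexec reboot attempt $COUNT" > /dev/console

# Stop after 20 attempts so a "good" instance does not loop forever.
[ "$COUNT" -gt 20 ] && exit 0

# Load the currently running kernel/initramfs via the kexec_file_load
# syscall ('-s'); drop '-s' to exercise the older kexec_load path.
kexec -s -l "/boot/vmlinuz-$(uname -r)" \
      --initrd="/boot/initramfs-$(uname -r).img" \
      --reuse-cmdline

# Give the console a moment to drain, then jump into the new kernel.
sleep 5
kexec -e

With this in place the instance keeps kexec-rebooting on its own until
it either hits the failure or reaches the cap.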
> The reason for this email is to ask if you managed to figure out the
> issue's root cause, or have some leads. I continue debugging here,
> but it's a bit difficult without access to the AWS hypervisor (and it
> seems like a hypervisor issue to me). The fact that preserving the
> initrd memory prevents the problem seems to indicate that after such
> high-address memory is freed, the hypervisor somehow manages to use
> it regardless of whether some other code is using it... ending up
> corrupting the initrd.
>
> I've also looped in the kexec list in order to grow the audience;
> maybe somebody has already faced this kind of issue and has some
> ideas. A collaboration on this debugging would be greatly appreciated
> by me; it's a quite interesting issue and I'm looking forward to
> understanding what's going on.
>
> Thanks in advance,

Thanks a lot for your email. Let's continue discussing and hopefully
we will have a fix for the issue soon.

Regards,
Bhupesh