Hello Guilherme,

On Fri, Mar 20, 2020 at 9:10 PM Guilherme G. Piccoli
<gpiccoli@xxxxxxxxxxxxx> wrote:

Thanks for writing again. I was caught up trying several other
suggestions and code snippets to debug this further.

I tried several combinations: turning the IOMMU off, turning off
swiotlb in the kexec kernel, and adding retain_initrd to the kexec
kernel's bootargs. But nothing seems to fix the failures in nested,
repeated kexec reboots on the AWS t3 machines I have. It only gets
better on a few instances (i.e. the kexec reboots survive around 10
nested attempts), while on the others the failure shows up quite
frequently (after approximately 3 kexec reboot attempts).

> Well, it seems we have some good results with this patch [0] - the idea
> behind the issue is that the ena network driver has no PCI shutdown()
> handler, which would be called to gently quiesce the device before the
> kexec. The PCI stack in this case clears the master bit of the device
> configuration space, effectively stopping all the DMA transactions. But
> then, when the system boots the kexec'ed kernel, the network device
> firmware may send a memory write regarding that stopped DMA transaction
> (that is now invalid), corrupting some random kernel memory area.
>
> I've run 1000 kexec tests with mainline (5.6-rc5) + this patch and no
> failures were observed. Also, I'm running a test with Ubuntu 5.3 kernel
> + this patch and achieved > 450 runs now, with no failures (test is
> ongoing).
>
> I've tried to dump the initrd content (could be useful now to identify
> the corruption signature, maybe matching some ena admin queue periodic
> task) but I had trouble collecting the dmesg in case of failure. It gets
> huge and requires a big ramoops allocation, which unfortunately prevents
> the issue from happening (I guess the corruption ends up happening in
> the ramoops reserved area, not initrd area anymore).

This is really good debugging and a good resulting patch.
I ran almost 60 repeated kexec attempts last night and repeated the
same this morning, and the issue seems to be fixed for me with upstream
kernel 5.6.0-rc6+ plus this patch. I am leaving a test running overnight
with the RHEL kernel + this patch and will have more updates to share by
tomorrow morning.

> Bhupesh, I've noticed that suddenly the Red Hat bugzilla got private -

Oops. I will check.

> is it okay to add me in CC list so I can see it?

Sure. I tried doing it, but Bugzilla is not happy: it keeps complaining
that you are not registered on BZ. I will try to find out internally how
to get around the issue.

> Thanks for all the collaboration, I hope the issue was figured and solved!

Sure. Thanks a lot for your inputs and for trying the suggestions I
posted on the Bugzilla ticket. I will soon share an update on the
RHEL/Fedora kernel kexec tests with this patch applied, and also reply
with a Tested-by for the upstream patch in the relevant thread.

Thanks,
Bhupesh

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec