Well, it seems we have some good results with this patch [0]. The idea behind the issue is that the ena network driver has no PCI shutdown() handler, which would be called to gently quiesce the device before the kexec. The PCI stack in this case clears the bus master bit in the device's configuration space, effectively stopping all DMA transactions. But then, when the system boots the kexec'ed kernel, the network device firmware may send a memory write for that stopped DMA transaction (which is now invalid), corrupting some random kernel memory area.

I've run 1000 kexec tests with mainline (5.6-rc5) + this patch and no failures were observed. Also, I'm running a test with the Ubuntu 5.3 kernel + this patch and have reached > 450 runs so far, with no failures (the test is ongoing).

I've tried to dump the initrd content (it could be useful now to identify the corruption signature, maybe matching some ena admin queue periodic task), but I had trouble collecting the dmesg in case of failure. It gets huge and requires a big ramoops allocation, which unfortunately prevents the issue from happening (I guess the corruption ends up happening in the ramoops reserved area, not in the initrd area anymore).

Bhupesh, I've noticed that the Red Hat bugzilla suddenly went private. Is it okay to add me to the CC list so I can see it?

Thanks for all the collaboration, I hope the issue is now figured out and solved!

Cheers,

Guilherme

[0] lore.kernel.org/netdev/20200320125534.28966-1-gpiccoli@xxxxxxxxxxxxx
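
For readers less familiar with the master bit mentioned above, here is a minimal user-space sketch of what clearing it amounts to. It is only a model, not the patch in [0] and not kernel code: the struct and the model_* helpers are hypothetical names made up for illustration, and the only real-world details assumed are the bit values of the PCI command register, which match the definitions in Linux's pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values of the PCI command register, as defined in Linux's pci_regs.h. */
#define PCI_COMMAND_IO      0x1  /* respond to I/O space accesses */
#define PCI_COMMAND_MEMORY  0x2  /* respond to memory space accesses */
#define PCI_COMMAND_MASTER  0x4  /* device may initiate DMA (bus mastering) */

/* Hypothetical model of a device: only the command register is tracked. */
struct model_pci_dev {
    uint16_t command;
};

/* Clear only the bus master bit, leaving the other command bits intact. */
static void model_clear_master(struct model_pci_dev *dev)
{
    dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/* True if the modeled device is still allowed to start DMA transactions. */
static bool model_can_dma(const struct model_pci_dev *dev)
{
    return (dev->command & PCI_COMMAND_MASTER) != 0;
}
```

The model only covers the configuration-space side, i.e. whether the device is allowed to issue DMA. Quiescing the device itself before the kexec is what a driver shutdown() handler would be for.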