Re: About kexec issues in AWS nitro instances (RH bz 1758323)

"Guilherme G. Piccoli" <gpiccoli@xxxxxxxxxxxxx> · Fri, 20 Mar 2020 12:40:13 -0300

Well, it seems we have some good results with this patch [0]. The idea
behind the issue is that the ena network driver has no PCI shutdown()
handler, which would be called to gently quiesce the device before the
kexec. The PCI stack in this case clears the bus master bit in the device's
configuration space, effectively stopping all DMA transactions. But
then, when the system boots the kexec'ed kernel, the network device
firmware may send a memory write belonging to that stopped DMA transaction
(which is now invalid), corrupting some random kernel memory area.
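
To make the sequence above a bit more concrete, here is a minimal userspace
sketch in C. It is not kernel code and not the patch in [0]: every mock_*
name is hypothetical, and the only real-world value is the bus master enable
bit (0x4) of the PCI command register. It just models the ordering problem:
without a shutdown handler the pending DMA is never quiesced, so it can still
land once the new kernel re-enables bus mastering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bus master enable bit of the PCI command register. */
#define PCI_COMMAND_MASTER 0x4

/* Hypothetical model of a PCI device, for illustration only. */
struct mock_dev {
    uint16_t command;                    /* mock PCI command register */
    bool dma_pending;                    /* firmware still owes a DMA write */
    void (*shutdown)(struct mock_dev *); /* optional driver shutdown handler */
};

/* What a driver shutdown handler is assumed to do: quiesce the device. */
static void mock_quiesce(struct mock_dev *d)
{
    d->dma_pending = false;
}

/* Models the pre-kexec path: run the handler if any, then clear bus master. */
static void mock_pci_shutdown(struct mock_dev *d)
{
    assert(d != NULL);
    if (d->shutdown)
        d->shutdown(d);
    d->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/*
 * Models the kexec'ed kernel enabling bus mastering again.
 * Returns true if a stale DMA write could still hit memory.
 */
static bool mock_stale_dma_after_kexec(struct mock_dev *d)
{
    d->command |= PCI_COMMAND_MASTER;
    return d->dma_pending;
}
```

Calling mock_pci_shutdown() on a device whose shutdown pointer is NULL leaves
dma_pending set, so mock_stale_dma_after_kexec() reports a possible stale
write; with mock_quiesce installed as the handler it does not.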

I've run 1000 kexec tests with mainline (5.6-rc5) + this patch and no
failures were observed. Also, I'm running a test with the Ubuntu 5.3 kernel
+ this patch and have reached > 450 runs so far, with no failures (test is
ongoing).

I've tried to dump the initrd content (it could be useful now to identify
the corruption signature, maybe matching some ena admin queue periodic
task), but I had trouble collecting the dmesg in case of failure. It gets
huge and requires a big ramoops allocation, which unfortunately prevents
the issue from reproducing (I guess the corruption ends up happening in
the ramoops reserved area, not in the initrd area anymore).

Bhupesh, I've noticed that the Red Hat bugzilla suddenly became private -
would it be okay to add me to the CC list so I can see it?
Thanks for all the collaboration, I hope the issue is now figured out and solved!
Cheers,

Guilherme

[0] lore.kernel.org/netdev/20200320125534.28966-1-gpiccoli@xxxxxxxxxxxxx

_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec