Borislav Petkov <bp@xxxxxxxxx> writes:

> On Mon, Dec 30, 2024 at 01:54:36PM +0800, Huang, Ying wrote:
>> For example, it may be OK to wait forever for a software error, but it may
>> be better to reboot the system to contain the influence of the hardware
>> error for some hardware errors.
>
> A default panic timeout of 30 seconds for hw errors?! You do realize that 30
> seconds for a machine is an eternity and by that time your hardware error has
> long propagated and corrupted results, right?
>
> So your timeout is not even trying to do what you want.
>
> So unless I'm missing something, this ghes timeout needs to go - if you want
> to "contain the influence" you need to panic *immediately*! And not even that
> would help in some cases - hw has its own protections there so the OS
> panicking is meh. At least on x86, that is.

OK.  30 seconds isn't good enough for hw errors.

Another possible benefit of ghes_panic_timeout is that rebooting instead
of waiting forever can help us log/report the hardware errors earlier.
For example, the hardware errors could be logged in some simple
non-volatile storage (such as EFI variables) during panic or kdump.
Then, after reboot, the new kernel could report the hardware errors in
some way.

>> So, we introduced another knob for that.
>
> No, that another knob is piling more of the silly ontop.

---
Best Regards,
Huang, Ying