On Wed, Oct 09, 2024 at 09:00:00PM +0200, Ben Hutchings wrote:
> Hi Jerry,
>
> The Debian kernel team received a number of reports over the past few
> years of instability of the Proliant DL380 G7 and DL380p G8, seemingly
> related to the hpwdt driver (in that this goes away if it is not
> loaded). These reports can be seen at
> <https://bugs.debian.org/898336>.
>
> The instability has been seen with kernel versions ranging from 4.16 to
> 6.1.y, including after the backport of commit dced0b3e51dd
> "watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").
>
> I can see that hpwdt seems to be used for error reporting so it's not
> clear to me whether these are problems caused by the driver, or the
> driver is only reporting that something bad happened.
>
> Do you have any ideas about what's going wrong here? Is there
> something odd about these models that needs to be handled in hpwdt, or
> are they just popular models?

Hi Ben,

There are a couple of things that come to mind.

As you mentioned, hpwdt is used for error containment on ProLiants
(especially on the older generations). Errors would be raised as NMI
and the expectation was that hpwdt would handle the NMI and initiate
a kdump.

I have seen cases where shutting down file systems can raise PCIe
errors which would be transmitted to the SUT as NMI and handled by
hpwdt.

The second issue is that systemd enables WDT (not just hpwdt) during
shutdown. This is to handle the case where shutdown hangs. The WDT
is supposed to break the system out of such situations. The default
timeout is 10 minutes:

	/etc/systemd/system.conf: #RebootWatchdogSec=10min

(Note: I'm not a Debian user, but I believe the systemd behavior is
the same on Debian as it is on RHEL/SLES.)

While a ten minute delay to shutdown would be fairly obvious if
you're doing interactive testing, it might not be noticed if the
testing is automated.

To determine if either of the above is happening, you can:

o) Do the testing interactively and time the test. Does the NMI
   come in roughly 10 minutes after the shutdown?

o) Check the IEL and IML on the iLO web interface. Do you see
   any errors reported during the shutdown?

Questions:

1) The Debian bug above mentions only Gen 7 and 8 systems. Are you
   seeing this issue on other ProLiant systems?

2) You mentioned back-porting commit dced0b3e51dd. Does your
   drivers/watchdog/hpwdt.c source match upstream Linux? Or do
   you cherry pick patches? (Sorry, not knowing Debian, I don't
   know how to find/navigate your kernel source.)

Please let me know what you find.

Jerry

-- 

-----------------------------------------------------------------------------
Jerry Hoemann                  Software Engineer   Hewlett Packard Enterprise
-----------------------------------------------------------------------------
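
For anyone who wants to automate the first check suggested above (does the NMI arrive roughly one watchdog timeout after shutdown starts?), here is a minimal Python sketch. Everything in it is an assumption made for illustration and not part of hpwdt or systemd: the function names are hypothetical, it only reads text shaped like /etc/systemd/system.conf, it ignores drop-in directories, it understands only simple time values such as "600" or "10min", and the 30 second tolerance is arbitrary.

```python
import re

# Default quoted in the mail above (the commented-out line in system.conf).
DEFAULT_REBOOT_WATCHDOG = "10min"

# Only a small, assumed subset of time suffixes is handled here.
_UNITS = {"": 1, "s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600}


def parse_timespan(value):
    """Convert a simple time value such as '10min' or '600' to seconds."""
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", value.strip())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError("unsupported time value: %r" % value)
    return int(match.group(1)) * _UNITS[match.group(2)]


def reboot_watchdog_seconds(conf_text):
    """Return the RebootWatchdogSec setting found in conf_text, in seconds.

    Commented-out lines are skipped, so a file that only contains
    '#RebootWatchdogSec=10min' yields the assumed default.
    """
    value = DEFAULT_REBOOT_WATCHDOG
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue  # blank line, comment, or section header
        key, _, val = line.partition("=")
        if key.strip() == "RebootWatchdogSec":
            value = val.strip()  # last assignment wins
    return parse_timespan(value)


def nmi_matches_watchdog(shutdown_to_nmi_seconds, timeout_seconds, tolerance=30):
    """True if the NMI arrived within `tolerance` seconds of the timeout."""
    if timeout_seconds <= 0:
        return False  # watchdog disabled, so it cannot be the cause
    return abs(shutdown_to_nmi_seconds - timeout_seconds) <= tolerance
```

To use it, read the contents of system.conf on the affected machine into a string, pass it to reboot_watchdog_seconds(), and compare the result against the measured time between the start of shutdown and the NMI. A match only suggests that the shutdown watchdog fired; an NMI that arrives well before the timeout points more toward the error-containment path, which is what the IEL/IML check is for.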