Re: DL380 instability with hpwdt

On Wed, Oct 09, 2024 at 09:00:00PM +0200, Ben Hutchings wrote:
> Hi Jerry,
> 
> The Debian kernel team received a number of reports over the past few
> years of instability of the Proliant DL380 G7 and DL380p G8, seemingly
> related to the hpwdt driver (in that this goes away if it is not
> loaded).  These reports can be seen at
> <https://bugs.debian.org/898336>.
> 
> The instability has been seen with kernel versions ranging from 4.16 to
> 6.1.y, including after the backport of commit dced0b3e51dd
> "watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").
> 
> I can see that hpwdt seems to be used for error reporting so it's not
> clear to me whether these are problems caused by the driver, or the
> driver is only reporting that something bad happened.
> 
> Do you have any ideas about what's going wrong here?  Is there
> something odd about these models that needs to be handled in hpwdt, or
> are they just popular models?

Hi Ben,

There are a couple of things that come to mind.

As you mentioned, hpwdt is used for error containment on ProLiants.
Especially on the older generations, errors would be raised as
NMIs, and the expectation was that hpwdt would handle the NMI and
initiate a kdump.  I have seen cases where shutting down file
systems can raise PCIe errors, which would be transmitted to the
SUT as an NMI and handled by hpwdt.

The second issue is that systemd enables WDT (not just hpwdt) during 
shutdown.  This is to handle the case where shutdown hangs.  The WDT
is supposed to break the system out of such situations.  The default
timeout is 10 minutes:
	/etc/systemd/system.conf:
	#RebootWatchdogSec=10min
(Note: I'm not a Debian user, but I believe the systemd behavior is the
same on Debian as it is on RHEL/SLES.)

While a ten minute delay to shutdown would be fairly obvious if you're
doing interactive testing, it might not be noticed if the testing is
automated.

To determine if either of the above is happening, you can:

o) Do the testing interactively and time the test.  Does the NMI come in
roughly 10 minutes after the start of the shutdown?

o) Check the IEL (iLO Event Log) and IML (Integrated Management Log) on
the iLO web interface.  Do you see any errors reported during the
shutdown?
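
For automated runs, the timing check in the first item could be scripted.
The following is only a rough sketch, not something taken from hpwdt or
systemd: the function names, the 30 second tolerance, and the small set of
time units it understands are my own assumptions, and the two timestamps
would have to come from your own test logs.

```python
import re
from datetime import datetime

# systemd default quoted above: RebootWatchdogSec=10min
DEFAULT_TIMEOUT_S = 600

# Assumption: only a few simple unit suffixes are handled here;
# systemd itself accepts more elaborate time spans.
_UNITS = {"": 1, "s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600}


def reboot_watchdog_seconds(conf_text):
    """Return RebootWatchdogSec in seconds, or None if it is disabled.

    Commented-out lines are skipped, so the default applies to them.
    """
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, val = line.partition("=")
        if key.strip() == "RebootWatchdogSec":
            value = val.strip()  # the last assignment wins
    if value is None:
        return DEFAULT_TIMEOUT_S
    if value in ("0", "off"):
        return None
    match = re.fullmatch(r"(\d+)\s*([a-z]*)", value)
    if not match or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported time span: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def looks_like_watchdog(shutdown_start, nmi_seen, timeout_s, tolerance_s=30):
    """True if the NMI arrived about timeout_s after shutdown began."""
    elapsed = (nmi_seen - shutdown_start).total_seconds()
    return abs(elapsed - timeout_s) <= tolerance_s
```

With the stock system.conf (the setting commented out), an NMI logged
about ten minutes after the shutdown started would point at the shutdown
watchdog; an NMI that arrives much earlier would point more toward a
hardware error being reported.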


Questions:
1) The Debian bug above mentions only Gen 7 and 8 systems.
   Are you seeing this issue on other ProLiant systems?

2) You mentioned back-porting commit dced0b3e51dd.  Does your
   drivers/watchdog/hpwdt.c source match upstream Linux? Or
   do you cherry-pick patches?  (Sorry, not knowing Debian,
   I don't know how to find/navigate your kernel source.)

Please let me know what you find.


Jerry


-- 

-----------------------------------------------------------------------------
Jerry Hoemann                  Software Engineer   Hewlett Packard Enterprise
-----------------------------------------------------------------------------



