Re: [non-pretimeout,4/7] Watchdog: introduce ARM SBSA watchdog driver

Fu Wei <fu.wei@xxxxxxxxxx> · Wed, 24 Jun 2015 01:01:04 +0800

Hi Guenter

On 24 June 2015 at 00:43, Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> On Wed, Jun 24, 2015 at 12:17:19AM +0800, Fu Wei wrote:
>> Hi Guenter,
>>
>> You always help so quickly, thank you very much :-)
>>
>> On 23 June 2015 at 23:21, Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>> > On Tue, Jun 23, 2015 at 09:26:35PM +0800, Fu Wei wrote:
>> >> Hi Guenter,
>> > [ ...]
>> >
>> >> >
>> >> >> + *       When the first timeout occurs, WS0(SPI or LPI) is triggered,
>> >> >> + *       the second timeout period(as long as the first timeout period) starts.
>> >> >
>> >> > no longer accurate if WOR is used for the second period.
>> >> >
>> >> >> + *       In WS0 interrupt routine, panic() will be called for collecting
>> >> >> + *       crashdown info.
>> >> >> + *       If system can not recover from WS0 interrupt routine, then second
>> >> >> + *       timeout occurs, WS1(reset or higher level interrupt) is triggered.
>> >> >> + *       The two timeout period can be set by WOR(32bit).
>> >> >
>> >> > The second timeout period is determined by ...
>> >> >
>> >> >> + *       WOR gives a maximum watch period of around 10s at the maximum
>> >> >> + *       system counter frequency.
>> >> >> + *       The System Counter shall run at maximum of 400MHz.
>> >> >
>> >> > "... at the maximum system counter frequency of 400 MHz.", and drop the
>> >> > last sentence.
>> >>
>> >> For the second timeout period, I have discussed this with a kdump developer:
>> >> (1) 10s may not be enough for all cases of panic + kdump, so we may
>> >> still need to use WCV in the second timeout period.
>> >> (2) In the second timeout period, we may need to program WCV for
>> >> two reasons: (a) to trigger WS1 and reboot the system ASAP; (b) to feed
>> >> the watchdog without clearing the WS0 flag.
>> >>
>> >> Why would we want to feed the watchdog (keepalive) without clearing the
>> >> WS0 flag?
>> >> Reasons:
>> >> (1) If the system context is large, we may need to keep feeding the dog
>> >> until everything has been backed up.
>> >> (2) If the system goes wrong, WS0 is triggered, then panic --> kdump. If we
>> >> feed the dog through WRR or by programming WOR, the WS0 flag is cleared.
>> >> If the system goes wrong again, it panics again.
>> >> The system is then stuck in a panic--kdump--panic--kdump loop and never
>> >> gets a chance to reset.
>> >>
>> >> So if we are in the second timeout period, we may always need to program WCV.
>> >>
>> > The crashdump kernel is supposed to reload the watchdog driver, which will ping
>> > the watchdog. If it isn't able to do that in 10 seconds, something is wrong.
>>
>> Yes, 10s may not be enough in all cases.
>> When I tested kdump on arm64, it sometimes took 20s. So I am
>> wondering: can we make the maximum pretimeout value larger than 10s in
>> this driver?
>>
> It takes more than 10 seconds to load the crashdump kernel,
> or it takes more than 10 seconds to complete the dump ?

It takes more than 10 seconds to boot into the crashdump kernel (from the
panic until device init finishes in the crashdump kernel).
I think that may depend on the hardware or SoC.
As I said, 10 seconds may not be enough in all cases.
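
For reference, the ~10s limit follows from WOR being 32 bits wide: the
largest offset is 2^32 - 1 counter ticks. A minimal sketch of that
arithmetic, where wor_max_period_ms() is a hypothetical helper name used
only for illustration and is not part of the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not in the patch): longest period, in
 * milliseconds, that a 32-bit WOR can express at a counter
 * frequency of clk_hz.
 */
static uint64_t wor_max_period_ms(uint64_t clk_hz)
{
	assert(clk_hz != 0);
	/* UINT32_MAX * 1000 fits comfortably in 64 bits. */
	return ((uint64_t)UINT32_MAX * 1000u) / clk_hz;
}
```

With a 400 MHz counter this gives 10737 ms. A slower counter, for
example 100 MHz, would allow about 42949 ms.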

For completing the dump, 10 seconds may not be enough in some cases (large
RAM, dumping over the network, and so on);
that is why I added "ping without clearing WS0" support in the second stage.
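
To illustrate the difference between the two kinds of ping, here is a toy
software model, not the driver code. It assumes the behaviour described in
this thread: a refresh through WRR reloads WCV and clears the status flags,
while writing WCV directly only moves the deadline. The struct and function
names are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the watchdog state discussed here; not driver code. */
struct wdt_model {
	uint64_t wcv;	/* compare value, in counter ticks */
	uint32_t wor;	/* offset, in counter ticks */
	bool ws0;	/* first-stage signal */
	bool ws1;	/* second-stage signal */
};

/* Refresh as done through WRR: reload WCV and clear the status flags. */
static void model_refresh(struct wdt_model *w, uint64_t now)
{
	w->wcv = now + w->wor;
	w->ws0 = false;
	w->ws1 = false;
}

/* Ping by writing WCV directly: move the deadline, leave WS0 alone. */
static void model_write_wcv(struct wdt_model *w, uint64_t now, uint64_t ticks)
{
	w->wcv = now + ticks;
}

/* Evaluate the model at counter value 'now'; apply one timeout if due. */
static void model_tick(struct wdt_model *w, uint64_t now)
{
	assert(w);
	if (now < w->wcv)
		return;
	if (!w->ws0) {
		/* First timeout: raise WS0 and start the second period. */
		w->ws0 = true;
		w->wcv = now + w->wor;
	} else {
		/* Second timeout with WS0 still set: raise WS1. */
		w->ws1 = true;
	}
}
```

In this model, calling model_refresh() after WS0 has fired puts the
watchdog back into the first stage, so a second failure only raises WS0
again. Calling model_write_wcv() keeps WS0 set, so the next expiry raises
WS1.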

>
>>
>> >
>> >> >> +
>> >> >> +     status = readl_relaxed(gwdt->control_base + SBSA_GWDT_WCS);
>> >> >> +     if (status & SBSA_GWDT_WCS_WS1) {
>> >> >> +             dev_warn(dev, "System reset by WDT(WCV: %llx)\n",
>> >> >> +                      sbsa_gwdt_get_wcv(wdd));
>> >> >
>> >> > WCV here only tells us how many clock cycles were executed since the
>> >> > system started (or something like that). So I still don't understand
>> >> > why it is valuable to print that number.
>> >>
>> >> This number gives the time of the system reset; I think it may help
>> >> the admin analyse the system failure.
>> >>
>> > It doesn't mean anything to anyone but you since it is not in a well defined
>> > time scale.
>>
>> Maybe I should convert it to seconds?
>> Or maybe the original value is better?
>>
>
> I think you should drop it.

OK, will do in my next patchset.

But my opinion is that if the hardware provides this info and it lets the
admin know the crash time, it may help with debugging.
 :-)
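
If the value were ever reported, converting it to a defined time scale
could look roughly like the sketch below. It assumes WCV counts ticks of
a counter running at clk_hz; the names wcv_time and wcv_to_time() are
hypothetical and only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: whole seconds plus leftover milliseconds. */
struct wcv_time {
	uint64_t sec;
	uint32_t msec;
};

/* Convert a tick count to seconds and milliseconds at clk_hz. */
static struct wcv_time wcv_to_time(uint64_t wcv, uint64_t clk_hz)
{
	struct wcv_time t;

	assert(clk_hz != 0);
	t.sec = wcv / clk_hz;
	/* The remainder is below clk_hz, so scaling by 1000 stays in range. */
	t.msec = (uint32_t)(((wcv % clk_hz) * 1000u) / clk_hz);
	return t;
}
```

For example, 1000000000 ticks at 400 MHz come out as 2 s and 500 ms. The
result is still relative to whenever the counter started, not wall-clock
time.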

>
> Guenter

-- 
Best regards,

Fu Wei
Software Engineer
Red Hat Software (Beijing) Co.,Ltd.Shanghai Branch
Ph: +86 21 61221326(direct)
Ph: +86 186 2020 4684 (mobile)
Room 1512, Regus One Corporate Avenue,Level 15,
One Corporate Avenue,222 Hubin Road,Huangpu District,
Shanghai,China 200021