Joe MacDonald wrote:
Resending to the whole list since the last time it looks like gmail
had decided I wanted to send HTML.
2008/5/27 Mike Frysinger <vapier.adi@xxxxxxxxx>:
On Tue, May 27, 2008 at 6:31 PM, T Ziomek wrote:
On Tue, May 27, 2008 at 06:27:58PM -0400, Mike Frysinger wrote:
On Tue, May 27, 2008 at 5:57 PM, David VomLehn wrote:
Continuous Logging for Watchdog Timer Expiration
------------------------------------------------
We run with a watchdog timer that can reboot the system. When we reboot, we
lose all of our status, making it very difficult to determine what went
wrong. Fortunately, there is only one major cause of not refreshing the
watchdog--a driver disabled interrupts for so long that the timer function
that resets the watchdog timer never had a chance to run. So, a way to log
which functions were enabling and disabling interrupts on a continuous basis,
along with a memory section that wouldn't be overwritten on reboot, would
allow capturing the cause for these otherwise "silent" reboots.
how do you propose addressing that? hardware watchdogs reset the
hardware, so there is nothing software can do to recover information
like register state. as soon as the watchdog timer expires, the state
is gone forever.
-mike
If I understand correctly David is talking about logging some trace-like
info (so it exists before a HW watchdog expires), and having it somewhere
"safe" from being disturbed by a HW reset.
in my mind, such a thing isn't really bound to the watchdog. while the
watchdog may be a common source, there are plenty of other sources
which would be addressed the same way. call it something like
"Continuous Logging for Unexpected Resets".
Probably no surprise to anyone, but this isn't just a requirement for
something like a set-top box; I've also seen this specific set of
features needed in the carrier space. When you have machines out in
the field, largely unattended, it's important to know what was
happening prior to a reset / panic / oops / whatever and usually it's
not an acceptable solution to leave the machine dead until an operator
can go check on it. One solution I've seen used (mentioned elsewhere
in the thread) is pramfs, but that's dependent on the behaviour of
both hardware and software outside the control of the kernel, and if
your problem happens to be somewhere close to VFS it's possible the
piece you need to do your logging is currently on fire.
That, and pramfs seems to be largely untouched these days, but since I
haven't used it recently, I don't know if the lack of activity is due
to neglect or stability.
There's a MontaVista patent on PRAMFS, and I think that most of the time,
when a company hears about it, it quickly skips this solution.
Some time ago I sent a port to 2.6.24, but I didn't receive any response.
Either way, the solution I've seen is to use a low-level interface to
some persistent storage on the board somewhere that provides a
userspace interface but is mainly focused on doing 'flight recorder'
type activities. Capturing kernel logs, IRQ tracing, recording
scheduler decisions, and so on.
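
A minimal sketch of the 'flight recorder' idea, to make the discussion
concrete. Everything here is an assumption for illustration, not taken from
any existing driver: the fr_* names, the magic value, and the record layout
are hypothetical, and the "persistent" region is simulated with an ordinary
struct. On real hardware the region would have to live in memory that neither
the reset nor the bootloader clears, and the writer would need the appropriate
locking and memory barriers, which are omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FR_MAGIC   0x464c5452u  /* arbitrary marker (hypothetical) */
#define FR_ENTRIES 8u           /* ring size, kept small for the example */

struct fr_entry {
    uint32_t seq;    /* sequence number of this record */
    uint32_t event;  /* hypothetical event code, e.g. IRQs disabled/enabled */
    uint32_t data;   /* e.g. address of the caller */
    uint32_t check;  /* seq ^ event ^ data ^ FR_MAGIC */
};

struct fr_region {
    uint32_t magic;     /* FR_MAGIC once the region has been initialised */
    uint32_t next_seq;  /* sequence number the next record will get */
    struct fr_entry ring[FR_ENTRIES];
};

/* XOR with the magic so that an all-zero slot never looks valid. */
static uint32_t fr_check(const struct fr_entry *e)
{
    return e->seq ^ e->event ^ e->data ^ FR_MAGIC;
}

/* Call at boot. Returns 1 if a log from before the reset is still there,
 * 0 if the region was uninitialised and has been set up fresh. */
static int fr_attach(struct fr_region *r)
{
    if (r->magic == FR_MAGIC)
        return 1;
    memset(r, 0, sizeof(*r));
    r->magic = FR_MAGIC;
    return 0;
}

/* Append one record, overwriting the oldest slot once the ring is full. */
static void fr_record(struct fr_region *r, uint32_t event, uint32_t data)
{
    struct fr_entry *e = &r->ring[r->next_seq % FR_ENTRIES];

    e->seq = r->next_seq;
    e->event = event;
    e->data = data;
    e->check = fr_check(e);
    r->next_seq++;
}

/* Copy the surviving records into out[] (room for FR_ENTRIES), oldest
 * first, skipping any slot whose check word does not match. Returns the
 * number of records copied. */
static unsigned fr_dump(const struct fr_region *r, struct fr_entry *out)
{
    unsigned n = 0;
    uint32_t start = r->next_seq >= FR_ENTRIES ? r->next_seq - FR_ENTRIES : 0;

    assert(r->magic == FR_MAGIC);
    for (uint32_t s = start; s != r->next_seq; s++) {
        const struct fr_entry *e = &r->ring[s % FR_ENTRIES];

        if (e->seq == s && e->check == fr_check(e))
            out[n++] = *e;
    }
    return n;
}
```

The intended use would be to call fr_attach() early at boot, dump whatever
survived if it returns 1, and then call fr_record() from the instrumented
paths. The per-record check word is there so that a record torn by the reset
is dropped rather than reported as valid.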
I don't know if anyone here is involved in the carrier space or not,
but it sounds like at least this feature is the same as (or very similar
to) the one we're discussing as a potential new requirement for CGL v5.
Yes, I agree. Several carrier-grade distributions, such as MontaVista or
Wind River, offer solutions for these problems, because they are
very important in the carrier space.
--
Marco Stornelli
Technical Development Engineer
CoRiTeL - Consorzio di Ricerca sulle Telecomunicazioni
http://www.coritel.it
marco.stornelli@xxxxxxxxxx
+39 06 72582838