Resending to the whole list, since last time it looks like gmail decided I wanted to send HTML.

2008/5/27 Mike Frysinger <vapier.adi@xxxxxxxxx>:
>
> On Tue, May 27, 2008 at 6:31 PM, T Ziomek wrote:
> > On Tue, May 27, 2008 at 06:27:58PM -0400, Mike Frysinger wrote:
> >> On Tue, May 27, 2008 at 5:57 PM, David VomLehn wrote:
> >> > Continuous Logging for Watchdog Timer Expiration
> >> > ------------------------------------------------
> >> > We run with a watchdog timer that can reboot the system. When we reboot, we
> >> > lose all of our status, making it very difficult to determine what went
> >> > wrong. Fortunately, there is only one major cause of not refreshing the
> >> > watchdog--a driver disabled interrupts for so long that the timer function
> >> > that resets the watchdog timer never had a chance to run. So, a way to log
> >> > what functions were enable and disabling interrupts on a continuous basis,
> >> > along with a memory section that wouldn't be overwritten on reboot, would
> >> > allow capturing the cause for these otherwise "silent" reboots.
> >>
> >> how do you propose addressing that ? hardware watchdogs reset the
> >> hardware, so there is nothing software can do to recover information
> >> like register state. as soon as the watchdog timer expires, the state
> >> is gone forever.
> >> -mike
> >
> > If I understand correctly David is talking about logging some trace-like
> > info (so it exists before a HW watchdog expires), and having it somewhere
> > "safe" from being disturbed by a HW reset.
>
> in my mind, such a thing isnt really bound to the watchdog. while the
> watchdog may be a common source, there are plenty of other sources
> which would be addressed the same way. call it something like
> "Continuous Logging for Unexpected Resets".

Probably no surprise to anyone, but this isn't just a requirement for something like a set-top box; I've also seen this specific set of features needed in the carrier space.
When you have machines out in the field, largely unattended, it's important to know what was happening prior to a reset / panic / oops / whatever, and usually it's not acceptable to leave the machine dead until an operator can go check on it.

One solution I've seen used (mentioned elsewhere in the thread) is pramfs, but that's dependent on the behaviour of both hardware and software outside the control of the kernel, and if your problem happens to be somewhere close to VFS, it's possible the piece you need to do your logging is currently on fire. That, and pramfs seems to be largely untouched these days, but since I haven't used it recently, I don't know if the lack of activity is due to neglect or stability.

Either way, the solution I've seen is to use a low-level interface to some persistent storage on the board somewhere that provides a userspace interface but is mainly focused on doing 'flight recorder' type activities: capturing kernel logs, IRQ tracing, recording scheduler decisions, and so on.

I don't know if anyone here is involved in the carrier space or not, but it sounds like at least this feature is the same as (or very similar to) the one we're discussing as a potential new requirement for CGL v5.

--
-Joe.