David VomLehn <dvomlehn@xxxxxxxxx> writes:

> On Tue, Jun 02, 2009 at 03:37:44PM -0500, James Bottomley wrote:
> ...
>> This is what made us suggest the presentation driven approach. We can
>> send out people who understand the kernel development process,
>> anointed as embedded maintainers. However, looking at the arch
>> directory, you have a ton of new kids on the block. We wondered if,
>> perhaps, rather than having seasoned kernel developers reach out to the
>> embedded community, we might try giving the embedded community the
>> opportunity to reach out to us. The topic of "flattened device tree"
>> looks interesting to me (perhaps because I'm a hardened device driver
>> person and things like that always look interesting to me) ... if we can
>> get a few more like that out of the woodwork, this approach might end up
>> being successful.
>
> Failure reporting is the one area where embedded applications have
> little overlap with other Linux application domains. The cable settop box
> environment has:
>   o Limited persistent storage
>   o Low or no upstream bandwidth
>   o Little access to hundreds of thousands of devices in the field
>
> When a kernel panics in the field, we have no place to put a core dump
> and, if we had a place to put it, it would take way too long to upload
> it when the box comes back up. And most people just don't understand when
> you knock at their door at midnight, JTAG probe in hand.
>
> We hook in a panic notifier and have it generate a really rich report.
> At present, this report stays in memory until we reboot and send it
> upstream (or write it to flash), but we could really write it to any
> device with which we can use polled I/O (interrupts being questionable
> at this point). Generic interfaces to support this would be useful.
>
> Many embedded devices have highly integrated stacks, so failures in user
> space lead to device reboots, and you want to leverage much of the same
> ability to store and send failure reports.
>
> Our failure report includes things you'd expect as well as various pieces
> of history, such as:
>   o IRQs
>   o softirq dispatches (including max times)
>   o selected /proc info, e.g. /proc/meminfo
>
> We also report info on the current thread, like backtracing and
> /proc/<pid>/maps, though I'm not sure it's as useful as it might be.
>
> Though I'm working on pushing this stuff out, other things that might be
> helpful are:
>   o If you get to panic() by way of die(), you've lost the registers passed to
>     die(). We save a pointer off, but it's really a kludge.
>   o The implementation of die() varies from platform to platform and isn't even
>     called die() everywhere.
>   o It is truly nasty trying to get /proc information when you are in a panic
>     situation--any semaphores being held are not going to be released, so you
>     have to duplicate a lot of the code, minus the semaphores. Pretty gross
>     and there is no way our implementation will be acceptable.
>   o Increased reporting on what's happening in user/kernel space interaction.
>     For example, a signal sent in good faith might kill a buggy process. It
>     would be helpful to log signals that result in a process' death.
>   o Then there is more speculative stuff. For example, your caches would
>     have a copy of the most recently accessed code and data. If your
>     processor supports dumping cache, it might help determine what went wrong.

Have you looked at doing this with the kexec on panic infrastructure?
Things like makedumpfile can now have enough information to dump this.

If you are space constrained, a standalone executable could be used
instead of a Linux kernel to marshal the information into your buffer.

Eric
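
To make the "marshal the information into your buffer" idea a bit more
concrete, here is a rough user-space sketch of a fixed-size report record
that can be checked for validity after a reboot. It assumes the report
lives in a preallocated region and is written without locks or allocation.
Everything named here (struct failure_report, the report_* helpers, the
magic value, the capacity) is invented for illustration; none of it is part
of kexec or of David's implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical marker and size, chosen only for this sketch. */
#define REPORT_MAGIC    0x50414e43u
#define REPORT_CAPACITY 256

/* Fixed-size record: no pointers, so it can sit in a reserved memory region. */
struct failure_report {
    uint32_t magic;     /* set last, once the report is complete */
    uint32_t length;    /* bytes used in data[] */
    uint32_t checksum;  /* additive checksum over data[0..length) */
    char     data[REPORT_CAPACITY];
};

/* Clear the record so a stale report is never mistaken for a new one. */
static void report_init(struct failure_report *r)
{
    memset(r, 0, sizeof(*r));
}

/* Append text, truncating when the buffer is full; returns bytes copied. */
static size_t report_append(struct failure_report *r, const char *text)
{
    size_t room = REPORT_CAPACITY - r->length;
    size_t n = strlen(text);

    if (n > room)
        n = room;
    memcpy(r->data + r->length, text, n);
    r->length += (uint32_t)n;
    return n;
}

/* Sum the used bytes; deliberately simple, not a strong integrity check. */
static uint32_t report_sum(const struct failure_report *r)
{
    uint32_t sum = 0;

    for (uint32_t i = 0; i < r->length; i++)
        sum += (unsigned char)r->data[i];
    return sum;
}

/* Finish the report: checksum first, marker last. */
static void report_seal(struct failure_report *r)
{
    r->checksum = report_sum(r);
    r->magic = REPORT_MAGIC;
}

/* After reboot: accept the record only if marker, length and checksum agree. */
static int report_valid(const struct failure_report *r)
{
    return r->magic == REPORT_MAGIC &&
           r->length <= REPORT_CAPACITY &&
           r->checksum == report_sum(r);
}
```

The marker is written last so that a report interrupted halfway through is
rejected rather than uploaded. A real panic path would also have to deal
with cache flushing and with memory that may not survive the reboot, which
this sketch ignores.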