On Tue, Jun 02, 2009 at 03:37:44PM -0500, James Bottomley wrote:
...
> This is what made us suggest the presentation driven approach.  We can
> send people who understand how the kernel development process out
> anointed as embedded maintainers.  However, looking at the arch
> directory, you have a ton of new kids on the block.  We wondered if,
> perhaps, rather than having seasoned kernel developers reach out to the
> embedded community, we might try giving the embedded community the
> opportunity to reach out to us.  The topic of "flattened device tree"
> look interesting to me (perhaps because I'm a hardened device driver
> person and things like that always look interesting to me) ... if we can
> get a few more like that out of the woodwork, this approach might end up
> being successful.

Failure reporting is the one area where embedded applications have
little overlap with other Linux application domains. The cable set-top
box environment has:

o	Limited persistent storage
o	Low or no upstream bandwidth
o	Little access to hundreds of thousands of devices in the field

When a kernel panics in the field, we have no place to put a core dump
and, even if we had a place to put it, it would take far too long to
upload it when the box comes back up. And most people just don't
understand when you knock at their door at midnight, JTAG probe in hand.

We hook in a panic notifier and have it generate a really rich report.
At present, this report stays in memory until we reboot and send it
upstream (or write it to flash), but we could really write it to any
device that we can drive with polled I/O (interrupts being questionable
at this point). Generic interfaces to support this would be useful.

Many embedded devices have highly integrated stacks, so failures in
user space lead to device reboots, and you want to leverage much of the
same ability to store and send failure reports.
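
To make the "report stays in memory until we reboot" idea concrete,
here is a minimal userspace sketch in C of one way such a record could
be laid out so that boot-time code can tell a valid report from stale
or corrupted memory. It is not the implementation described above. The
struct, the function names, the magic value, the 4096-byte limit and
the use of an FNV-1a checksum are all assumptions made for
illustration. A real panic path would also need the memory region to
survive the reboot and would need appropriate barriers and cache
flushes, none of which is shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical values chosen for this sketch only. */
#define REPORT_MAGIC 0x46524550u
#define REPORT_MAX   4096u

/* Fixed-size record: nothing is allocated on the failure path. */
struct failure_report {
	uint32_t magic;      /* set last, so a half-written report is ignored */
	uint32_t len;        /* number of valid bytes in body[] */
	uint32_t checksum;   /* FNV-1a over body[0..len) */
	char     body[REPORT_MAX];
};

/* 32-bit FNV-1a hash, used here as a cheap integrity check. */
static uint32_t report_checksum(const char *data, uint32_t len)
{
	uint32_t h = 2166136261u;
	uint32_t i;

	for (i = 0; i < len; i++) {
		h ^= (uint8_t)data[i];
		h *= 16777619u;
	}
	return h;
}

/* Failure path: bounded copy, no locks, no allocation. */
static void report_store(struct failure_report *r, const char *text)
{
	size_t n = strlen(text);

	if (n > REPORT_MAX)
		n = REPORT_MAX;          /* truncate rather than overflow */
	r->magic = 0;                    /* invalidate while writing */
	memcpy(r->body, text, n);
	r->len = (uint32_t)n;
	r->checksum = report_checksum(r->body, r->len);
	r->magic = REPORT_MAGIC;         /* publish */
}

/*
 * Boot path: copy a valid report into out as a NUL-terminated string.
 * Returns the number of bytes copied, or -1 if no valid report exists.
 */
static int report_load(const struct failure_report *r, char *out,
		       size_t out_size)
{
	size_t n;

	if (out_size == 0)
		return -1;
	if (r->magic != REPORT_MAGIC || r->len > REPORT_MAX)
		return -1;
	if (report_checksum(r->body, r->len) != r->checksum)
		return -1;
	n = r->len < out_size - 1 ? r->len : out_size - 1;
	memcpy(out, r->body, n);
	out[n] = '\0';
	return (int)n;
}
```

The failure path would call report_store() once with the formatted
report text. After the reboot, the upload (or flash-write) code would
call report_load() and skip the upload when it returns -1. Writing the
magic word last means a panic that dies halfway through report_store()
leaves a record the boot path will reject instead of uploading garbage.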

Our failure report includes things you'd expect as well as various
pieces of history, such as:

o	IRQs
o	softirq dispatches (including max times)
o	selected /proc info, e.g. /proc/meminfo

We also report info on the current thread, like a backtrace and
/proc/<pid>/maps, though I'm not sure it's as useful as it might be.

Though I'm working on pushing this stuff out, other things that might be
helpful are:

o	If you get to panic() by way of die(), you've lost the registers
	passed to die(). We save a pointer off, but it's really a kludge.
o	The implementation of die() varies from platform to platform and
	isn't even called die() everywhere.
o	It is truly nasty trying to get /proc information when you are
	in a panic situation. Any semaphores being held are not going to
	be released, so you have to duplicate a lot of the code, minus
	the semaphores. Pretty gross, and there is no way our
	implementation will be acceptable.
o	Increased reporting on what's happening in user/kernel space
	interaction. For example, a signal sent in good faith might kill
	a buggy process. It would be helpful to log signals that result
	in a process's death.
o	Then there is more speculative stuff. For example, your caches
	would have a copy of the most recently accessed code and data.
	If your processor supports dumping cache, it might help
	determine what went wrong.

> James
--
To unsubscribe from this list: send the line "unsubscribe linux-embedded" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html