On Wed, 14 Aug 2013 07:43:22 +0200,
Borislav Petkov <bp@xxxxxxxxx> wrote:

> On Tue, Aug 13, 2013 at 08:13:56PM +0000, Luck, Tony wrote:
> > Generic tracepoints are architected to be able to fire at very high
> > rates and log huge amounts of information. So we'd need something
> > special to say just log these special tracepoints to network/serial.
> >
> > > Which reminds me, pstore could also be a good thing to use, in addition.
> > > Only put error info there as it is limited anyway.
> >
> > Yes - space is very limited. I don't know how to assign priority for logging
> > the dmesg data vs. some error logs.
>
> Didn't we say at some point, "log only the panic message which kills
> the machine"?

EDAC core allows that kind of thing, and can even panic when errors arrive:

$ modinfo edac_core
filename:       /lib/modules/3.10.5-201.fc19.x86_64/kernel/drivers/edac/edac_core.ko
...
parm:           edac_pci_panic_on_pe:Panic on PCI Bus Parity error: 0=off 1=on (int)
parm:           edac_mc_panic_on_ue:Panic on uncorrected error: 0=off 1=on (int)
parm:           edac_mc_log_ue:Log uncorrectable error to console: 0=off 1=on (int)
parm:           edac_mc_log_ce:Log correctable error to console: 0=off 1=on (int)

Those have 0644 permissions, so they can be changed at runtime.

Of course, there is room for improvement.

> However, we probably could use more of the messages before that
> catastrophic event because they could give us hints about what led to
> the panic but in that case maybe a limited pstore is the wrong logging
> medium.
>
> Actually, I can imagine the full serial/network logs of "special"
> tracepoints + dmesg to be the optimal thing.
>
> > If we just "printk()" the most important parts - then that data will
> > automatically flow to the serial console and to pstore.
>
> Actually, does the pstore act like a circular buffer? Because if it
> contains the last N relevant messages (for an arbitrary definition of
> relevant) before the system dies, then that could be more helpful than only
> the error messages.
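
Coming back to the edac_core parameters listed earlier: a minimal sketch of
reading and flipping them at runtime, assuming the module is loaded under the
name edac_core and that its parameters show up in the usual
/sys/module/<module>/parameters/ directory. The helper names (edac_get,
edac_set) and the PARAM_DIR override are hypothetical, made up for this
illustration only.

```shell
#!/bin/sh
# Assumption: edac_core exposes its parameters under the standard sysfs path.
# PARAM_DIR can be overridden to try the helpers against a scratch directory.
PARAM_DIR="${PARAM_DIR:-/sys/module/edac_core/parameters}"

# Print the current value of one parameter, e.g. edac_mc_panic_on_ue.
edac_get() {
    cat "$PARAM_DIR/$1"
}

# Set one parameter to the given value (0=off, 1=on for the knobs above).
# Refuses to write if the parameter file is missing or not writable,
# which on a real system usually means "not root" or "module not loaded".
edac_set() {
    if [ ! -w "$PARAM_DIR/$1" ]; then
        echo "cannot write $PARAM_DIR/$1" >&2
        return 1
    fi
    printf '%s\n' "$2" > "$PARAM_DIR/$1"
}
```

Run as root, something like "edac_set edac_mc_panic_on_ue 1" followed by
"edac_get edac_mc_panic_on_ue" would enable panic on uncorrected errors and
read the setting back. The change does not survive a reboot or a module
reload.
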
>
> And with the advent of UEFI, pretty much every system has a pstore. Too
> bad that we have to limit it to 50% of size so that the boxes don't
> brick. :-P
>
> > Then we have multiple paths for the critical bits of the error log
> > - and the tracepoints give us more details for the cases where the
> > machine doesn't spontaneously explode.
>
> Ok, let's sort:
>
> * First we have the not-so-critical hw error messages. We want to carry
> those out-of-band, i.e. not in dmesg so that people don't have to parse
> and collect dmesg but have a specialized solution which gives them
> structured logs and tools can analyze, collect and ... those errors.
>
> * When a critical error happens, the above usage is not necessarily
> advantageous anymore in the sense that, in order to debug what caused
> the machine to crash, we don't necessarily want only the crash
> message but also the whole system activity that led to it.
>
> In which case, we probably actually want to turn off/ignore the error
> logging tracepoints and write *only* to dmesg which goes out over serial
> and to pstore. Right?
>
> Because in such cases I want to have *all* *relevant* messages that led
> to the explosion + the explosion message itself.
>
> Makes sense? Yes, no? Aspects I've missed?

Makes sense to me.

>
> Thanks.

--

Cheers,
Mauro