Hmm. This seems to have gotten bounced by a bad smtp setup here
locally. Sorry if you get it twice..

                Linus

On Sat, Nov 20, 2010 at 8:04 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Nov 19, 2010 at 11:11 PM, huang ying
> <huang.ying.caritas@xxxxxxxxx> wrote:
>> On Sat, Nov 20, 2010 at 10:15 AM, Linus Torvalds
>>> Bah. Many machine checks _were_ software errors. They were things like
>>> the BIOS not clearing some old pending state etc.
>>
>> I think the BIOS error should be reported to hardware vendor instead
>> of software vendor. Do you think so?
>
> They won't care. The only people who care are _us_. Software people.
> We may be able to work around a broken BIOS.
>
> Also, sometimes the machine checks are really our fault. Read the
> Intel documentation on page tables etc, it says that you can get
> machine checks if you have inconsistent page attributes. Or maybe that
> was AMD.
>
> The point is, it's simply not _true_ that hardware errors are always a
> hardware bug. It never has been.
>
> And it's not _true_ that people care about them the same way. The only
> thing that is true is that a sysadmin wants to see them, but he wants
> to see them _exactly_ the same way he wants to see a kernel oops etc.
>
>>> IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
>>
>> Because some external cause like cosmic rays and electromagnetic
>> interference can cause hardware errors too. We need error counting to
>> distinguish between externally caused hardware errors and real hardware
>> errors.
>
> Do you really think that a system administrator is too stupid to count
> to three?
>
> Yes, admittedly I've met some people like that. But no, "cosmic rays"
> do not change anything.
>
> People have had this for _ages_ with simple parity-protected RAM (with
> ECC just being another fancier form of it). People _know_.
>
> If you get an ECC report randomly once a month per machine, you know
> it's something like cosmic rays.
>
> And if you notice that _one_ of your machines gets five ECC errors per
> minute, you know it's something else. As an MIS person you might still
> decide to keep the dang thing, because it's just the print server for
> the admin people, and you know that your paycheck is handled by another
> machine. But if it's the Quake server, you realize that it needs to be
> replaced _today_.
>
> See? That's not the kind of rational decision that some automated
> program can make.
>
> It really is that simple. No amount of "automatic counting" will ever
> help you. Quite the reverse. It will just complicate the thing.
>
>> So, do you agree that we need some tool oriented interface in addition
>> to printk?
>
> No. Any such tool will just _hide_ the information from the MIS people
> who don't even know about it.
>
> But you could certainly make a simple agreed-upon format. We have BUG:
> and WARNING: in the kernel logs. Why not HWPROBLEM: or something?
>
> MIS people love their perl scripts. And the people who can't do perl
> can still use the standard log tools.
>
>                Linus
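
To make the "agreed-upon format" suggestion concrete, here is a rough
sketch of the kind of throwaway script an admin could point at the
logs. It is written in Python rather than perl, and everything about
it is an assumption: "HWPROBLEM:" is only the prefix floated above (no
kernel emits it), the syslog line layout is a guess at a common format,
and the sample lines and host names are invented for illustration. The
script only tallies lines per host; deciding whether a machine should
be replaced is left to the person reading the numbers.

```python
import re
from collections import Counter

# Assumed syslog-style layout: "<month> <day> <time> <host> kernel: <message>",
# optionally with a "[timestamp]" in front of the message.
# "HWPROBLEM:" is the hypothetical prefix from the mail, not a real kernel tag.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?:\[[^\]]*\]\s*)?HWPROBLEM:\s*(?P<msg>.*)$"
)


def count_hw_problems(lines):
    """Return a Counter mapping host name to its number of HWPROBLEM: lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            counts[match.group("host")] += 1
    return counts


# Invented sample input, only here so the sketch runs on its own.
SAMPLE_LOG = [
    "Nov 20 08:04:01 printsrv kernel: HWPROBLEM: corrected ECC error",
    "Nov 20 08:04:02 quake kernel: [ 1234.567890] HWPROBLEM: corrected ECC error",
    "Nov 20 08:04:03 quake kernel: BUG: unable to handle kernel NULL pointer dereference",
    "Nov 20 08:04:09 quake kernel: HWPROBLEM: corrected ECC error",
]

if __name__ == "__main__":
    # Print hosts from most to least reports; a human decides what to do next.
    for host, total in count_hw_problems(SAMPLE_LOG).most_common():
        print(f"{host}: {total}")
```

Because the reports stay ordinary log lines, the same information is
still visible to anyone who just greps the log and never runs a script
like this.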