Re: [PATCH 0/2] Generic hardware error reporting support

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 20 Nov 2010 15:57:40 -0800

Hmm. This seems to have gotten bounced by a bad smtp setup here
locally. Sorry if you get it twice..

            Linus

On Sat, Nov 20, 2010 at 8:04 AM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Nov 19, 2010 at 11:11 PM, huang ying
> <huang.ying.caritas@xxxxxxxxx> wrote:
>> On Sat, Nov 20, 2010 at 10:15 AM, Linus Torvalds
>>> Bah. Many machine checks _were_ software errors. They were things like
>>> the BIOS not clearing some old pending state etc.
>>
>> I think the BIOS error should be reported to hardware vendor instead
>> of software vendor. Do you think so?
>
> They won't care. The only people who care are _us_. Software people.
> We may be able to work around a broken BIOS.
>
> Also, sometimes the machine checks are really our fault. Read the
> Intel documentation on page tables etc, it says that you can get
> machine checks if you inconsistent page attributes. Or maybe that was
> AMD.
>
> The point is, it's simply not _true_ that hardware errors are always a
> hardware bug. It never has been.
>
> And it's not _true_ that people care about them the same way. The only
> thing that is true is that a sysadmin wants to see them, but he wants
> to see them _exactly_ the same way he wants to see a kernel oops etc.
>
>>> IT HAS NOTHING WHAT-SO-EVER TO DO WITH HOW OFTEN THE ERRORS HAPPEN.
>>
>> Because some external cause like cosmic rays and electromagnetic
>> interference can cause hardware errors too. We need error counting to
>> distinguish between external caused hardware errors and real hardware
>> errors.
>
> Do you really think that a system administrator is too stupid to count to three?
>
> Yes, admittedly I've met some people like that. But no, "cosmic rays"
> do not change anything.
>
> People have had this for _ages_ with simple parity-protected RAM (with
> ECC just being another fancier form of it). People _know_.
>
> If you get an ECC report randomly once a month per machine, you know
> it's something like cosmic rays.
>
> And if you notice that _one_ of your machines gets five ECC errors per
> minute, you know it's something else. As an MIS person you might still
> decide keep the dang thing, because it's just the print server for the
> admin people, and you know that your paycheck is handled by another
> machine. But if it's the Quake server, you realize that it needs to be
> replaced _today_.
>
> See? That's not the kind of rational decision that some automated
> program can make.
>
> It really is that simple. No amount of "automatic counting" will ever
> help you. Quite the reverse. It will just complicate the thing.
>
>> So, do you agree that we need some tool oriented interface in addition
>> to printk?
>
> No. Any such tool will just _hide_ the information from the MIS people
> who don't even know about it.
>
> But you could certainly make a simple agreed-upon format. We have BUG:
> and WARNING: in the kernel logs. Why not HWPROBLEM: or something?
>
> MIS people love their perl scripts. And the people who can't do perl
> can still use the standard log tools.
>
>                               Linus
>
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html