Re: [RFC PATCH 3/4] acpi: apei: Do not panic() in NMI because of GHES messages

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Alex,

On 04/16/2018 10:59 PM, Alex G. wrote:
> On 04/13/2018 11:38 AM, James Morse wrote:
>> This assumes a cache-invalidate will clear the error, which I don't think we're
>> guaranteed on arm.
>> It also destroys any adjacent data, "everyone's happy" includes the thread that >> got a chunk of someone-else's stack frame, I don't think it will be happy for
>> very long!
>
> Hmm, no cache-line (or page) invalidation on arm64? How does
> dma_map/unmap_*() work then? You may not guarantee to fix the error, but

There are cache-invalidate instructions, but I don't think 'solving' a RAS error with them is the right thing to do.


> I don't buy into the "let's crash without trying" argument.

Our 'cache writeback granule' may be as large as 2K, so we may have to invalidate up to 2K of data to convince the hardware this address is okay again.

All we've done here is differently-corrupt the data so that it no longer generates a RAS fault, it just gives you the wrong data instead. Cache-invalidation is destructive.

I don't think there is a one-size-fits-all solution here.


>> (this is a side issue for AER though)
>
> Somebody muddled up AER with these tables, so we now have to worry about
> it. :)

Eh? I see there is a v2, maybe I'll understand this comment once I read it.


>>> How does FFS handle race conditions that can occur when accessing HW
>>> concurrently with the OS? I'm told it's the main reasons why BIOS
>>> doesn't release unused cores from SMM early.
>>
>> This is firmware's problem, it depends on whether there is any hardware that is >> shared with the OS. Some hardware can be marked 'secure' in which case only >> firmware can access it, alternatively firmware can trap or just disable the OS's
>> access to the shared hardware.
>
> It's everyone's problem. It's the firmware's responsibility.

It depends on the SoC design. If there is no hardware that the OS and firmware both need to access to handle an error then I don't think firmware needs to do this.


>> For example, with the v8.2 RAS Extensions, there are some per-cpu error
>> registers. Firmware can disable these for the OS, so that it always reads 0 from
>> them. Instead firmware takes the error via FF, reads the registers from
>> firmware, and dumps CPER records into the OS's memory.
>>
>> If there is a shared hardware resource that both the OS and firmware may be >> accessing, yes firmware needs to pull the other CPUs in, but this depends on the
>> SoC design, it doesn't necessarily happen.
>
> The problem with shared resources is just a problem. I've seen systems
> where all 100 cores are held up for 300+ ms. In latency-critical
> applications reliability drops exponentially. Am I correct in assuming
> your answer would be to "hide" more stuff from the OS?

No, I'm not a fan of firmware cycle stealing. If you can design the SoC or
firmware so that the 'all CPUs' stuff doesn't need to happen, then you won't get these issues. (I don't design these things, I'm sure they're much more complicated
than I think!)

Because the firmware is SoC-specific, so it only needs to do exactly what is necessary.


>>> I think the idea of firmware-first is broken. But it's there, it's
>>> shipping in FW, so we have to accommodate it in SW.
>>
>> Part of our different-views here is firmware-first is taking something away from
>> you, whereas for me its giving me information that would otherwise be in
>> secret-soc-specific registers.
>
> Under this interpretation, FFS is a band-aid to the problem of "secret"
> registers. "Secret" hardware doesn't really fit well into the idea of an
> OS [1].

Sorry, I'm being sloppy with my terminology, by secret-soc-specific I mean either Linux can't access them (firmware privilege-level only) or Linux can't reasonably know where these registers are, as they're soc-specific and vary by manufacture.


>>> And linux can handle a wide subset of MCEs just fine, so the
>>> ghes_is_deferrable() logic would, under my argument, agree to pass
>>> execution to the actual handlers.
>>
>> For some classes of error we can't safely get there.
>
> Optimize for the common case.

At the expense of reliability?


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux IBM ACPI]     [Linux Power Management]     [Linux Kernel]     [Linux Laptop]     [Kernel Newbies]     [Share Photos]     [Security]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux