On 2023/8/31 17:06, Helge Deller wrote: > On 8/31/23 05:29, Shuai Xue wrote: >> On 2023/8/31 06:18, Will Deacon wrote: >>> On Mon, Aug 28, 2023 at 09:41:55AM +0800, Shuai Xue wrote: >>>> On 2023/8/22 09:15, Shuai Xue wrote: >>>>> On 2023/8/21 18:50, Will Deacon wrote: >>>>>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c >>>>>>> index 3fe516b32577..38e2186882bd 100644 >>>>>>> --- a/arch/arm64/mm/fault.c >>>>>>> +++ b/arch/arm64/mm/fault.c >>>>>>> @@ -679,6 +679,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, >>>>>>> } else if (fault & (VM_FAULT_HWPOISON_LARGE | VM_FAULT_HWPOISON)) { >>>>>>> unsigned int lsb; >>>>>>> >>>>>>> + pr_err("MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n", >>>>>>> + current->comm, current->pid, far); >>>>>>> lsb = PAGE_SHIFT; >>>>>>> if (fault & VM_FAULT_HWPOISON_LARGE) >>>>>>> lsb = hstate_index_to_shift(VM_FAULT_GET_HINDEX(fault)); >>>>>> >>>>>> Hmm, I'm not convinced by this. We have 'show_unhandled_signals' already, >>>>>> and there's plenty of code in memory-failure.c for handling poisoned pages >>>>>> reported by e.g. GHES. I don't think dumping extra messages in dmesg from >>>>>> the arch code really adds anything. >>>>> >>>>> I see the show_unhandled_signals() will dump the stack but it rely on >>>>> /proc/sys/debug/exception-trace be set. >>>>> >>>>> The memory failure is the top issue in our production cloud and also other hyperscalers. >>>>> We have received complaints from our operations engineers and end users that processes >>>>> are being inexplicably killed :(. Could you please consider add a message? >>> >>> I don't have any objection to logging this stuff somehow, I'm just not >>> convinced that the console is the best place for that information in 2023. >>> Is there really nothing better? > >> I agree that console might not the better place, but it still plays an important role. >> IMO the most direct idea for end user to check what happened is to check by viewing >> the dmesg. In addition, we deployed some log store service collects all cluster dmesg >> from /var/log/kern. > > Right, pr_err() is not just console. > It ends up in the syslog, which ends up in a lot of places, e.g. through syslog forwarding. > Most monitoring tools monitor the syslog as well. > > So, IMHO pr_err() is the right thing. > > Helge > Totally agreed. Thank you. Best Regards, Shuai