On 8/21/23 2:01 PM, Rafael J. Wysocki wrote: > On Mon, Aug 21, 2023 at 7:52 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote: >> >> On Mon, Aug 21, 2023 at 7:35 PM Limonciello, Mario >> <mario.limonciello@xxxxxxx> wrote: >>> >>> >>> >>> On 8/21/2023 12:29 PM, Rafael J. Wysocki wrote: >>>> On Mon, Aug 21, 2023 at 7:17 PM Limonciello, Mario >>>> <mario.limonciello@xxxxxxx> wrote: >>>>> >>>>> On 8/21/2023 12:12 PM, Rafael J. Wysocki wrote: >>>>> <snip> >>>>>>> I was just talking to some colleagues about PHAT recently as well. >>>>>>> >>>>>>> The use case that jumps out is "system randomly rebooted while I was >>>>>>> doing XYZ". You don't know what happened, but you keep using your >>>>>>> system. Then it happens again. >>>>>>> >>>>>>> If the reason for the random reboot is captured to dmesg you can cross >>>>>>> reference your journal from the next boot after any random reboot and >>>>>>> get the reason for it. If a user reports this to a Gitlab issue tracker >>>>>>> or Bugzilla it can be helpful in establishing a pattern. >>>>>>> >>>>>>>>> The below location may be appropriate in that case: >>>>>>>>> /sys/firmware/acpi/ >>>>>>>> >>>>>>>> Yes, it may. > >>>>>>>>> We already have FPDT and BGRT being exported from there. >>>>>>>> >>>>>>>> In fact, all of the ACPI tables can be retrieved verbatim from >>>>>>>> /sys/firmware/acpi/tables/ already, so why exactly do you want the >>>>>>>> kernel to parse PHAT in particular? >>>>>>>> >>>>>>> >>>>>>> It's not to say that /sys/firmware/acpi/PHAT isn't useful, but having >>>>>>> something internal to the kernel "automatically" parsing it and saving >>>>>>> information to a place like the kernel log that is already captured by >>>>>>> existing userspace tools I think is "more" useful. >>>>>> >>>>>> What existing user space tools do you mean? Is there anything already >>>>>> making use of the kernel's PHAT output? >>>>>> >>>>> >>>>> I was meaning things like systemd already capture the kernel long >>>>> ringbuffer. If you save stuff like this into the kernel log, it's going >>>>> to be indexed and easier to grep for boots that had it. >>>>> >>>>>> And why can't user space simply parse PHAT by itself? >>>>>> > There are multiple ACPI tables that could be dumped into the kernel >>>>>> log, but they aren't. Guess why. >>>>> >>>>> Right; there's not reason it can't be done by userspace directly. >>>>> >>>>> Another way to approach this problem could be to modify tools that >>>>> excavate records from a reboot to also get PHAT. For example >>>>> systemd-pstore will get any kernel panics from the previous boot from >>>>> the EFI pstore and put them into /var/lib/systemd/pstore. >>>>> >>>>> No reason that couldn't be done automatically for PHAT too. >>>> >>>> I'm not sure about the connection between the PHAT dump in the kernel >>>> log and pstore. >>>> >>>> The PHAT dump would be from the time before the failure, so it is >>>> unclear to me how useful it can be for diagnosing it. However, after >>>> a reboot one should be able to retrieve PHAT data from the table >>>> directly and that may include some information regarding the failure. >>> >>> Right so the thought is that at bootup you get the last entry from PHAT >>> and save that into the log. >>> >>> Let's say you have 3 boots: >>> X - Triggered a random reboot >>> Y - Cleanly shut down >>> Z - Boot after a clean shut down >>> >>> So on boot Y you would have in your logs the reason that boot X rebooted. >> >> Yes, and the same can be retrieved from the PHAT directly from user >> space at that time, can't it? >> >>> On boot Z you would see something about how boot Y's reason. >>> >>>> >>>> With pstore, the assumption is that there will be some information >>>> relevant for diagnosing the failure in the kernel buffer, but I'm not >>>> sure how the PHAT dump from before the failure can help here? >>> >>> Alone it's not useful. >>> I had figured if you can put it together with other data it's useful. >>> For example if you had some thermal data in the logs showing which >>> component overheated or if you looked at pstore and found a NULL pointer >>> dereference. >> >> IIUC, the current PHAT content can be useful. The PHAT content from >> boot X (before the failure) which is what will be there in pstore >> after the random reboot, is of limited value AFAICS. > > To be more precise, I don't see why the kernel needs to be made a > man-in-the-middle between the firmware which is the source of the > information and user space that consumes it. I think that's a fair point. Is there a preferred set of tools that can be updated? If not, would it make sense to develop a set of common kernel tools for this? In my experience, it seems many folks use tools from their vendors or custom tools. Thanks, Yazen