On Mon, Jan 27, 2025 at 10:39:44AM -0800, Guenter Roeck wrote: > On 1/27/25 10:30, Paul Fertser wrote: > > Hi Guenter, > > > > On Mon, Jan 27, 2025 at 09:29:39AM -0800, Guenter Roeck wrote: > > > On 1/27/25 08:40, Winiarska, Iwona wrote: > > > > On Thu, 2025-01-23 at 15:20 +0300, Paul Fertser wrote: > > > > > When an Icelake or Sapphire Rapids CPU isn't providing the maximum and > > > > > critical thresholds for particular DIMM the driver should return an > > > > > error to the userspace instead of giving it stale (best case) or wrong > > > > > (the structure contains all zeros after kzalloc() call) data. > > > > > > > > > > The issue can be reproduced by binding the peci driver while the host is > > > > > fully booted and idle, this makes PECI interaction unreliable enough. > > > > > > > > > > Fixes: 73bc1b885dae ("hwmon: peci: Add dimmtemp driver") > > > > > Fixes: 621995b6d795 ("hwmon: (peci/dimmtemp) Add Sapphire Rapids support") > > > > > Cc: stable@xxxxxxxxxxxxxxx > > > > > Signed-off-by: Paul Fertser <fercerpav@xxxxxxxxx> > > > > > > > > Hi! > > > > > > > > Thank you for the patch. > > > > Did you have a chance to test it with OpenBMC dbus-sensors? > > > > In general, the change looks okay to me, but since it modifies the behavior > > > > (applications will need to handle this, and returning an error will happen more > > > > often) we need to confirm that it does not cause any regressions for userspace. > > > > > > > > > > I would also like to understand if the error is temporary or permanent. > > > If it is permanent, the attributes should not be created in the first > > > place. It does not make sense to have limit attributes which always report > > > -ENODATA. > > > > The error is temporary. The underlying reason is that when host CPUs > > go to deep enough idle sleep state (probably C6) they stop responding > > to PECI requests from BMC. Once something starts running the CPU > > leaves C6 and starts responding and all the temperature data > > (including the thresholds) becomes available again. > > > > Thanks. > > Next question: Is there evidence that the thresholds change while the CPU > is in a deep sleep state (or, in other words, that they are indeed stale) ? > Because if not it would be (much) better to only report -ENODATA if the > thresholds are uninitialized, and it would be even better than that if the > limits are read during initialization (and not updated at all) if they do > not change dynamically.