On 2024-05-29 10:58:23+0000, Stephen Horvath wrote:
> On 29/5/24 09:29, Guenter Roeck wrote:
> > On 5/28/24 09:15, Thomas Weißschuh wrote:
> > > On 2024-05-28 08:50:49+0000, Guenter Roeck wrote:
> > > > On 5/27/24 17:15, Stephen Horvath wrote:
> > > > > On 28/5/24 05:24, Thomas Weißschuh wrote:
> > > > > > On 2024-05-25 09:13:09+0000, Stephen Horvath wrote:
> > > > > > > Don't forget it can also return `EC_FAN_SPEED_STALLED`.
<snip>
> > > > > >
> > > > > > Thanks for the hint. I'll need to think about how to
> > > > > > handle this better.
> > > > > >
> > > > > > > Like Guenter, I also don't like returning `-ENODEV`, but I
> > > > > > > don't have a problem with checking for
> > > > > > > `EC_FAN_SPEED_NOT_PRESENT` in case it was removed since init
> > > > > > > or something.
> > > > > > >
> > > >
> > > > That won't happen. Chromebooks are not servers, where one might
> > > > be able to replace a fan tray while the system is running.
> > >
> > > In one of my test runs this actually happened.
> > > When running on battery, one specific CPU sensor sporadically
> > > returned EC_FAN_SPEED_NOT_PRESENT.
> > >
> >
> > What Chromebook was that? I can't see the code path in the EC source
> > that would get me there.
> >
>
> I believe Thomas and I both have the Framework 13 AMD; the source code is
> here:
> https://github.com/FrameworkComputer/EmbeddedController/tree/lotus-zephyr

Correct.

> The organisation confuses me a little, but Dustin has previously said on the
> framework forums (https://community.frame.work/t/what-ec-is-used/38574/2):
>
> "This one is based on the Zephyr port of the ChromeOS EC, and tracks
> mainline more closely. It is in the branch lotus-zephyr.
> All of the model-specific code lives in zephyr/program/lotus.
> The 13"-specific code lives in a few subdirectories off the main tree named
> azalea."
The EC code is at [0]:

$ ectool version
RO version: azalea_v3.4.113353-ec:b4c1fb,os
RW version: azalea_v3.4.113353-ec:b4c1fb,os
Firmware copy: RO
Build info: azalea_v3.4.113353-ec:b4c1fb,os:7b88e1,cmsis:4aa3ff 2024-03-26 07:10:22 lotus@ip-172-26-3-226
Tool version: 0.0.1-isolate May 6 2024 none

From the build info I gather it should be commit b4c1fb, which is the
current HEAD of the lotus-zephyr branch.
Lotus is the Framework 16 AMD, which is very similar to Azalea, the
Framework 13 AMD, which I tested this against.
Both share the same codebase.

> Also I just unplugged my fan and you are definitely correct, the EC only
> generates EC_FAN_SPEED_NOT_PRESENT for fans it does not have the capability
> to support. Even after a reboot it just returns 0 RPM for an unplugged fan.
> I thought about simulating a stall too, but I was mildly scared I was going
> to break one of the tiny blades.

I get the error when unplugging *the charger*.

To be more precise:
It does not always happen.
It does not happen instantly on unplugging.
It goes away after a few seconds/minutes.
During the issue, one specific sensor reads 0xffff.

> > > > > > Ok.
> > > > > >
> > > > > > > My approach was to return the speed as `0`, since the fan
> > > > > > > probably isn't spinning, but set HWMON_F_FAULT for
> > > > > > > `EC_FAN_SPEED_NOT_PRESENT` and HWMON_F_ALARM for
> > > > > > > `EC_FAN_SPEED_STALLED`.
> > > > > > > No idea if this is correct though.
> > > > > >
> > > > > > I'm not a fan of returning a speed of 0 in case of errors.
> > > > > > Rather -EIO, which can't be mistaken.
> > > > > > Maybe -EIO for both EC_FAN_SPEED_NOT_PRESENT (which should never
> > > > > > happen) and also for EC_FAN_SPEED_STALLED.
> > > > >
> > > > > Yeah, that's pretty reasonable.
> > > > >
> > > >
> > > > -EIO is an I/O error. I have trouble reconciling that with
> > > > EC_FAN_SPEED_NOT_PRESENT or EC_FAN_SPEED_STALLED.
> > > >
> > > > Looking into the EC source code [1], I see:
> > > >
> > > > EC_FAN_SPEED_NOT_PRESENT means that the fan is not present.
> > > > That should return -ENODEV in the above code, but only for
> > > > the purpose of making the attribute invisible.
> > > >
> > > > EC_FAN_SPEED_STALLED means exactly that, i.e., that the fan
> > > > is present but not turning. The EC code does not expect that
> > > > to happen and generates a thermal event in case it does.
> > > > Given that, it does make sense to set the fault flag.
> > > > The actual fan speed value should then be reported as 0 or
> > > > possibly -ENODATA. It should _not_ generate any other error
> > > > because that would trip up the "sensors" command for no
> > > > good reason.
> > >
> > > Ack.
> > >
> > > Currently I have the following logic (for both fans and temp):
> > >
> > > if NOT_PRESENT during probing:
> > >   make the attribute invisible.
> > >
> > > if any error during runtime (including NOT_PRESENT):
> > >   return -ENODATA and a FAULT
> > >
> > > This should also handle the sporadic NOT_PRESENT failures.
> > >
> > > What do you think?
> > >
> > > Is there any other feedback to this revision or should I send the next?
> > >
> >
> > No, except I'd really like to know which Chromebook randomly generates
> > an EC_FAN_SPEED_NOT_PRESENT response, because that really looks like a bug.
> > Also, can you reproduce the problem with the ectool command?

Yes, the ectool command reports the same issue at the same time.
The affected sensor was always cpu@4c, which has compatible = "amd,sb-tsi".

> I have a feeling it was related to the concurrency problems between ACPI and
> the CrOS code that are being fixed in another patch by Ben Walsh; I was also
> seeing some weird behaviour sometimes, but I *believe* it was fixed by that.

I don't think it's this issue.
Ben's series at [1] is for MEC ECs, which are used in the older Intel
Frameworks, not in the Framework 13 AMD.
[0] https://github.com/FrameworkComputer/EmbeddedController
[1] https://lore.kernel.org/lkml/20240515055631.5775-1-ben@xxxxxxxxxx/