On Wed, May 29, 2024 at 12:40 AM Stephen Horvath <s.horvath@xxxxxxxxxxxxxx> wrote:
>
> Hi Thomas,
>
> On 29/5/24 16:23, Thomas Weißschuh wrote:
> > On 2024-05-29 10:58:23+0000, Stephen Horvath wrote:
> >> On 29/5/24 09:29, Guenter Roeck wrote:
> >>> On 5/28/24 09:15, Thomas Weißschuh wrote:
> >>>> On 2024-05-28 08:50:49+0000, Guenter Roeck wrote:
> >>>>> On 5/27/24 17:15, Stephen Horvath wrote:
> >>>>>> On 28/5/24 05:24, Thomas Weißschuh wrote:
> >>>>>>> On 2024-05-25 09:13:09+0000, Stephen Horvath wrote:
> >>>>>>>> Don't forget it can also return `EC_FAN_SPEED_STALLED`.
> >
> > <snip>
> >
> >>>>>>>
> >>>>>>> Thanks for the hint. I'll need to think about how to
> >>>>>>> handle this better.
> >>>>>>>
> >>>>>>>> Like Guenter, I also don't like returning `-ENODEV`,
> >>>>>>>> but I don't have a
> >>>>>>>> problem with checking for `EC_FAN_SPEED_NOT_PRESENT`
> >>>>>>>> in case it was removed
> >>>>>>>> since init or something.
> >>>>>>>
> >>>>>
> >>>>> That won't happen. Chromebooks are not servers, where one might
> >>>>> be able to
> >>>>> replace a fan tray while the system is running.
> >>>>
> >>>> In one of my testruns this actually happened.
> >>>> When running on battery, one specific of the CPU sensors sporadically
> >>>> returned EC_FAN_SPEED_NOT_PRESENT.
> >>>>
> >>>
> >>> What Chromebook was that ? I can't see the code path in the EC source
> >>> that would get me there.
> >>>
> >>
> >> I believe Thomas and I both have the Framework 13 AMD, the source code is
> >> here:
> >> https://github.com/FrameworkComputer/EmbeddedController/tree/lotus-zephyr
> >
> > Correct.
> >
> >> The organisation confuses me a little, but Dustin has previous said on the
> >> framework forums (https://community.frame.work/t/what-ec-is-used/38574/2):
> >>
> >> "This one is based on the Zephyr port of the ChromeOS EC, and tracks
> >> mainline more closely. It is in the branch lotus-zephyr.
> >> All of the model-specific code lives in zephyr/program/lotus.
> >> The 13"-specific code lives in a few subdirectories off the main tree named
> >> azalea."
> >
> > The EC code is at [0]:
> >
> > $ ectool version
> > RO version: azalea_v3.4.113353-ec:b4c1fb,os
> > RW version: azalea_v3.4.113353-ec:b4c1fb,os
> > Firmware copy: RO
> > Build info: azalea_v3.4.113353-ec:b4c1fb,os:7b88e1,cmsis:4aa3ff 2024-03-26 07:10:22 lotus@ip-172-26-3-226
> > Tool version: 0.0.1-isolate May 6 2024 none
>
> I can confirm mine is the same build too.
>
> > From the build info I gather it should be commit b4c1fb, which is the
> > current HEAD of the lotus-zephyr branch.
> > Lotus is the Framework 16 AMD, which is very similar to Azalea, the
> > Framework 13 AMD, which I tested this against.
> > Both share the same codebase.
> >
> >> Also I just unplugged my fan and you are definitely correct, the EC only
> >> generates EC_FAN_SPEED_NOT_PRESENT for fans it does not have the capability
> >> to support. Even after a reboot it just returns 0 RPM for an unplugged fan.
> >> I thought about simulating a stall too, but I was mildly scared I was going
> >> to break one of the tiny blades.
> >
> > I get the error when unplugging *the charger*.
> >
> > To be more precise:
> >
> > It does not happen always.
> > It does not happen instantly on unplugging.
> > It goes away after a few seconds/minutes.
> > During the issue, one specific sensor reads 0xffff.
> >
>
> Oh I see, I haven't played around with the temp sensors until now, but I
> can confirm the last temp sensor (cpu@4c / temp4) will randomly (every
> ~2-15 seconds) return EC_TEMP_SENSOR_ERROR (0xfe).
> Unplugging the charger doesn't seem to have any impact for me.
> The related ACPI sensor also says 180.8°C.
> I'll probably create an issue or something shortly.
>
> I was mildly confused by 'CPU sensors' and 'EC_FAN_SPEED_NOT_PRESENT' in
> the same sentence, but I'm now assuming you mean the temp sensor?
>

Same here.
It might not matter as much if the values were the same, but
EC_FAN_SPEED_NOT_PRESENT == 0xffff, and EC_TEMP_SENSOR_NOT_PRESENT == 0xff,
so they must not be confused with each other. EC_TEMP_SENSOR_NOT_PRESENT
should be static as well, though, and not be returned randomly.

Guenter

> >>>>>>> Ok.
> >>>>>>>
> >>>>>>>> My approach was to return the speed as `0`, since
> >>>>>>>> the fan probably isn't
> >>>>>>>> spinning, but set HWMON_F_FAULT for `EC_FAN_SPEED_NOT_PRESENT` and
> >>>>>>>> HWMON_F_ALARM for `EC_FAN_SPEED_STALLED`.
> >>>>>>>> No idea if this is correct though.
> >>>>>>>
> >>>>>>> I'm not a fan of returning a speed of 0 in case of errors.
> >>>>>>> Rather -EIO which can't be mistaken.
> >>>>>>> Maybe -EIO for both EC_FAN_SPEED_NOT_PRESENT (which
> >>>>>>> should never happen)
> >>>>>>> and also for EC_FAN_SPEED_STALLED.
> >>>>>>
> >>>>>> Yeah, that's pretty reasonable.
> >>>>>>
> >>>>>
> >>>>> -EIO is an i/o error. I have trouble reconciling that with
> >>>>> EC_FAN_SPEED_NOT_PRESENT or EC_FAN_SPEED_STALLED.
> >>>>>
> >>>>> Looking into the EC source code [1], I see:
> >>>>>
> >>>>> EC_FAN_SPEED_NOT_PRESENT means that the fan is not present.
> >>>>> That should return -ENODEV in the above code, but only for
> >>>>> the purpose of making the attribute invisible.
> >>>>>
> >>>>> EC_FAN_SPEED_STALLED means exactly that, i.e., that the fan
> >>>>> is present but not turning. The EC code does not expect that
> >>>>> to happen and generates a thermal event in case it does.
> >>>>> Given that, it does make sense to set the fault flag.
> >>>>> The actual fan speed value should then be reported as 0 or
> >>>>> possibly -ENODATA. It should _not_ generate any other error
> >>>>> because that would trip up the "sensors" command for no
> >>>>> good reason.
> >>>>
> >>>> Ack.
> >>>>
> >>>> Currently I have the following logic (for both fans and temp):
> >>>>
> >>>> if NOT_PRESENT during probing:
> >>>>   make the attribute invisible.
> >>>>
> >>>> if any error during runtime (including NOT_PRESENT):
> >>>>   return -ENODATA and a FAULT
> >>>>
> >>>> This should also handle the sporadic NOT_PRESENT failures.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Is there any other feedback to this revision or should I send the next?
> >>>>
> >>>
> >>> No, except I'd really like to know which Chromebook randomly generates
> >>> a EC_FAN_SPEED_NOT_PRESENT response because that really looks like a bug.
> >>> Also, can you reproduce the problem with the ectool command ?
> >
> > Yes, the ectool command reports the same issue at the same time.
> >
> > The fan affected was always the sensor cpu@4c, which is
> > compatible = "amd,sb-tsi".
> >
> >> I have a feeling it was related to the concurrency problems between ACPI and
> >> the CrOS code that are being fixed in another patch by Ben Walsh, I was also
> >> seeing some weird behaviour sometimes but I *believe* it was fixed by that.
> >
> > I don't think it's this issue.
> > Ben's series at [1], is for MEC ECs which are the older Intel
> > Frameworks, not the Framework 13 AMD.
>
> Yeah sorry, I saw it mentioned AMD and threw it into my kernel, I also
> thought it stopped the 'packet too long' messages (for
> EC_CMD_CONSOLE_SNAPSHOT) but it did not.
>
> Thanks,
> Steve
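
For reference, the probe/runtime logic quoted above, together with the
point that the fan and temp sentinels differ in width, could be sketched
roughly as follows. This is only an illustration in plain userspace C, not
the driver code from the patch: the helper names (fan_visible_at_probe(),
temp_visible_at_probe(), fan_fault(), temp_fault(), fan_read_rpm()) are
hypothetical, and the 0xfffe value for EC_FAN_SPEED_STALLED is not stated
in this thread and is assumed from ec_commands.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#ifndef ENODATA
#define ENODATA 61 /* Linux value; fallback for libcs that omit it */
#endif

/*
 * Sentinel readings. 0xffff, 0xff and 0xfe are quoted in the thread;
 * 0xfffe for STALLED is an assumption taken from ec_commands.h.
 */
#define EC_FAN_SPEED_NOT_PRESENT   0xffff
#define EC_FAN_SPEED_STALLED       0xfffe
#define EC_TEMP_SENSOR_NOT_PRESENT 0xff
#define EC_TEMP_SENSOR_ERROR       0xfe

/* The two NOT_PRESENT values differ, so each needs its own check. */
static_assert(EC_FAN_SPEED_NOT_PRESENT != EC_TEMP_SENSOR_NOT_PRESENT,
	      "fan and temp sentinels must not be confused");

/* Probe time: an attribute is only exposed if the EC reports the device. */
static bool fan_visible_at_probe(uint16_t raw)
{
	return raw != EC_FAN_SPEED_NOT_PRESENT;
}

static bool temp_visible_at_probe(uint8_t raw)
{
	return raw != EC_TEMP_SENSOR_NOT_PRESENT;
}

/* Runtime: any sentinel (including a sporadic NOT_PRESENT) is a fault. */
static bool fan_fault(uint16_t raw)
{
	return raw == EC_FAN_SPEED_NOT_PRESENT || raw == EC_FAN_SPEED_STALLED;
}

static bool temp_fault(uint8_t raw)
{
	return raw == EC_TEMP_SENSOR_NOT_PRESENT || raw == EC_TEMP_SENSOR_ERROR;
}

/* Runtime read: -ENODATA on a fault, otherwise the RPM value as reported. */
static int fan_read_rpm(uint16_t raw, long *rpm)
{
	if (fan_fault(raw))
		return -ENODATA;
	*rpm = raw;
	return 0;
}
```

In a real hwmon driver the probe-time checks would presumably feed the
is_visible callback and the fault helpers would back the *_fault
attributes; that mapping is an assumption here, not something the thread
settles.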