Re: Kernel 6.7+ broke under-powering of my RX 6700XT. (Archlinux, mesa/amdgpu)

Thorsten Leemhuis <regressions@xxxxxxxxxxxxx> · Wed, 21 Feb 2024 16:44:18 +0100

[+Linus, as we seem to have reached the point in the discussion about
this regression where that is likely for the best.

And just for the record: I'm *not* doing that because I'm disappointed,
angry, or something. I can relate to the point that was made in the mail
I'm replying to. It's just that this is a tricky situation due to the
"hardware might be damaged or work unreliably" aspect, so it's best if
we all know how Linus wants this to be handled.]

BTW, thread starts here:
https://lore.kernel.org/all/ae64f04d-6e94-4da4-a740-78ea94e0552c@xxxxxxxxxxxxxxxxxxx/

On 21.02.24 16:15, Christian König wrote:
> Am 21.02.24 um 07:06 schrieb Linux regression tracking (Thorsten Leemhuis):
>> On 20.02.24 21:18, Alex Deucher wrote:
>>> On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@xxxxxxxxx> wrote:
>>>> If the increased low range is allowed via a boot option, like in the
>>>> proposed patch, the user clearly made an intentional decision. The
>>>> result is undefined, but it won't fry the hardware for sure.
>>>> Overclocking is undefined in that respect as well: you can, for
>>>> example, go out of range with the ratio of voltage vs. frequency
>>>> (still within the vendor's limits) and crash the system.
>>> This whole thing reminds me of this:
>>> https://xkcd.com/1172/
>>> The problem is another module parameter is another interface to
>>> maintain and validate.
>> Yup, of course, all that is understood.
>>
>> But we have this "no regressions" rule for a reason. Adhering to it
>> strictly would afaics be counter-productive in this situation, but
>> giving users some way to manually do what was possible out of the box
>> before is IMHO the minimum we should do.
>>
>> Maybe just allow that parameter only up to a certain recent GPU
>> generation, that way you won't have to deal with that at some point in
>> the future.
>>
>>>   Moreover, we've had a number of cases in the
>>> past where users have under or overclocked and reported bugs or
>>> stability issues and it did not come to light that they were doing
>>> that until we'd already spent a good deal of time trying to debug the
>>> issue.
>> Taint the kernel when that module parameter is used? We iirc have a
>> taint bit exactly for this sort of situation. Sure, such reports will
>> still happen, but then you at least have an indicator to spot them.
> 
> Let me recap what happened here:
> 
> 1. AMD is the GPU manufacturer, but apart from a few exceptions doesn't
> assemble boards.
> 
> 2. Vendors take AMD's GPUs and assemble them together with power
> regulators, memory and a bunch of other components into a PCIe board.
> 
> 3. AMD provides a vendor-agnostic driver, and for this to work vendors
> describe the min/max voltage their power regulators can handle in some
> flash memory.
> 
> 4. Hardware engineers pointed out that AMD's open source drivers are not
> respecting the min value.
> 
> 5. In response a patch was applied to respect that value and not use
> something outside of the hardware specification the vendor provided.
> 
> I'm not sure about it, but I think AMD needs to respect the min/max
> values simply by contract, and not doing that is not really an option.
> 
> If someone really wants to run their hardware outside the
> vendor-recommended values, that person can still patch the driver to
> ignore the limits. It's just that AMD is then not responsible for any
> damage resulting from that.
> 
> So as far as I can see the request to make that a module option is a
> no-go, especially since hardware engineers have explicitly pointed out
> that we have to do this in the software stack.
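
For anyone following along, the range check described in the recap can
be sketched from userspace. This is only an illustration, not the driver
code: it assumes the hwmon files power1_cap_min and power1_cap_max that
amdgpu exposes (values in microwatts), and the function names are made
up for this example.

```python
from pathlib import Path

MICROWATTS_PER_WATT = 1_000_000


def read_cap_limits(hwmon_dir):
    """Return (min_uw, max_uw) as reported by the driver, in microwatts.

    hwmon_dir is the card's hwmon directory; the exact path differs per
    system and has to be supplied by the caller.
    """
    hwmon = Path(hwmon_dir)
    min_uw = int((hwmon / "power1_cap_min").read_text().strip())
    max_uw = int((hwmon / "power1_cap_max").read_text().strip())
    return min_uw, max_uw


def check_requested_cap(requested_watts, min_uw, max_uw):
    """Convert a requested cap to microwatts and reject out-of-range values.

    Hypothetical helper: it mirrors the idea of a min/max bounding box,
    it is not how the kernel implements the check.
    """
    requested_uw = round(requested_watts * MICROWATTS_PER_WATT)
    if not min_uw <= requested_uw <= max_uw:
        raise ValueError(
            f"{requested_watts} W is outside the reported range "
            f"{min_uw / MICROWATTS_PER_WATT:g}-"
            f"{max_uw / MICROWATTS_PER_WATT:g} W"
        )
    return requested_uw
```

With a reported minimum of 190 W, as in the original report, a request
for 115 W is rejected by check_requested_cap(), while a value inside the
reported range is returned in microwatts.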

As mentioned above: I can relate to that point of view. But in the end
this is the kernel and "no regressions" is something that is considered
the #1 rule in the development process and especially so by Linus
himself. So let's see if he has something to say here. If he doesn't
reply I'll rest my case. :-D

Ciao, Thorsten

>>>   This obviously can still happen if you allow any sort of over
>>> or underclocking, but at least if you stick to the limits you are
>>> staying within the bounding box of the design.
>>>
>>> Alex
>>>
>>>> On 2/20/24 19:09, Alex Deucher wrote:
>>>>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@xxxxxxxxx> wrote:
>>>>>> For Windows, apps like MSI Afterburner are the ones to try and what
>>>>>> most people go for. Having used it in the past myself, I would be
>>>>>> surprised if it adhered to such a high min power cap. But even if it
>>>>>> did, why would we have to?
>>>>>>
>>>>>> Relying on the vendor's cap has already proven wrong in this case,
>>>>>> because things worked for quite some time and people reported saving
>>>>>> a significant number of watts, in my case 90W(!) for <10% perf.
>>>>>>
>>>>>> Therefore this talk about safety seems rather strange to me,
>>>>>> especially when we are talking about min_cap. Or name me a single
>>>>>> case where someone fried his card due to "too low power" set in said
>>>>>> variable. There was a report where, by going way too low, the driver
>>>>>> did the opposite and went to max power. That's it. That can be easily
>>>>>> detected (fans going crazy etc.) and reverted. It is max_cap that
>>>>>> protects the HW (also in the above scenario), not min_cap. Feel free
>>>>>> to adhere to safety standards with that one.
>>>>> Because operation outside of the design bounding box is undefined.  It
>>>>> might work for some boards but not others.  It's possible some of the
>>>>> logic in the firmware or some of the components used on the board may
>>>>> not work correctly below a certain limit, or the voltage regulators
>>>>> used on a specific board have a minimum requirement that would not be
>>>>> an issue if you stick to the bounding box.
>>>>>
>>>>> Alex
>>>>>
>>>>>> As for a solution, what some suggested already exists: a patch
>>>>>> posted by fililip on gitlab is probably the way most of you would
>>>>>> agree on. It introduces a variable that can be set during boot to
>>>>>> override min_cap. But he did not submit it as a pull request, so
>>>>>> please, if any one of you who has access to the code and can merge
>>>>>> into the kernel would be kind enough to implement it.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2/20/24 16:46, Alex Deucher wrote:
>>>>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten
>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>>> On 20.02.24 16:27, Hans de Goede wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 2/20/24 16:15, Alex Deucher wrote:
>>>>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking
>>>>>>>>>> (Thorsten
>>>>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote:
>>>>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking
>>>>>>>>>>>> (Thorsten
>>>>>>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote:
>>>>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote:
>>>>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote:
>>>>>>>>>>>>>>> The minimum power limit on the latest (6.7+) kernels is 190W
>>>>>>>>>>>>>>> for my GPU (RX 6700XT, mesa, archlinux) and I cannot get the
>>>>>>>>>>>>>>> power cap as low as before (115W), neither with Corectrl,
>>>>>>>>>>>>>>> LACT nor TuxClocker, and the variable in /sys is read-only
>>>>>>>>>>>>>>> even for root. This is not an issue of the above apps but of
>>>>>>>>>>>>>>> the kernel; I read about similar issues in other bug reports
>>>>>>>>>>>>>>> for those apps. I downgraded to the v6.6.10 kernel and my
>>>>>>>>>>>>>>> 115W (under-power) cap works again as before.
>>>>>>>>>>>>> For the record and everyone that lands here: the cause is
>>>>>>>>>>>>> known now (it's 1958946858a62b ("drm/amd/pm: Support for
>>>>>>>>>>>>> getting power1_cap_min value") [v6.7-rc1]) and the issue is
>>>>>>>>>>>>> afaics tracked here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other mentions:
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137
>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992
>>>>>>>>>>>>>
>>>>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now
>>>>>>>>>>>>> CCed) on this there yet (but might have missed something!).
>>>>>>>>>>>>> From what I can see I assume this will likely be somewhat
>>>>>>>>>>>>> tricky to handle, as an overall revert might be a bad idea
>>>>>>>>>>>>> here. We'll see I guess.
>>>>>>>>>>>> The change aligns the driver with what has been validated on
>>>>>>>>>>>> each board design.  Windows uses the same limits.  Using values
>>>>>>>>>>>> lower than the validated range can lead to undefined behavior
>>>>>>>>>>>> and could potentially damage your hardware.
>>>>>>>>>>> Thx for the reply! Yeah, I was expecting something along
>>>>>>>>>>> those lines.
>>>>>>>>>>>
>>>>>>>>>>> Nevertheless it afaics still is a regression in the eyes of
>>>>>>>>>>> many users.
>>>>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we
>>>>>>>>>>> can find
>>>>>>>>>>> some solution here so that users that really want to, can
>>>>>>>>>>> continue to do
>>>>>>>>>>> what was possible out-of-the box before. Is that possible to
>>>>>>>>>>> realize or
>>>>>>>>>>> even supported already?
>>>>>>>>>>>
>>>>>>>>>>> And sure, those users would be running their hardware outside
>>>>>>>>>>> of its
>>>>>>>>>>> specifications. But is that different from overclocking
>>>>>>>>>>> (which the
>>>>>>>>>>> driver allows, doesn't it? If not by all means please correct
>>>>>>>>>>> me!)?
>>>>>>>>>> Sure.  The driver has always had upper bound limits for
>>>>>>>>>> overclocking,
>>>>>>>>>> this change adds lower bounds checking for underclocking as well.
>>>>>>>>>> When the silicon validation teams set the bounding box for a
>>>>>>>>>> device,
>>>>>>>>>> they set a range of values where it's reasonable to operate
>>>>>>>>>> based on
>>>>>>>>>> the characteristics of the design.
>>>>>>>>>>
>>>>>>>>>> If we did want to allow extended underclocking, we need a big
>>>>>>>>>> warning
>>>>>>>>>> in the logs at the very least.
>>>>>>>>> Requiring a module-option to be set to allow this, as well as a
>>>>>>>>> big
>>>>>>>>> warning in the logs sounds like a good solution to me.
>>>>>>>> Yeah, especially as it sounds from some of the reports as if some
>>>>>>>> vendors did a really bad job when it came to setting proper
>>>>>>>> lower-bound limits, which are now adhered to -- and thus are higher
>>>>>>>> than what we used out of the box before 1958946858a62b was applied.
>>>>>>>>
>>>>>>>> Side note: I assume that "lower bounds checking" is done in roughly
>>>>>>>> the same way by the Windows driver? Does that one allow users to go
>>>>>>>> lower somehow? Say after modifying the registry or something like
>>>>>>>> that? Or through external tools?
>>>>>>> Windows uses the same limit.  I'm not aware of any way to override
>>>>>>> the limit on Windows offhand.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>>> Ciao, Thorsten
>>>>>>>>
>>>>>>>>>>>>> Roman posted something that apparently was meant to go to
>>>>>>>>>>>>> the list, so
>>>>>>>>>>>>> let me put it here:
>>>>>>>>>>>>>
>>>>>>>>>>>>> """
>>>>>>>>>>>>> UPDATE: User fililip already posted a patch, but it needs to
>>>>>>>>>>>>> be merged; the discussion is at the gitlab link below.
>>>>>>>>>>>>>
>>>>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using
>>>>>>>>>>>>> original addr.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> it seems that commit was already found(see user's
>>>>>>>>>>>>>> 'fililip' comment):
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183
>>>>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39
>>>>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@xxxxxxx>
>>>>>>>>>>>>>> Date:   Thu Oct 12 09:33:45 2023 +0800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        drm/amd/pm: Support for getting power1_cap_min value
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        Support for getting power1_cap_min value on smu13 and smu11.
>>>>>>>>>>>>>>        For other Asics, we still use 0 as the default value.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        Signed-off-by: Ma Jun <Jun.Ma2@xxxxxxx>
>>>>>>>>>>>>>>        Reviewed-by: Kenneth Feng <kenneth.feng@xxxxxxx>
>>>>>>>>>>>>>>        Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, this is not good, as it cuts the under-powering range
>>>>>>>>>>>>>> far too much. I was getting only about 7% less performance but
>>>>>>>>>>>>>> 90W(!) less consumption when set to my 115W before. Also I
>>>>>>>>>>>>>> wonder if we, as an OS of options and freedom, have to stick to
>>>>>>>>>>>>>> such a very high reference for min values without the ability
>>>>>>>>>>>>>> to override them through some sys ctrls. The commit was done by
>>>>>>>>>>>>>> an AMD guy, and I wonder if it was maybe because of this post
>>>>>>>>>>>>>> that I made a few months ago (business strategy?):
>>>>>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is not a dangerous upward OC, where I can understand the
>>>>>>>>>>>>>> desire to protect HW; it is downward. Denying a cap lower than
>>>>>>>>>>>>>> 190W when the card delivers almost the same speed at 115W is
>>>>>>>>>>>>>> IMO crazy. We are not talking about default or reference values
>>>>>>>>>>>>>> here either, just about a move to narrow the range of options
>>>>>>>>>>>>>> for whatever reason.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I don't know how much power you guys have over them, but please
>>>>>>>>>>>>>> consider either reverting this change or giving us an option to
>>>>>>>>>>>>>> set min_cap through, say, /sys (right now the param is
>>>>>>>>>>>>>> read-only, even for root).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you in advance for looking into this, with regards:
>>>>>>>>>>>>>> Romano
>>>>>>>>>>>>> """
>>>>>>>>>>>>>
>>>>>>>>>>>>> And while at it, let me add this issue to the tracking as well
>>>>>>>>>>>>>
>>>>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux
>>>>>>>>>>>>> kernel regressions; the text you find below is based on a few
>>>>>>>>>>>>> template paragraphs you might have encountered already in
>>>>>>>>>>>>> similar form. See link in footer if these mails annoy you.]
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall
>>>>>>>>>>>>> through the
>>>>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux
>>>>>>>>>>>>> kernel regression
>>>>>>>>>>>>> tracking bot:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #regzbot introduced 1958946858a62b /
>>>>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression
>>>>>>>>>>>>> tracker' hat)
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> Everything you wanna know about Linux kernel regression
>>>>>>>>>>>>> tracking:
>>>>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr
>>>>>>>>>>>>> That page also explains what to do if mails like this annoy
>>>>>>>>>>>>> you.