On 20.02.24 21:18, Alex Deucher wrote: > On Tue, Feb 20, 2024 at 2:41 PM Romano <romaniox@xxxxxxxxx> wrote: >> >> If the increased low range is allowed via boot option, like in proposed >> patch, user clearly made an intentional decision. Undefined, but won't >> fry his hardware for sure. Undefined is also overclocking in that >> matter. You can go out of range with ratio of voltage vs frequency(still >> within vendor's limits) for example and crash the system. > > This whole thing reminds me of this: > https://xkcd.com/1172/ > The problem is another module parameter is another interface to > maintain and validate. Yup, of course, all that is understood. But we have this "no regressions" rule for a reason. Adhering to it strictly would afaics be counter-productive in this situation, but give users some way to manually do what was possible before out-of-the box IMHO is the minimum we should do. Maybe just allow that parameter only up to a certain recent GPU generation, that way you won't have to deal with that at some point in the future. > Moreover, we've had a number of cases in the > past where users have under or overclocked and reported bugs or > stability issues and it did not come to light that they were doing > that until we'd already spent a good deal of time trying to debug the > issue. Taint the kernel when that module parameter is used? We iirc have a taint bit exactly for this sort of situation. Sure, such reports will still happen, but then you at least have an indicator to spot them. Ciao, Thorsten > This obviously can still happen if you allow any sort of over > or underclocking, but at least if you stick to the limits you are > staying within the bounding box of the design. > > Alex > >> On 2/20/24 19:09, Alex Deucher wrote: >>> On Tue, Feb 20, 2024 at 11:46 AM Romano <romaniox@xxxxxxxxx> wrote: >>>> For Windows, apps like MSI Afterburner is the one to try and what most >>>> people go for. Using it in the past myself, I would be surprised if it >>>> adhered to such a high min power cap. But even if it did, why would we >>>> have to. >>>> >>>> Relying on vendors cap in this case has already proven wrong because >>>> things worked for quite some time already and people reported saving >>>> significant amount of watts, in my case 90W(!) for <10% perf. >>>> >>>> Therefore this talk about safety seems rather strange to me and >>>> especially so when we are talking about min_cap. Or name me a single >>>> case where someone fried his card due to "too low power" set in said >>>> variable. Now there was a report, where by going way too low, driver >>>> goes opposite into max power. That's it. That can be easily >>>> detected(vents going crazy etc.) and reverted. It is a max_cap that >>>> protect HW(also above scenario), not a min_cap. Feel free to adhere to >>>> safety standards with that one. >>> Because operation outside of the design bounding box is undefined. It >>> might work for some boards but not others. It's possible some of the >>> logic in the firmware or some of the components used on the board may >>> not work correctly below a certain limit, or the voltage regulators >>> used on a specific board have a minimum requirement that would not be >>> an issue if you stick the bounding box. >>> >>> Alex >>> >>>> As for solution, what some suggested already exist - a patch posted by >>>> fililip on gitlab is probably the way most of you would agree. It >>>> introduce a variable that can be set during boot to override min_cap. >>>> But he did not pull requested it, so please, if any one of you who have >>>> access to code and merge kernel would be kind enough to implement it. >>>> >>>> >>>> >>>> On 2/20/24 16:46, Alex Deucher wrote: >>>>> On Tue, Feb 20, 2024 at 10:42 AM Linux regression tracking (Thorsten >>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote: >>>>>> >>>>>> On 20.02.24 16:27, Hans de Goede wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On 2/20/24 16:15, Alex Deucher wrote: >>>>>>>> On Tue, Feb 20, 2024 at 10:03 AM Linux regression tracking (Thorsten >>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote: >>>>>>>>> On 20.02.24 15:45, Alex Deucher wrote: >>>>>>>>>> On Mon, Feb 19, 2024 at 9:47 AM Linux regression tracking (Thorsten >>>>>>>>>> Leemhuis) <regressions@xxxxxxxxxxxxx> wrote: >>>>>>>>>>> On 17.02.24 14:30, Greg KH wrote: >>>>>>>>>>>> On Sat, Feb 17, 2024 at 02:01:54PM +0100, Roman Benes wrote: >>>>>>>>>>>>> Minimum power limit on latest(6.7+) kernels is 190W for my GPU (RX 6700XT, >>>>>>>>>>>>> mesa, archlinux) and I cannot get power cap as low as before(to 115W), >>>>>>>>>>>>> neither with Corectrl, LACT or TuxClocker and /sys have a variable read-only >>>>>>>>>>>>> even for root. This is not of above apps issue but of the kernel, I read >>>>>>>>>>>>> similar issues from other bug reports of above apps. I downgraded to v6.6.10 >>>>>>>>>>>>> kernel and my 115W(under power)cap work again as before. >>>>>>>>>>> For the record and everyone that lands here: the cause is known now >>>>>>>>>>> (it's 1958946858a62b ("drm/amd/pm: Support for getting power1_cap_min >>>>>>>>>>> value") [v6.7-rc1]) and the issue afaics tracked here: >>>>>>>>>>> >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183 >>>>>>>>>>> >>>>>>>>>>> Other mentions: >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3137 >>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/2992 >>>>>>>>>>> >>>>>>>>>>> Haven't seen any statement from the amdgpu developers (now CCed) yet on >>>>>>>>>>> this there (but might have missed something!). From what I can see I >>>>>>>>>>> assume this will likely be somewhat tricky to handle, as a revert >>>>>>>>>>> overall might be a bad idea here. We'll see I guess. >>>>>>>>>> The change aligns the driver what has been validated on each board >>>>>>>>>> design. Windows uses the same limits. Using values lower than the >>>>>>>>>> validated range can lead to undefined behavior and could potentially >>>>>>>>>> damage your hardware. >>>>>>>>> Thx for the reply! Yeah, I was expecting something along those lines. >>>>>>>>> >>>>>>>>> Nevertheless it afaics still is a regression in the eyes of many users. >>>>>>>>> I'm not sure how Linus feels about this, but I wonder if we can find >>>>>>>>> some solution here so that users that really want to, can continue to do >>>>>>>>> what was possible out-of-the box before. Is that possible to realize or >>>>>>>>> even supported already? >>>>>>>>> >>>>>>>>> And sure, those users would be running their hardware outside of its >>>>>>>>> specifications. But is that different from overclocking (which the >>>>>>>>> driver allows, doesn't it? If not by all means please correct me!)? >>>>>>>> Sure. The driver has always had upper bound limits for overclocking, >>>>>>>> this change adds lower bounds checking for underclocking as well. >>>>>>>> When the silicon validation teams set the bounding box for a device, >>>>>>>> they set a range of values where it's reasonable to operate based on >>>>>>>> the characteristics of the design. >>>>>>>> >>>>>>>> If we did want to allow extended underclocking, we need a big warning >>>>>>>> in the logs at the very least. >>>>>>> Requiring a module-option to be set to allow this, as well as a big >>>>>>> warning in the logs sounds like a good solution to me. >>>>>> Yeah, especially as it sounds from some of the reports as if some >>>>>> vendors did a really bad job when it came to setting the proper >>>>>> lower-bound limits are now adhered -- and thus higher then what we used >>>>>> out-of-the box before 1958946858a62b was applied. >>>>>> >>>>>> Side note: I assume those "lower bounds checking" is done round about >>>>>> the same way by the Windows driver? Does that one allow users to go >>>>>> lower somehow? Say after modifying the registry or something like that? >>>>>> Or through external tools? >>>>> Windows uses the same limit. I'm not aware of any way to override the >>>>> limit on windows off hand. >>>>> >>>>> Alex >>>>> >>>>> >>>>>> Ciao, Thorsten >>>>>> >>>>>>>>>>> Roman posted something that apparently was meant to go to the list, so >>>>>>>>>>> let me put it here: >>>>>>>>>>> >>>>>>>>>>> """ >>>>>>>>>>> UPDATE: User fililip already posted patch, but it need to be merged, >>>>>>>>>>> discussion is on gitlab link below. >>>>>>>>>>> >>>>>>>>>>> (PS: I hope I am replying correctly to "all" now? - using original addr.) >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> it seems that commit was already found(see user's 'fililip' comment): >>>>>>>>>>>> >>>>>>>>>>>> https://gitlab.freedesktop.org/drm/amd/-/issues/3183 >>>>>>>>>>>> commit 1958946858a62b6b5392ed075aa219d199bcae39 >>>>>>>>>>>> Author: Ma Jun <Jun.Ma2@xxxxxxx> >>>>>>>>>>>> Date: Thu Oct 12 09:33:45 2023 +0800 >>>>>>>>>>>> >>>>>>>>>>>> drm/amd/pm: Support for getting power1_cap_min value >>>>>>>>>>>> >>>>>>>>>>>> Support for getting power1_cap_min value on smu13 and smu11. >>>>>>>>>>>> For other Asics, we still use 0 as the default value. >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: Ma Jun <Jun.Ma2@xxxxxxx> >>>>>>>>>>>> Reviewed-by: Kenneth Feng <kenneth.feng@xxxxxxx> >>>>>>>>>>>> Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx> >>>>>>>>>>>> >>>>>>>>>>>> However, this is not good as it remove under-powering range too far. I >>>>>>>>>>> was getting only about 7% less performance but 90W(!) less consumption >>>>>>>>>>> when set to my 115W before. Also I wonder if we as a OS of options and >>>>>>>>>>> freedom have to stick to such very high reference for min values without >>>>>>>>>>> ability to override them through some sys ctrls. Commit was done by amd >>>>>>>>>>> guy and I wonder if because of maybe this post that I made few months >>>>>>>>>>> ago(business strategy?): >>>>>>>>>>> https://www.reddit.com/r/Amd/comments/183gye7/rx_6700xt_from_230w_to_capped_115w_at_only_10/ >>>>>>>>>>>> This is not a dangerous OC upwards where I can understand desire to >>>>>>>>>>> protect HW, it is downward, having min cap at 190W when card pull on >>>>>>>>>>> 115W almost same speed is IMO crazy to deny. We don't talk about default >>>>>>>>>>> or reference values here either, just a move to lower the range of >>>>>>>>>>> options for whatever reason. >>>>>>>>>>>> I don't know how much power you guys have over them, but please >>>>>>>>>>> consider either reverting this change, or give us an option to set >>>>>>>>>>> min_cap through say /sys (right now param is readonly, even for root). >>>>>>>>>>>> Thank you in advance for looking into this, with regards: Romano >>>>>>>>>>> """ >>>>>>>>>>> >>>>>>>>>>> And while at it, let me add this issue to the tracking as well >>>>>>>>>>> >>>>>>>>>>> [TLDR: I'm adding this report to the list of tracked Linux kernel >>>>>>>>>>> regressions; the text you find below is based on a few templates >>>>>>>>>>> paragraphs you might have encountered already in similar form. >>>>>>>>>>> See link in footer if these mails annoy you.] >>>>>>>>>>> >>>>>>>>>>> Thanks for the report. To be sure the issue doesn't fall through the >>>>>>>>>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression >>>>>>>>>>> tracking bot: >>>>>>>>>>> >>>>>>>>>>> #regzbot introduced 1958946858a62b / >>>>>>>>>>> #regzbot title drm: amdgpu: under-powering broke >>>>>>>>>>> >>>>>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >>>>>>>>>>> -- >>>>>>>>>>> Everything you wanna know about Linux kernel regression tracking: >>>>>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr >>>>>>>>>>> That page also explains what to do if mails like this annoy you. >>>>>>> > >