Hello again,

I just finished a bisect of amd-staging-drm-next and it looks like the hang is
first introduced in edad8312cbbf9a33c86873fc4093664f150dd5c1
("drm/amdgpu: fix system hang issue during GPU reset").

It is a bit tricky, because it is committed on top of my first faulty patch
7173949df45482 ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers"), so
that patch needs to be reverted to fix the premature -EINVAL.

Test case:
  sudo sh -c 'echo "s 0 305 750" > /sys/class/drm/card0/device/pp_od_clk_voltage'

Results:
  edad8312cbbf9a3   + revert 7173949df45482 = hang
  edad8312cbbf9a3~1 + revert 7173949df45482 = no hang

Could you confirm that you get the same results?

Thanks,
Paweł Gronowski

On Fri, Jul 31, 2020 at 03:34:40PM +0200, Paweł Gronowski wrote:
> Hey Matt,
>
> I have just tested the amd-staging-drm-next branch
> (dd654c76d6e854afad716ded899e4404734aaa10) with my patches reverted
> and I can reproduce your issue with:
>
> sudo sh -c 'echo "s 0 305 750" > /sys/class/drm/card0/device/pp_od_clk_voltage'
>
> Which makes the sh hang with 100% usage.
>
> The issue does not happen on the mainline (d8b9faec54ae4bc2fff68bcd0befa93ace8256ce)
> both without and with my patches reapplied.
> So the problem must be related to some commit that is present in the
> amd-staging-drm-next but not in the mainline.
>
>
> Paweł Gronowski
>
> On Thu, Jul 30, 2020 at 06:34:14PM -0600, Matt Coffin wrote:
> > Hey Pawel,
> >
> > I did confirm that this patch *introduced* the issue both with the
> > bisect, and by testing reverting it.
> >
> > Now, there's a lot of fragile pieces in the dpm handling, so it could be
> > this patch's interaction with something else that's causing it and it
> > may well not be the fault of this code, but this is the patch that
> > introduced the issue.
> >
> > I'll have some more time tomorrow to try to get down to root cause here,
> > so maybe I'll have more to offer then.
> >
> > Thanks for taking a look,
> > Matt
> >
> > On 7/30/20 6:31 PM, Paweł Gronowski wrote:
> > > Hello Matt,
> > >
> > > Thank you for your testing. It seems that my gpu (RX 570) does not support the
> > > vc setting so I can not exactly reproduce the issue. However I did trace the
> > > code path the test case takes and it seems to correctly pass through the while
> > > loop that parses the input and fails only in amdgpu_dpm_odn_edit_dpm_table.
> > > The 'parameter' array is populated the same way as the original code did. Since
> > > the amdgpu_dpm_odn_edit_dpm_table is reached, I think that your problem is
> > > unfortunately caused by something else.
> > >
> > >
> > > Paweł Gronowski
> > >
> > > On Thu, Jul 30, 2020 at 08:49:41AM -0600, Matt Coffin wrote:
> > >> Hello all, I just did some testing with this applied, and while it no
> > >> longer returns -EINVAL, running `sudo sh -c 'echo "vc 2 2150 1195" >
> > >> /sys/class/drm/card1/device/pp_od_clk_voltage'` results in `sh` spiking
> > >> to, and staying at 100% CPU usage, with no indicating information in
> > >> `dmesg` from the kernel.
> > >>
> > >> It appeared to work at least ONCE, but potentially not after.
> > >>
> > >> This is not unique to Navi, and caused the problem on a POLARIS10 card
> > >> as well.
> > >>
> > >> Sorry for the bad news, and thanks for any insight you may have,
> > >> Matt Coffin
> > >>
> > >> On 7/29/20 8:53 PM, Alex Deucher wrote:
> > >>> On Wed, Jul 29, 2020 at 10:20 PM Paweł Gronowski <me@xxxxxxxxxx> wrote:
> > >>>>
> > >>>> Regression was introduced in commit 38e0c89a19fd
> > >>>> ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers") which
> > >>>> made the set_pp_od_clk_voltage and set_pp_power_profile_mode return
> > >>>> -EINVAL for previously valid input. This was caused by an empty
> > >>>> string (starting at the \0 character) being passed to the kstrtol.
> > >>>>
> > >>>> Signed-off-by: Paweł Gronowski <me@xxxxxxxxxx>
> > >>>
> > >>> Applied. Thanks!
> > >>>
> > >>> Alex
> > >>>
> > >>>> ---
> > >>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 9 +++++++--
> > >>>>  1 file changed, 7 insertions(+), 2 deletions(-)
> > >>>>
> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > >>>> index ebb8a28ff002..cbf623ff03bd 100644
> > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
> > >>>> @@ -778,12 +778,14 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct device *dev,
> > >>>>  		tmp_str++;
> > >>>>  	while (isspace(*++tmp_str));
> > >>>>
> > >>>> -	while ((sub_str = strsep(&tmp_str, delimiter)) != NULL) {
> > >>>> +	while ((sub_str = strsep(&tmp_str, delimiter)) && *sub_str) {
> > >>>>  		ret = kstrtol(sub_str, 0, &parameter[parameter_size]);
> > >>>>  		if (ret)
> > >>>>  			return -EINVAL;
> > >>>>  		parameter_size++;
> > >>>>
> > >>>> +		if (!tmp_str)
> > >>>> +			break;
> > >>>>  		while (isspace(*tmp_str))
> > >>>>  			tmp_str++;
> > >>>>  	}
> > >>>> @@ -1635,11 +1637,14 @@ static ssize_t amdgpu_set_pp_power_profile_mode(struct device *dev,
> > >>>>  			i++;
> > >>>>  		memcpy(buf_cpy, buf, count-i);
> > >>>>  		tmp_str = buf_cpy;
> > >>>> -		while ((sub_str = strsep(&tmp_str, delimiter)) != NULL) {
> > >>>> +		while ((sub_str = strsep(&tmp_str, delimiter)) && *sub_str) {
> > >>>>  			ret = kstrtol(sub_str, 0, &parameter[parameter_size]);
> > >>>>  			if (ret)
> > >>>>  				return -EINVAL;
> > >>>>  			parameter_size++;
> > >>>> +
> > >>>> +			if (!tmp_str)
> > >>>> +				break;
> > >>>>  			while (isspace(*tmp_str))
> > >>>>  				tmp_str++;
> > >>>>  		}
> > >>>> --
> > >>>> 2.25.1
> > >>>>
> > >>>> _______________________________________________
> > >>>> amd-gfx mailing list
> > >>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > >>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > >>> _______________________________________________
> > >>> amd-gfx mailing list
> > >>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> > >>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> > >>>
> > >>
> > >
> >
>
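
For reference, here is a minimal userspace sketch of the parsing regression that the quoted patch describes. It is not the kernel code. The assumptions are: the delimiter set is space and newline, parse_long() is a hypothetical and simplified stand-in for kstrtol() built on strtol(), and the buffer is passed in after the leading command letter ("s", "vc", ...) has already been skipped. parse_old() and parse_new() are names made up for illustration; they mirror the loop shape before and after the patch.

```c
#define _DEFAULT_SOURCE /* strsep() on glibc */
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for kstrtol(): rejects "" and trailing junk. */
static int parse_long(const char *s, long *res)
{
	char *end;

	if (*s == '\0')
		return -EINVAL;
	errno = 0;
	*res = strtol(s, &end, 0);
	if (errno || *end != '\0')
		return -EINVAL;
	return 0;
}

/*
 * Loop shape before the patch. Only call this with delimiter-terminated
 * input: without one, strsep() sets tmp_str to NULL after the last token
 * and the whitespace skip below would dereference it.
 */
static int parse_old(char *buf, long *out, int max)
{
	const char delimiter[] = " \n"; /* assumed delimiter set */
	char *tmp_str = buf, *sub_str;
	int n = 0;

	assert(buf && out);
	while ((sub_str = strsep(&tmp_str, delimiter)) != NULL) {
		/* After "750\n" the next token is "", which fails here. */
		if (n == max || parse_long(sub_str, &out[n]))
			return -EINVAL;
		n++;
		while (isspace((unsigned char)*tmp_str))
			tmp_str++;
	}
	return n;
}

/* Loop shape after the patch: stop on an empty token or a NULL tmp_str. */
static int parse_new(char *buf, long *out, int max)
{
	const char delimiter[] = " \n"; /* assumed delimiter set */
	char *tmp_str = buf, *sub_str;
	int n = 0;

	assert(buf && out);
	while ((sub_str = strsep(&tmp_str, delimiter)) && *sub_str) {
		if (n == max || parse_long(sub_str, &out[n]))
			return -EINVAL;
		n++;
		if (!tmp_str) /* no delimiter followed the last token */
			break;
		while (isspace((unsigned char)*tmp_str))
			tmp_str++;
	}
	return n; /* number of parameters parsed */
}
```

With the newline-terminated input that echo writes, e.g. "0 305 750\n", parse_old() reaches the empty token after the final delimiter and returns -EINVAL, while parse_new() stops there and returns 3. This only illustrates the -EINVAL regression; it says nothing about the hang discussed at the top of the thread.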