I actually *just* finished my bisect, and arrived at the same
conclusion. The hang appears to be introduced in
edad8312cbbf9a33c86873fc4093664f150dd5c1.

There are some conflicts with an automatic `git revert`, so I'm picking
through the changes now to fully understand what happened and come up
with a fix.

Thanks again for the help,
Matt

On 7/31/20 2:20 PM, Paweł Gronowski wrote:
> Hello again,
>
> I just finished a bisect of amd-staging-drm-next and it looks like
> the hang is first introduced in edad8312cbbf9a33c86873fc4093664f150dd5c1
> ("drm/amdgpu: fix system hang issue during GPU reset").
>
> It is a bit tricky, because it is committed on top of my first faulty patch
> 7173949df45482 ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers") so
> it needs to be reverted to fix the premature -EINVAL.
>
> Test case:
> sudo sh -c 'echo "s 0 305 750" > /sys/class/drm/card0/device/pp_od_clk_voltage'
> Results:
> edad8312cbbf9a3 + revert 7173949df45482 = hang
> edad8312cbbf9a3~1 + revert 7173949df45482 = no hang
>
> Could you confirm that you get the same results?
>
> Thanks,
> Paweł Gronowski
>
>
> On Fri, Jul 31, 2020 at 03:34:40PM +0200, Paweł Gronowski wrote:
>> Hey Matt,
>>
>> I have just tested the amd-staging-drm-next branch
>> (dd654c76d6e854afad716ded899e4404734aaa10) with my patches reverted
>> and I can reproduce your issue with:
>>
>> sudo sh -c 'echo "s 0 305 750" > /sys/class/drm/card0/device/pp_od_clk_voltage'
>>
>> Which makes the sh hang with 100% CPU usage.
>>
>> The issue does not happen on the mainline (d8b9faec54ae4bc2fff68bcd0befa93ace8256ce)
>> both without and with my patches reapplied.
>> So the problem must be related to some commit that is present in the
>> amd-staging-drm-next but not in the mainline.
>>
>>
>> Paweł Gronowski
>>
>> On Thu, Jul 30, 2020 at 06:34:14PM -0600, Matt Coffin wrote:
>>> Hey Pawel,
>>>
>>> I did confirm that this patch *introduced* the issue both with the
>>> bisect, and by testing reverting it.
>>>
>>> Now, there's a lot of fragile pieces in the dpm handling, so it could be
>>> this patch's interaction with something else that's causing it and it
>>> may well not be the fault of this code, but this is the patch that
>>> introduced the issue.
>>>
>>> I'll have some more time tomorrow to try to get down to root cause here,
>>> so maybe I'll have more to offer then.
>>>
>>> Thanks for taking a look,
>>> Matt
>>>
>>> On 7/30/20 6:31 PM, Paweł Gronowski wrote:
>>>> Hello Matt,
>>>>
>>>> Thank you for your testing. It seems that my gpu (RX 570) does not support the
>>>> vc setting so I can not exactly reproduce the issue. However I did trace the
>>>> code path the test case takes and it seems to correctly pass through the while
>>>> loop that parses the input and fails only in amdgpu_dpm_odn_edit_dpm_table.
>>>> The 'parameter' array is populated the same way as the original code did. Since
>>>> the amdgpu_dpm_odn_edit_dpm_table is reached, I think that your problem is
>>>> unfortunately caused by something else.
>>>>
>>>>
>>>> Paweł Gronowski
>>>>
>>>> On Thu, Jul 30, 2020 at 08:49:41AM -0600, Matt Coffin wrote:
>>>>> Hello all, I just did some testing with this applied, and while it no
>>>>> longer returns -EINVAL, running `sudo sh -c 'echo "vc 2 2150 1195" >
>>>>> /sys/class/drm/card1/device/pp_od_clk_voltage'` results in `sh` spiking
>>>>> to, and staying at 100% CPU usage, with no indicating information in
>>>>> `dmesg` from the kernel.
>>>>>
>>>>> It appeared to work at least ONCE, but potentially not after.
>>>>>
>>>>> This is not unique to Navi, and caused the problem on a POLARIS10 card
>>>>> as well.
>>>>>
>>>>> Sorry for the bad news, and thanks for any insight you may have,
>>>>> Matt Coffin
>>>>>
>>>>> On 7/29/20 8:53 PM, Alex Deucher wrote:
>>>>>> On Wed, Jul 29, 2020 at 10:20 PM Paweł Gronowski <me@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Regression was introduced in commit 38e0c89a19fd
>>>>>>> ("drm/amdgpu: Fix NULL dereference in dpm sysfs handlers") which
>>>>>>> made the set_pp_od_clk_voltage and set_pp_power_profile_mode return
>>>>>>> -EINVAL for previously valid input. This was caused by an empty
>>>>>>> string (starting at the \0 character) being passed to the kstrtol.
>>>>>>>
>>>>>>> Signed-off-by: Paweł Gronowski <me@xxxxxxxxxx>
>>>>>>
>>>>>> Applied. Thanks!
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c | 9 +++++++--
>>>>>>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
>>>>>>> index ebb8a28ff002..cbf623ff03bd 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c
>>>>>>> @@ -778,12 +778,14 @@ static ssize_t amdgpu_set_pp_od_clk_voltage(struct device *dev,
>>>>>>>  		tmp_str++;
>>>>>>>  	while (isspace(*++tmp_str));
>>>>>>>
>>>>>>> -	while ((sub_str = strsep(&tmp_str, delimiter)) != NULL) {
>>>>>>> +	while ((sub_str = strsep(&tmp_str, delimiter)) && *sub_str) {
>>>>>>>  		ret = kstrtol(sub_str, 0, &parameter[parameter_size]);
>>>>>>>  		if (ret)
>>>>>>>  			return -EINVAL;
>>>>>>>  		parameter_size++;
>>>>>>>
>>>>>>> +		if (!tmp_str)
>>>>>>> +			break;
>>>>>>>  		while (isspace(*tmp_str))
>>>>>>>  			tmp_str++;
>>>>>>>  	}
>>>>>>> @@ -1635,11 +1637,14 @@ static ssize_t amdgpu_set_pp_power_profile_mode(struct device *dev,
>>>>>>>  		i++;
>>>>>>>  	memcpy(buf_cpy, buf, count-i);
>>>>>>>  	tmp_str = buf_cpy;
>>>>>>> -	while ((sub_str = strsep(&tmp_str, delimiter)) != NULL) {
>>>>>>> +	while ((sub_str = strsep(&tmp_str, delimiter)) && *sub_str) {
>>>>>>>  		ret = kstrtol(sub_str, 0, &parameter[parameter_size]);
>>>>>>>  		if (ret)
>>>>>>>  			return -EINVAL;
>>>>>>>  		parameter_size++;
>>>>>>> +
>>>>>>> +		if (!tmp_str)
>>>>>>> +			break;
>>>>>>>  		while (isspace(*tmp_str))
>>>>>>>  			tmp_str++;
>>>>>>>  	}
>>>>>>> --
>>>>>>> 2.25.1
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> amd-gfx mailing list
>>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>>>>>> _______________________________________________
>>>>>> amd-gfx mailing list
>>>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx
>>>>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
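
For anyone following the quoted patch, the loop it changes can be tried out in userspace. The sketch below is only an illustration under stated assumptions, not the driver code: `parse_params` and `parse_long` are hypothetical names, `strtol` stands in for the kernel's `kstrtol` (which, unlike this stand-in, also tolerates a single trailing newline), and the " \n" delimiter set is assumed from the sysfs input shown in the test cases.

```c
#define _DEFAULT_SOURCE /* exposes strsep() in glibc */
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for kstrtol(): the whole token must be
 * a number, and an empty string is rejected with -EINVAL.
 */
static int parse_long(const char *s, long *out)
{
	char *end;

	if (*s == '\0')
		return -EINVAL;
	errno = 0;
	*out = strtol(s, &end, 0);
	if (errno || *end != '\0')
		return -EINVAL;
	return 0;
}

/*
 * Parse whitespace-separated integers from buf (modified in place).
 * Returns the number of values stored, or -EINVAL on a bad token.
 */
static int parse_params(char *buf, long *params, int max_params)
{
	const char delimiter[] = " \n"; /* assumed delimiter set */
	char *tmp_str = buf;
	char *sub_str;
	int n = 0;

	/* Stop on an empty token instead of handing it to the parser. */
	while ((sub_str = strsep(&tmp_str, delimiter)) && *sub_str) {
		if (n >= max_params)
			return -EINVAL;
		if (parse_long(sub_str, &params[n]))
			return -EINVAL;
		n++;

		/* strsep() sets tmp_str to NULL once no delimiter is left. */
		if (!tmp_str)
			break;
		while (isspace((unsigned char)*tmp_str))
			tmp_str++;
	}
	return n;
}
```

With the buffer "0 305 750\n", the third `strsep` call consumes the newline and leaves `tmp_str` pointing at the terminating \0, so the fourth call returns an empty token. The `*sub_str` check ends the loop there rather than passing the empty string on to the number parser, which is the case the commit message describes. With "2 2150 1195" and no trailing newline, `tmp_str` becomes NULL after the last token and the `!tmp_str` check avoids dereferencing it.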