On Thu, Mar 14, 2024 at 12:16 PM Khatri, Sunil <sukhatri@xxxxxxx> wrote: > > > On 3/14/2024 8:12 PM, Alex Deucher wrote: > > On Thu, Mar 14, 2024 at 1:44 AM Khatri, Sunil <sukhatri@xxxxxxx> wrote: > >> > >> On 3/14/2024 1:58 AM, Alex Deucher wrote: > >>> On Tue, Mar 12, 2024 at 8:41 AM Sunil Khatri <sunil.khatri@xxxxxxx> wrote: > >>>> Add all the IP's information on a SOC to the > >>>> devcoredump. > >>>> > >>>> Signed-off-by: Sunil Khatri <sunil.khatri@xxxxxxx> > >>>> --- > >>>> drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c | 19 +++++++++++++++++++ > >>>> 1 file changed, 19 insertions(+) > >>>> > >>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > >>>> index a0dbccad2f53..611fdb90a1fc 100644 > >>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > >>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_reset.c > >>>> @@ -196,6 +196,25 @@ amdgpu_devcoredump_read(char *buffer, loff_t offset, size_t count, > >>>> coredump->reset_task_info.process_name, > >>>> coredump->reset_task_info.pid); > >>>> > >>>> + /* GPU IP's information of the SOC */ > >>>> + if (coredump->adev) { > >>>> + drm_printf(&p, "\nIP Information\n"); > >>>> + drm_printf(&p, "SOC Family: %d\n", coredump->adev->family); > >>>> + drm_printf(&p, "SOC Revision id: %d\n", coredump->adev->rev_id); > >>>> + > >>>> + for (int i = 0; i < coredump->adev->num_ip_blocks; i++) { > >>>> + struct amdgpu_ip_block *ip = > >>>> + &coredump->adev->ip_blocks[i]; > >>>> + drm_printf(&p, "IP type: %d IP name: %s\n", > >>>> + ip->version->type, > >>>> + ip->version->funcs->name); > >>>> + drm_printf(&p, "IP version: (%d,%d,%d)\n\n", > >>>> + ip->version->major, > >>>> + ip->version->minor, > >>>> + ip->version->rev); > >>>> + } > >>>> + } > >>> I think the IP discovery table would be more useful. Either walk the > >>> adev->ip_versions structure, or just include the IP discovery binary. > >> I did explore the adev->ip_versions and if i just go through the array > >> it doesn't give any useful information directly. > >> There are no ways to find directly from adev->ip_versions below things > >> until i also reparse the discovery binary again like done the discovery > >> amdgpu_discovery_reg_base_init and walk through the headers of various > >> ips using discovery binary. > >> a. Which IP is available on soc or not. > >> b. How many instances are there > >> Also i again have to change back to major, minor and rev convention for > >> this information to be useful. I am exploring it more if i find some > >> other information i will update. > >> > >> adev->ip_block[] is derived from ip discovery only for each block which > >> is there on the SOC, so we are not reading information which isnt > >> applicable for the soc. We have name , type and version no of the IPs > >> available on the soc. If you want i could add no of instances of each IP > >> too if you think that's useful information here. Could you share what > >> information is missing in this approach so i can include that. > > I was hoping to get the actual IP versions for the IPs from IP > > discovery rather than the versions from the ip_block array. The > > latter are common so you can end up with the same version used across > > a wide variety of chips (e.g., all gfx10.x based chips use the same > > gfx 10 IP code even if the actual IP version is different for most of > > the chips). > Got it. let me check how to get it could be done rightly. > > > >> For dumping the IP discovery binary, i dont understand how that > >> information would be useful directly and needs to be decoded like we are > >> doing in discovery init. Please correct me if my understanding is wrong > >> here. > > It's probably not a high priority, I was just thinking it might be > > useful to have in case there ended up being some problem related to > > the IP discovery table on some boards. E.g., we'd know that all > > boards with a certain harvest config seem to align with a reported > > problem. Similar for vbios. It's more for telemetry. E.g., all the > > boards reporting some problem have a particular powerplay config or > > whatever. > I got it. > But two points of contention here in my understanding. The dump works > only where there is reset and not sure if it could be used very early in > development of not. Second point is that devcoredump is 4096 > bytes/4Kbyte of memory where we are dumping all the information. Not > sure if that could be increased but it might not be enough if we are > planning to dump all to it. ah, ok. Let's skip the IP versions in that case, we can use the family and rev_id and external_rev_id to look up the IP versions. Alex > > Another point is since we have sysfs/debugfs/info ioctl etc information > available. We should sort out what really is helpful in debugging GPU > hang and that's added in devcore. > > Regards > Sunil > > > > > Alex > > > > > >>> Alex > >>> > >>>> + > >>>> if (coredump->ring) { > >>>> drm_printf(&p, "\nRing timed out details\n"); > >>>> drm_printf(&p, "IP Type: %d Ring Name: %s\n", > >>>> -- > >>>> 2.34.1 > >>>>