On 13/04/17 11:38 AM, Panariti, David wrote: > + Vilas > >> -----Original Message----- >> From: Deucher, Alexander >> Sent: Wednesday, April 12, 2017 9:29 PM >> To: 'Michel Dänzer' <michel at daenzer.net>; Panariti, David >> <David.Panariti at amd.com> >> Cc: amd-gfx at lists.freedesktop.org >> Subject: RE: [PATCH] drm/amdgpu: Add kernel parameter to manage >> memory error handling. >> >>> -----Original Message----- >>> From: Michel Dänzer [mailto:michel at daenzer.net] >>> Sent: Wednesday, April 12, 2017 9:17 PM >>> To: Panariti, David >>> Cc: Deucher, Alexander; amd-gfx at lists.freedesktop.org >>> Subject: Re: [PATCH] drm/amdgpu: Add kernel parameter to manage >> memory >>> error handling. >>> >>> On 13/04/17 02:38 AM, Panariti, David wrote: >>>>> From: Michel Dänzer [mailto:michel at daenzer.net] >>>>> >>>>>> @@ -212,6 +213,9 @@ module_param_named(cg_mask, >>>>> amdgpu_cg_mask, uint, >>>>>> 0444); MODULE_PARM_DESC(pg_mask, "Powergating flags mask (0 = >>>>> disable >>>>>> power gating)"); module_param_named(pg_mask, >> amdgpu_pg_mask, >>>>> uint, >>>>>> 0444); >>>>>> >>>>>> +MODULE_PARM_DESC(ecc_mask, "ECC/EDC flags mask (0 = disable >>>>>> +ECC/EDC)"); >>>>> >>>>> "0 = disable ECC/EDC" implies that they're enabled by default? Was >>>>> that already the case before this patch? >>>> >>>> [davep] Yes it was, and there was actually a problem in some cases >>>> where the CZ would hang which is why I added the param. I was >>>> wondering if it would be better to default to them being off, but I >>>> wasn't sure how important maintaining original behavior is >>>> considered. Actually, there are some bugs in the workaround function >>>> as it is, so it really should default to off. >>> >>> I agree. There have been some bug reports about Carrizo hangs, I >>> wonder if any of those might be related to this. >> >> Only the embedded SKUs support EDC. If they are embedded parts, it could >> be related. > > [davep] Sorry for the length, but I wanted all of the details out there for the most informed decision. > > Another thing is that they can go from not hanging to hanging for no discernable reason. > The KIQ changes, however, have seemed to have fixed it. For one chip and a few tens of reboots. > > There is also the issue of improperly initialized *gpr registers. > From the doc: > "Due to a hardware condition whereby some shader instructions utilize uninitialized SGPRs and/or VGPRs, the S/VPGR memories must be initialized prior to EDC operation, as, not doing so will cause erroneous counts to show up in the EDC counters." > I seem to recall Vilas saying it is the poison that isn't reset properly. But I'm not sure about the actual register contents. > Vilas? > > I suggest, at a minimum, checking for cz *and* the EDC fuse. If so, explicitly disable EDC, run the shaders to zero the *gprs, leave EDC disabled, and merge in the existing new code to zero all of the counters. > All of the code exists, in one place or another. > > The parameter to enable/disable probably won't be needed until EDC is fully implemented. > However, EDC can be enabled in a way that simply allows it to count errors. This has never caused a hang for me. > The counts are useful for reliability research. This was one of the goals of the original EDC task. > A umr script could be written (by interested parties) to read the counters and in fact enable the EDC counters. > I think this should be done if anyone is interested in the numbers. > Vilas? Any R&R work left in this area? Do you think customers would be interested in doing this on their own? For the special keeners we could add EDC counters to umr's --top and then in theory it'll be included in the log output. If you can send me info on how to enable/read the counters I can take a look at it. Cheers, Tom