Agreed... one person's "best" is another person's "OMG I didn't want that". IMO we should have bits correspond to specific options as much as possible, modulo HW capabilities. >-----Original Message----- >From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of >Xie, AlexBin >Sent: Monday, June 26, 2017 2:12 PM >To: Panariti, David; Deucher, Alexander; amd-gfx at lists.freedesktop.org >Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use >of ECC/EDC. > >Hi, > >I have not checked the background of this discussion very closely yet. And you >might have known the following. > >Customers may not want the default setting to change meaning. This is like an >API. >Example: The application and its environment is already set up and tested. >Then if customer updates driver, suddenly driver has some new behavior? >Certain serious application definitely does not accept this. > >IMHO, it is better to avoid vague concepts like "best". It will become rather >difficult to define what is best when there are multiple customers with >different requirements. Driver is to provide a feature or mechanism. "Best" >sounds like a policy or a preference from driver side. > >In my pass work, I generally use default for two cases: >1. The default is the most conservative option, which must work. Then the >application can choose advanced features by choosing other parameter >value/option. >2. The default parameter is the compatible behavior before introducing this >parameter/option. > >Regards, >Alex Bin > > >-----Original Message----- >From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On Behalf Of >Panariti, David >Sent: Monday, June 26, 2017 12:06 PM >To: Deucher, Alexander <Alexander.Deucher at amd.com>; amd- >gfx at lists.freedesktop.org >Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control use >of ECC/EDC. > >>> > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by >>> > default. That can be our asic specific default setting. In the >>> > case of CZ, that will be disabled until we decide to enable EDC by default. >>> [davep] I'm confused. ECC...BEST will cause EDC to be enabled. >>> I used ECC as the generic term for ECC and EDC, since ECC seems more >>> basic (EDC is built on top of ECC). >>> If I understand you, we can't do what you want with the current setup. > >>I'm saying we make ECC_BEST the default config (feel free to re-name if >>ECC_DEFAULT). Each asic can have a different default depending on what >>features are ready. So for CZ, we'd make ECC_BEST equivalent to >>disabling ECC for now. If a user wants to force it on, they can set ECC_EDC. >Once EDC is stable on CZ, we can make ECC_BEST be equivalent to ECC_EDC. >The way the default (ECC_BEST) always maps to the best available >combination in that version of the driver. > >That's not how I meant it to work WRT BEST. >Each asic will have a DEFAULT, but that isn't what BEST means. >CZ is a good example (when fully implemented). DEFAULT for CZ is everything >except HALT, since, IMO opinion, most people do not want to hang or reboot. >BEST for CZ would be everything a person most interested in reliability would >want, which IMO, includes HALT/reboot. >Similar is if something like performance degradation is really bad, DEFAULT >would be OFF. BEST would be ON, e.g., if the user's app doesn't trigger the >performance problem. >The BEST bit is in a fixed position, so that customers don't need to worry what >bits are needed for the most reliable performance (in our opinion) on a given >asic. >And if a customer (or developer) wants some arbitrary set of features, they >can set bits as they want. > >I think DEFAULT will make most people happy. >BEST allows people who are interested in everything they can get, regardless >of any issues that brings with it. It is requested simply by using a fixed param >value (0x01) for any asic. >This probably should not include features that have any kind of fatal flaw such >as the Vega10 HBM ECC issue. When fixed, it can be added to DEFAULT. >And allowing per-feature control allows anyone to do precisely what they >want. >"Effort" increases as the number of interested users decreases. > >Using defines in the init code will be a problem if there is more than one kind >of asic involved or a single asic that the user wants to use with different >parameters. However, this doesn't seem to be a high priority. >If we do want to worry about it, then we'll need to put the values into the >amdgpu_gfx struct. > >regards, >davep > >> -----Original Message----- >> From: Deucher, Alexander >> Sent: Tuesday, June 06, 2017 6:16 PM >> To: Panariti, David <David.Panariti at amd.com>; amd- >> gfx at lists.freedesktop.org >> Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control >> use of ECC/EDC. >> >> > -----Original Message----- >> > From: Panariti, David >> > Sent: Tuesday, June 06, 2017 5:50 PM >> > To: Deucher, Alexander; amd-gfx at lists.freedesktop.org >> > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control >> > use of ECC/EDC. >> > >> > Rather than inlining this in a number of places, Re verbosity: >> > I've worked in embedded environments and when dealing with >> > intermittent problems it's nice to have all of the information ASAP >> > rather than waiting for a problem to reoccur, especially if it's >> > very >> intermittent. >> > I would've preferred more. >> > Since it only shows up happens on CZ, it adds little to the output. >> > I like to show the reasons why EDC didn't happen, hence the >> > backwards looking messages. >> > In this particular case, without the "... not requested..." we >> > can't tell if it was the flags or the ring being unready that made us bail >earlier. >> >> I'm fine with a message about EDC either being enabled or disabled, >> but a bunch of random debug statements along the way are too much. >> They tend to just cause confusion and clutter up the logs. >> >> > >> > > -----Original Message----- >> > > From: Deucher, Alexander >> > > Sent: Tuesday, June 06, 2017 5:22 PM >> > > To: Panariti, David <David.Panariti at amd.com>; amd- >> > > gfx at lists.freedesktop.org >> > > Cc: Panariti, David <David.Panariti at amd.com> >> > > Subject: RE: [PATCH 3/3] drm/amdgpu: Add kernel parameter to >> > > control >> > use >> > > of ECC/EDC. >> > > >> > > > -----Original Message----- >> > > > From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On >> > Behalf >> > > > Of David Panariti >> > > > Sent: Tuesday, June 06, 2017 4:33 PM >> > > > To: amd-gfx at lists.freedesktop.org >> > > > Cc: Panariti, David >> > > > Subject: [PATCH 3/3] drm/amdgpu: Add kernel parameter to control >> > > > use of ECC/EDC. >> > > > >> > > > Allow various kinds of memory integrity methods (e.g. ECC/EDC) >> > > > to be enabled or disabled. By default, all features are disabled. >> > > > >> > > > EDC is Error Detection and Correction. It can detect ECC errors >> > > > and do 0 or more of: count SEC (single error corrected) and DED >> > > > (double error detected, i.e. uncorrected ECC error), halt the >> > > > affected block, >> > > interrupt the CPU. >> > > > Currently, only counting errors is supported. >> > > > >> > > > Fixed a whitespace error. >> > > > >> > > > Signed-off-by: David Panariti <David.Panariti at amd.com> >> > > > --- >> > > > drivers/gpu/drm/amd/amdgpu/amdgpu.h | 1 + >> > > > drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 4 ++++ >> > > > drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c | 28 >> > > > ++++++++++++++++++++++++++-- >> > > > drivers/gpu/drm/amd/include/amd_shared.h | 14 ++++++++++++++ >> > > > 4 files changed, 45 insertions(+), 2 deletions(-) >> > > > >> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> > > > index a6f51eb..3e930ee 100644 >> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h >> > > > @@ -116,6 +116,7 @@ extern int amdgpu_cntl_sb_buf_per_se; >> > > > extern >> > int >> > > > amdgpu_param_buf_per_se; extern int amdgpu_job_hang_limit; >> > extern >> > > > int amdgpu_lbpw; >> > > > +extern unsigned amdgpu_ecc_flags; >> > > > >> > > > #define AMDGPU_DEFAULT_GTT_SIZE_MB 3072ULL /* >> > 3GB by >> > > > default */ >> > > > #define AMDGPU_WAIT_IDLE_TIMEOUT_IN_MS 3000 >> > > > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >> > > > b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >> > > > index c4825ff..972660d 100644 >> > > > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >> > > > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c >> > > > @@ -115,6 +115,7 @@ int amdgpu_cntl_sb_buf_per_se = 0; int >> > > > amdgpu_param_buf_per_se = 0; int amdgpu_job_hang_limit = 0; >> > > > int amdgpu_lbpw = -1; >> > > > +unsigned amdgpu_ecc_flags = 0; >> > > >> > > I'd suggest setting amdgpu_ecc_flags to AMD_ECC_SUPPORT_BEST by >> > > default. That can be our asic specific default setting. In the >> > > case of CZ, that will be disabled until we decide to enable EDC by default. >> > [davep] I'm confused. ECC...BEST will cause EDC to be enabled. >> > I used ECC as the generic term for ECC and EDC, since ECC seems more >> > basic (EDC is built on top of ECC). >> > If I understand you, we can't do what you want with the current setup. >> >> I'm saying we make ECC_BEST the default config (feel free to re-name >> if ECC_DEFAULT). Each asic can have a different default depending on >> what features are ready. So for CZ, we'd make ECC_BEST equivalent to >> disabling ECC for now. If a user wants to force it on, they can set >> ECC_EDC. Once EDC is stable on CZ, we can make ECC_BEST be equivalent >> to ECC_EDC. The way the default (ECC_BEST) always maps to the best >> available combination in that version of the driver. >> >> E.g., in the current gfx8 code: >> >> if ((amdgpu_ecc_flags & ECC_EDC) && edc_fuse_enabled) { >> // enable EDC >> goto enable_edc; >> } else { >> // disable EDC >> return 0; >> } >> >> Then if we want to enable EDC by default, we'd change the code like so: >> >> if (((amdgpu_ecc_flags == ECC_EDC) || (amdgpu_ecc_flags & ECC_EDC)) && >> edc_fuse_enabled) { >> // enable EDC >> goto enable_edc; >> } else { >> // disable EDC >> return 0; >> } >> >> That way we can have different ECC features enabled by default for >> each asic family without needing to specify command line options. >> >> Alex >> >> > >> > > >> > > > >> > > > MODULE_PARM_DESC(vramlimit, "Restrict VRAM for testing, in >> > > > megabytes"); module_param_named(vramlimit, amdgpu_vram_limit, >> > int, >> > > > 0600); @@ -795,6 +796,9 @@ static struct pci_driver >> > > > amdgpu_kms_pci_driver = { >> > > > .driver.pm = &amdgpu_pm_ops, >> > > > }; >> > > > >> > > > +MODULE_PARM_DESC(ecc_flags, "ECC/EDC enable flags (0 = disable >> > > > ECC/EDC (default))"); >> > > > +module_param_named(ecc_flags, amdgpu_ecc_flags, uint, 0444); >> > > > + >> > > > >> > > > >> > > > static int __init amdgpu_init(void) diff --git >> > > > a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> > > > b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> > > > index 46e766e..3b5685c 100644 >> > > > --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> > > > +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c >> > > > @@ -1824,7 +1824,17 @@ static int >> > > > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev) >> > > > if (dis_bit) { >> > > > /* On Carrizo, EDC may be permanently disabled by a >fuse. */ >> > > > DRM_INFO("CZ EDC hardware is disabled, >GC_EDC_CONFIG: >> > > > 0x%08x.\n", >> > > > - tmp); >> > > > + tmp); >> > > > + return 0; >> > > > + } >> > > > + >> > > > + /* >> > > > + * Check if EDC has been requested by a kernel parameter. >> > > > + * For Carrizo, EDC is the best/safest mode WRT error >handling. >> > > > + */ >> > > > + if (!(amdgpu_ecc_flags >> > > > + & (AMD_ECC_SUPPORT_BEST | >AMD_ECC_SUPPORT_EDC))) { >> > > > + DRM_INFO("EDC support has not been >requested.\n"); >> > > >> > > Can drop this debugging statement or make it DRM_DEBUG. >> > [davep] See top post. >> > > >> > > > return 0; >> > > > } >> > > > >> > > > @@ -1962,6 +1972,20 @@ static int >> > > > gfx_v8_0_do_edc_gpr_workarounds(struct amdgpu_device *adev) >> > > > goto fail; >> > > > } >> > > > >> > > > + /* 00 - GB_EDC_DED_MODE_LOG: Count DED errors but do >not halt >> > > > */ >> > > > + tmp = REG_SET_FIELD(tmp, GB_EDC_MODE, DED_MODE, 0); >> > > > + /* Do not propagate the errors to the next block. */ >> > > > + tmp = REG_SET_FIELD(tmp, GB_EDC_MODE, PROP_FED, 0); >> > > > + WREG32(mmGB_EDC_MODE, tmp); >> > > > + >> > > > + tmp = RREG32(mmCC_GC_EDC_CONFIG); >> > > > + >> > > > + /* >> > > > + * Clear EDC_DISABLE bit so the counters are available. >> > > > + */ >> > > > + tmp = REG_SET_FIELD(tmp, CC_GC_EDC_CONFIG, DIS_EDC, >0); >> > > > + WREG32(mmCC_GC_EDC_CONFIG, tmp); >> > > > + >> > > > gfx_v8_0_edc_clear_counters(adev); >> > > > >> > > > fail: >> > > > @@ -5548,7 +5572,7 @@ static int gfx_v8_0_pre_soft_reset(void >> > *handle) >> > > > gfx_v8_0_cp_compute_enable(adev, false); >> > > > } >> > > > >> > > > - return 0; >> > > > + return 0; >> > > > } >> > > > >> > > > static int gfx_v8_0_soft_reset(void *handle) diff --git >> > > > a/drivers/gpu/drm/amd/include/amd_shared.h >> > > > b/drivers/gpu/drm/amd/include/amd_shared.h >> > > > index 0f58e95..1cf30aa 100644 >> > > > --- a/drivers/gpu/drm/amd/include/amd_shared.h >> > > > +++ b/drivers/gpu/drm/amd/include/amd_shared.h >> > > > @@ -187,6 +187,20 @@ enum amd_fan_ctrl_mode { >> > > > #define AMD_PG_SUPPORT_GFX_QUICK_MG (1 << 11) >> > > > #define AMD_PG_SUPPORT_GFX_PIPELINE (1 << 12) >> > > > >> > > > +/* >> > > > + * ECC flags >> > > > + * Allows the user to choose what kind of error >> > > > +detection/correction is >> > > > used. >> > > > + * Currently, EDC is supported on Carrizo. >> > > > + * >> > > > + * The AMD_ECC_SUPPORT_BEST bit is used to allow a user to have >> > > > + the >> > > > driver >> > > > + * set what it thinks is best/safest mode. This may not be the >> > > > +same as the >> > > > + * default, depending on the GPU and the application. >> > > > + * Using a single bit makes it easy to request the best support >> > > > +without >> > > > + * needing to know all currently supported modes. >> > > > + */ >> > > > +#define AMD_ECC_SUPPORT_BEST (1 << 0) >> > > > +#define AMD_ECC_SUPPORT_EDC (1 << 1) >> > > > + >> > > > enum amd_pm_state_type { >> > > > /* not used for dpm */ >> > > > POWER_STATE_TYPE_DEFAULT, >> > > > -- >> > > > 2.7.4 >> > > > >> > > > _______________________________________________ >> > > > amd-gfx mailing list >> > > > amd-gfx at lists.freedesktop.org >> > > > https://lists.freedesktop.org/mailman/listinfo/amd-gfx >_______________________________________________ >amd-gfx mailing list >amd-gfx at lists.freedesktop.org >https://lists.freedesktop.org/mailman/listinfo/amd-gfx >_______________________________________________ >amd-gfx mailing list >amd-gfx at lists.freedesktop.org >https://lists.freedesktop.org/mailman/listinfo/amd-gfx