>-----Original Message----- >From: Dan Williams <dan.j.williams@xxxxxxxxx> >Sent: 27 January 2025 23:17 >To: Shiju Jose <shiju.jose@xxxxxxxxxx>; Dan Williams ><dan.j.williams@xxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; linux- >cxl@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; linux- >kernel@xxxxxxxxxxxxxxx >Cc: bp@xxxxxxxxx; tony.luck@xxxxxxxxx; rafael@xxxxxxxxxx; lenb@xxxxxxxxxx; >mchehab@xxxxxxxxxx; dave@xxxxxxxxxxxx; Jonathan Cameron ><jonathan.cameron@xxxxxxxxxx>; dave.jiang@xxxxxxxxx; >alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx; >david@xxxxxxxxxx; Vilas.Sridharan@xxxxxxx; leo.duran@xxxxxxx; >Yazen.Ghannam@xxxxxxx; rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx; >Jon.Grimm@xxxxxxx; dave.hansen@xxxxxxxxxxxxxxx; >naoya.horiguchi@xxxxxxx; james.morse@xxxxxxx; jthoughton@xxxxxxxxxx; >somasundaram.a@xxxxxxx; erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; >duenwen@xxxxxxxxxx; gthelen@xxxxxxxxxx; >wschwartz@xxxxxxxxxxxxxxxxxxx; dferguson@xxxxxxxxxxxxxxxxxxx; >wbs@xxxxxxxxxxxxxxxxxxxxxx; nifan.cxl@xxxxxxxxx; tanxiaofei ><tanxiaofei@xxxxxxxxxx>; Zengtao (B) <prime.zeng@xxxxxxxxxxxxx>; Roberto >Sassu <roberto.sassu@xxxxxxxxxx>; kangkang.shen@xxxxxxxxxxxxx; >wanghuiqiang <wanghuiqiang@xxxxxxxxxx>; Linuxarm ><linuxarm@xxxxxxxxxx> >Subject: RE: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol >scrub control feature > >Shiju Jose wrote: >> Hi Dan, >> >> Thanks for the comments. >> >> Please find reply inline. >> >> Thanks, >> Shiju >> >-----Original Message----- >> >From: Dan Williams <dan.j.williams@xxxxxxxxx> >> >Sent: 24 January 2025 20:39 >> >To: Shiju Jose <shiju.jose@xxxxxxxxxx>; linux-edac@xxxxxxxxxxxxxxx; >> >linux- cxl@xxxxxxxxxxxxxxx; linux-acpi@xxxxxxxxxxxxxxx; >> >linux-mm@xxxxxxxxx; linux- kernel@xxxxxxxxxxxxxxx >> >Cc: bp@xxxxxxxxx; tony.luck@xxxxxxxxx; rafael@xxxxxxxxxx; >> >lenb@xxxxxxxxxx; mchehab@xxxxxxxxxx; dan.j.williams@xxxxxxxxx; >> >dave@xxxxxxxxxxxx; Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>; >> >dave.jiang@xxxxxxxxx; alison.schofield@xxxxxxxxx; >> >vishal.l.verma@xxxxxxxxx; ira.weiny@xxxxxxxxx; david@xxxxxxxxxx; >> >Vilas.Sridharan@xxxxxxx; leo.duran@xxxxxxx; >Yazen.Ghannam@xxxxxxx; >> >rientjes@xxxxxxxxxx; jiaqiyan@xxxxxxxxxx; Jon.Grimm@xxxxxxx; >> >dave.hansen@xxxxxxxxxxxxxxx; naoya.horiguchi@xxxxxxx; >> >james.morse@xxxxxxx; jthoughton@xxxxxxxxxx; >somasundaram.a@xxxxxxx; >> >erdemaktas@xxxxxxxxxx; pgonda@xxxxxxxxxx; duenwen@xxxxxxxxxx; >> >gthelen@xxxxxxxxxx; wschwartz@xxxxxxxxxxxxxxxxxxx; >> >dferguson@xxxxxxxxxxxxxxxxxxx; wbs@xxxxxxxxxxxxxxxxxxxxxx; >> >nifan.cxl@xxxxxxxxx; tanxiaofei <tanxiaofei@xxxxxxxxxx>; Zengtao (B) >> ><prime.zeng@xxxxxxxxxxxxx>; Roberto Sassu <roberto.sassu@xxxxxxxxxx>; >> >kangkang.shen@xxxxxxxxxxxxx; wanghuiqiang ><wanghuiqiang@xxxxxxxxxx>; >> >Linuxarm <linuxarm@xxxxxxxxxx>; Shiju Jose <shiju.jose@xxxxxxxxxx> >> >Subject: Re: [PATCH v18 15/19] cxl/memfeature: Add CXL memory device >> >patrol scrub control feature >> > >> >shiju.jose@ wrote: >> >> From: Shiju Jose <shiju.jose@xxxxxxxxxx> >> >> >> >> CXL spec 3.1 section 8.2.9.9.11.1 describes the device patrol scrub >> >> control feature. The device patrol scrub proactively locates and >> >> makes corrections to errors in regular cycle. >> >> >> >> Allow specifying the number of hours within which the patrol scrub >> >> must be completed, subject to minimum and maximum limits reported >> >> by the >> >device. >> >> Also allow disabling scrub allowing trade-off error rates against >> >> performance. >> >> >> >> Add support for patrol scrub control on CXL memory devices. >> >> Register with the EDAC device driver, which retrieves the scrub >> >> attribute descriptors from EDAC scrub and exposes the sysfs scrub >> >> control attributes to userspace. For example, scrub control for the >> >> CXL memory device "cxl_mem0" is exposed in >> >/sys/bus/edac/devices/cxl_mem0/scrubX/. >> >> >> >> Additionally, add support for region-based CXL memory patrol scrub >control. >> >> CXL memory regions may be interleaved across one or more CXL memory >> >> devices. For example, region-based scrub control for "cxl_region1" >> >> is exposed in /sys/bus/edac/devices/cxl_region1/scrubX/. >> >> >> >> Reviewed-by: Dave Jiang <dave.jiang@xxxxxxxxx> >> >> Co-developed-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> >> >> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> >> >> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx> >> >> --- >> >> Documentation/edac/scrub.rst | 66 ++++++ >> >> drivers/cxl/Kconfig | 17 ++ >> >> drivers/cxl/core/Makefile | 1 + >> >> drivers/cxl/core/memfeature.c | 392 >> >++++++++++++++++++++++++++++++++++ >> >> drivers/cxl/core/region.c | 6 + >> >> drivers/cxl/cxlmem.h | 7 + >> >> drivers/cxl/mem.c | 5 + >> >> include/cxl/features.h | 16 ++ >> >> 8 files changed, 510 insertions(+) create mode 100644 >> >> drivers/cxl/core/memfeature.c diff --git >> >> a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst index >> >> f86645c7f0af..80e986c57885 100644 >> >> --- a/Documentation/edac/scrub.rst >> >> +++ b/Documentation/edac/scrub.rst >[..] >> > >> >What is this content-free blob of cat and echo statements? Please >> >write actual documentation with theory of operation, clarification of >> >assumptions, rationale for defaults, guidance on changing defaults... >> >> Jonathan already replied. > >I disagree that any of that is useful to include without rationale, and if the >rationale is already somewhere else then delete the multiple lines of showing >how 'cat' and 'echo' work with sysfs. I will discuss with Jonathan on this how to modify. > >[..] >> >> + depends on CXL_MEM >> > >> >Similar comment, and this also goes away if all of this just moves >> >into the new cxl_features driver. >> >> Agree with Jonathan told in reply. These are RAS specific features >> for CXL memory devices and thus added in memfeature.c > >Apoligies for this comment, I had meant to delete it along with some other >commentary along this theme after thinking it through. > >I am now advocating that Dave drop his cxl_features driver altogether and >mirror your approach. I.e. EDAC is registered from existing CXL drivers, and >FWCTL can be registered against a cxl_memdev just like the fw_upload ABI. > >There was a concern that CXL needed a separate FWCTL driver in case >distributions wanted to have a policy against FWCTL, but given CXL already has >CONFIG_CXL_MEM_RAW_COMMANDS at compile-time and a wide array of CXL >bus devices, a cxl_features device is an awkward fit. Ok. > >[..] >> >> +static int cxl_ps_get_attrs(struct cxl_patrol_scrub_context *cxl_ps_ctx, >> >> + struct cxl_memdev_ps_params *params) { >> >> + struct cxl_memdev *cxlmd; >> >> + u16 min_scrub_cycle = 0; >> >> + int i, ret; >> >> + >> >> + if (cxl_ps_ctx->cxlr) { >> >> + struct cxl_region *cxlr = cxl_ps_ctx->cxlr; >> >> + struct cxl_region_params *p = &cxlr->params; >> >> + >> >> + for (i = p->interleave_ways - 1; i >= 0; i--) { >> >> + struct cxl_endpoint_decoder *cxled = p->targets[i]; >> > >> >It looks like this is called directly as a callback from EDAC. Where >> >is the locking that keeps cxl_ps_ctx->cxlr valid, or p->targets content stable? >> Jonathan already replied. > >I could not find that comment? I *think* it's ok because when the region is in the >probe state changes will not be made to this list, but it would be useful to at >least have commentary to that effect. Protect against someone copying this >code in isolation and not consider the context. Sure. Will do. > >[..] >> >> + >> >> +int cxl_mem_ras_features_init(struct cxl_memdev *cxlmd, struct >> >> +cxl_region *cxlr) >> > >> >Please separate this into a memdev helper and a region helper. It is >> >silly to have two arguments to a function where one is expected to be >> >NULL at all times, and then have an if else statement inside that to >> >effectively turn it back into 2 code paths. >> > >> >If there is code to be shared amongst those, make *that* the shared helper. >> I added single function cxl_mem_ras_features_init() for both memdev >> and region based scrubbing to reduce code size as there were feedbacks try >reduce code size. > >"Succint" and "concise" does not necessarily mean less lines. I would greatly >prefer a few more lines if it mines not outsourcing complexity to the calling >context. Readable code means I do not need to wonder >what: > > cxl_mem_ras_features_init(NULL, cxlr) > >...means. I can just read devm_cxl_region_edac_register(cxlr), and know exactly >what is happening without needing to lose my train of thought to go read what >semantics cxl_mem_ras_features_init() is implementing. > >Note that all the other _init() calls in drivers/cxl/ (outside of module_init >callbacks), are just purely init work, not object registration. Please keep that >local style. Sure. Will add separate functions for region based edac registration. > >> >> +{ >> >> + struct edac_dev_feature ras_features[CXL_DEV_NUM_RAS_FEATURES]; >> >> + char cxl_dev_name[CXL_DEV_NAME_LEN]; >> >> + int num_ras_features = 0; >> >> + u8 scrub_inst = 0; >> >> + int rc; >> >> + >> >> + rc = cxl_memdev_scrub_init(cxlmd, cxlr, >> >&ras_features[num_ras_features], >> >> + scrub_inst); >> >> + if (rc < 0) >> >> + return rc; >> >> + >> >> + scrub_inst++; >> >> + num_ras_features++; >> >> + >> >> + if (cxlr) >> >> + snprintf(cxl_dev_name, sizeof(cxl_dev_name), >> >> + "cxl_region%d", cxlr->id); >> > >> >Why not pass dev_name(&cxlr->dev) directly? >> Jonathan already replied. > >That was purely with the cxl_mem observation, cxlr can be passed directly. Will check. > >> > >> >> + else >> >> + snprintf(cxl_dev_name, sizeof(cxl_dev_name), >> >> + "%s_%s", "cxl", dev_name(&cxlmd->dev)); >> > >> >Can a "cxl" directory be created so that the raw name can be used? > >In fact we already do something similar for CONFIG_HMEM_REPORTING (i.e. >an "access%d" device to create a nameed directory of attributes) so it is a >question for Boris if he wants to tolerate a parent "cxl" device to parent all CXL >objects in EDAC. > >> > >> >> + >> >> + return edac_dev_register(&cxlmd->dev, cxl_dev_name, NULL, >> >> + num_ras_features, ras_features); >> > >> >I'm so confused... a few lines down in this patch we have: >> > >> > rc = cxl_mem_ras_features_init(NULL, cxlr); >> > >> >...so how can this call to edac_dev_register() unconditionally >> >de-reference @cxlmd? >> Thanks for spotting this. It is a bug, need to fix. > > >[..] >> >> +EXPORT_SYMBOL_NS_GPL(cxl_mem_ras_features_init, "CXL"); >> >> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c >> >> index b98b1ccffd1c..c2be70cd87f8 100644 >> >> --- a/drivers/cxl/core/region.c >> >> +++ b/drivers/cxl/core/region.c >> >> @@ -3449,6 +3449,12 @@ static int cxl_region_probe(struct device *dev) >> >> p->res->start, p->res->end, cxlr, >> >> is_system_ram) > 0) >> >> return 0; >> >> + >> >> + rc = cxl_mem_ras_features_init(NULL, cxlr); >> >> + if (rc) >> >> + dev_warn(&cxlr->dev, "CXL RAS features init for >> >region_id=%d failed\n", >> >> + cxlr->id); >> > >> >There is more to RAS than EDAC memory scrub so this message is >> >misleading. It is also unnecessary because the driver continues to >> >load and the admin, if they care, will notice that the EDAC attributes are >missing. >> This message was added for the debugging purpose in CXL driver. I will change >to dev_dbg(). > >...but also stop calling this functionality with the blanket term "RAS". >It is "EDAC scrub and repair extensions to all the other RAS functionality the CXL >subsystem handles directly", name it accordingly. Sure. Will change. Thanks, Shiju