Re: [RFC PATCH v6 08/12] cxl/memscrub: Register CXL device ECS with scrub configure driver

Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx> · Tue, 20 Feb 2024 13:39:55 +0000

On Thu, 15 Feb 2024 19:14:50 +0800
<shiju.jose@xxxxxxxxxx> wrote:

> From: Shiju Jose <shiju.jose@xxxxxxxxxx>
> 
> Register with the scrub configure driver to expose the sysfs attributes
> to the user for configuring the CXL memory device's ECS feature.
> Add the static CXL ECS specific attributes to support configuring the
> CXL memory device ECS feature.
> 
> Signed-off-by: Shiju Jose <shiju.jose@xxxxxxxxxx>

The ABI in here needs documentation.  My key takeaway is that
it is very ECS specific.  I think one of the big challenges of a common scrub
control system is going to be trying to come up with some meaningful 
common ABI.

> ---
>  drivers/cxl/core/memscrub.c | 253 +++++++++++++++++++++++++++++++++++-
>  1 file changed, 250 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/core/memscrub.c b/drivers/cxl/core/memscrub.c
> index a1fb40f8307f..325084b22e7a 100644
> --- a/drivers/cxl/core/memscrub.c
> +++ b/drivers/cxl/core/memscrub.c
> @@ -464,6 +464,8 @@ EXPORT_SYMBOL_NS_GPL(cxl_mem_patrol_scrub_init, CXL);
>  #define CXL_MEMDEV_ECS_GET_FEAT_VERSION	0x01
>  #define CXL_MEMDEV_ECS_SET_FEAT_VERSION	0x01
>  
> +#define CXL_DDR5_ECS	"cxl_ecs"

I would just put these name defines inline.
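
To illustrate what I mean by inline, here is a rough user-space sketch, not
the actual driver code.  The helper name and EXAMPLE_MAX_NAME_LENGTH are made
up for illustration, and memdev_name stands in for dev_name(&cxlmd->dev).
The only point is that the "cxl_ecs" literal lives in the format string
rather than behind a separate define.

```c
#include <stdio.h>

/* Hypothetical stand-in for CXL_MEMDEV_MAX_NAME_LENGTH from the patch. */
#define EXAMPLE_MAX_NAME_LENGTH 128

/*
 * Build the scrub device name with the "cxl_ecs" prefix folded directly
 * into the format string, so no separate name #define is needed.
 * Returns the number of characters snprintf() would have written.
 */
int example_format_scrub_name(char *buf, size_t len,
			      const char *memdev_name, int region_id)
{
	return snprintf(buf, len, "cxl_ecs_%s_region%d",
			memdev_name, region_id);
}
```

With memdev_name "mem0" and region_id 2 this gives "cxl_ecs_mem0_region2",
the same string the define-based version would produce.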

> +enum cxl_mem_ecs_scrub_attributes {
> +	cxl_ecs_log_entry_type,
> +	cxl_ecs_log_entry_type_per_dram,
> +	cxl_ecs_log_entry_type_per_memory_media,
> +	cxl_ecs_mode,
> +	cxl_ecs_mode_counts_codewords,
> +	cxl_ecs_mode_counts_rows,
> +	cxl_ecs_reset,
> +	cxl_ecs_threshold,
> +	cxl_ecs_threshold_available,
> +	cxl_ecs_max_attrs,

This is pretty much all custom ABI.  Challenging to make it common with
the main scrub and RASF controls, but I think we do need to see if we can
come up with something that is at least vaguely consistent across
different forms of scrub control.

What the user cares about is how likely an error is to get past the
scrubbing that is running (I think - RAS folk speak up if I have
this wrong!)

So how do we go from the ECS parameters to that sort of info?
I think ECS is effectively scrubbing at a fixed rate (Google suggests
all RAM every 24 hours).  We are really controlling what info is
reported rather than what scrub is carried out.

Useful stuff to potentially control but different from the
other cases.

> +};

> +
>  int cxl_mem_ecs_init(struct cxl_memdev *cxlmd, int region_id)
>  {
> +	char scrub_name[CXL_MEMDEV_MAX_NAME_LENGTH];
>  	struct cxl_mbox_supp_feat_entry feat_entry;
>  	struct cxl_ecs_context *cxl_ecs_ctx;
> +	struct device *cxl_scrub_dev;

Make this more local as we don't need it out here?

>  	int nmedia_frus;
>  	int ret;
>  
> @@ -755,6 +993,15 @@ int cxl_mem_ecs_init(struct cxl_memdev *cxlmd, int region_id)
>  		cxl_ecs_ctx->get_feat_size = feat_entry.get_feat_size;
>  		cxl_ecs_ctx->set_feat_size = feat_entry.set_feat_size;
>  		cxl_ecs_ctx->region_id = region_id;
> +
> +		snprintf(scrub_name, sizeof(scrub_name), "%s_%s_region%d",
> +			 CXL_DDR5_ECS, dev_name(&cxlmd->dev), cxl_ecs_ctx->region_id);
> +		cxl_scrub_dev = devm_scrub_device_register(&cxlmd->dev, scrub_name,
> +							  cxl_ecs_ctx, NULL,
> +							  cxl_ecs_ctx->region_id,
> +							  &cxl_mem_ecs_attr_group);
> +		if (IS_ERR(cxl_scrub_dev))
> +			return PTR_ERR(cxl_scrub_dev);
>  	}
>  
>  	return 0;