Re: [PATCH v3 02/14] cxl/region: Introduce concept of region configuration

Dan Williams <dan.j.williams@xxxxxxxxx> · Fri, 28 Jan 2022 16:25:34 -0800

On Thu, Jan 27, 2022 at 4:27 PM Ben Widawsky <ben.widawsky@xxxxxxxxx> wrote:
>
> The region creation APIs create a vacant region. Configuring the region
> works in the same way as similar subsystems such as devdax. Sysfs attrs
> will be provided to allow userspace to configure the region.  Finally
> once all configuration is complete, userspace may activate the region.
>
> Introduced here are the most basic attributes needed to configure a
> region. Details of these attribute are described in the ABI

s/attribute/attributes/

> Documentation. Sanity checking of configuration parameters are done at
> region binding time. This consolidates all such logic in one place,
> rather than being strewn across multiple places.

I think that's too late for some of the validation. The complex
validation that the region driver does throughout the topology is
different from the basic input validation that can  be done at the
sysfs write time. For example ,this patch allows negative
interleave_granularity values to specified, just return -EINVAL. I
agree that sysfs should not validate everything, I disagree with
pushing all validation to cxl_region_probe().

>
> A example is provided below:
>
> /sys/bus/cxl/devices/region0.0:0
> ├── interleave_granularity
> ├── interleave_ways
> ├── offset
> ├── size
> ├── subsystem -> ../../../../../../bus/cxl
> ├── target0
> ├── uevent
> └── uuid

As mentioned off-list, it looks like devtype and modalias are missing.

>
> Reported-by: kernel test robot <lkp@xxxxxxxxx> (v2)
> Signed-off-by: Ben Widawsky <ben.widawsky@xxxxxxxxx>
> ---
>  Documentation/ABI/testing/sysfs-bus-cxl |  40 ++++
>  drivers/cxl/core/region.c               | 300 ++++++++++++++++++++++++
>  2 files changed, 340 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-cxl b/Documentation/ABI/testing/sysfs-bus-cxl
> index dcc728458936..50ba5018014d 100644
> --- a/Documentation/ABI/testing/sysfs-bus-cxl
> +++ b/Documentation/ABI/testing/sysfs-bus-cxl
> @@ -187,3 +187,43 @@ Description:
>                 region driver before being deleted. The attributes expects a
>                 region in the form "regionX.Y:Z". The region's name, allocated
>                 by reading create_region, will also be released.
> +
> +What:          /sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/offset

This is just another 'resource' attribute for the physical base
address of the region, right? 'offset' sounds like something that
would be relative instead of absolute.

> +Date:          August, 2021

Same date update comment here.

> +KernelVersion: v5.18
> +Contact:       linux-cxl@xxxxxxxxxxxxxxx
> +Description:
> +               (RO) A region resides within an address space that is claimed by
> +               a decoder.

"A region is a contiguous partition of a CXL Root decoder address space."

>                  Region space allocation is handled by the driver, but

"Region capacity is allocated by writing to the size attribute, the
resulting physical address base determined by the driver is reflected
here."

> +               the offset may be read by userspace tooling in order to
> +               determine fragmentation, and available size for new regions.

I would also expect, before / along with these new region attributes,
there would be 'available' and 'max_extent_available' at the decoder
level to indicate how much free space the decoder has and how big the
next region creation can be. User tooling can walk  the decoder and
the regions together to determine fragmentation if necessary, but for
the most part the tool likely only cares about "how big can the next
region be?" and "how full is this decoder?".

> +
> +What:
> +/sys/bus/cxl/devices/decoderX.Y/regionX.Y:Z/{interleave,size,uuid,target[0-15]}
> +Date:          August, 2021
> +KernelVersion: v5.18
> +Contact:       linux-cxl@xxxxxxxxxxxxxxx
> +Description:
> +               (RW) Configuring regions requires a minimal set of parameters in
> +               order for the subsequent bind operation to succeed. The
> +               following parameters are defined:

Let's split up the descriptions into individual sections. That can
also document the order that attributes must be written. For example,
doesn't size need to be set before targets are added so that targets
can be validated whether they have sufficient capacity?

> +
> +               ==      ========================================================
> +               interleave_granularity Mandatory. Number of consecutive bytes
> +                       each device in the interleave set will claim. The
> +                       possible interleave granularity values are determined by
> +                       the CXL spec and the participating devices.
> +               interleave_ways Mandatory. Number of devices participating in the
> +                       region. Each device will provide 1/interleave of storage
> +                       for the region.
> +               size    Manadatory. Phsyical address space the region will
> +                       consume.

s/Phsyical/Physical/

> +               target  Mandatory. Memory devices are the backing storage for a
> +                       region. There will be N targets based on the number of
> +                       interleave ways that the top level decoder is configured
> +                       for.

That doesn't sound right, IW at the root != IW at the endpoint level
and the region needs to record all the endpoint level targets.

> Each target must be set with a memdev device ie.
> +                       'mem1'. This attribute only becomes available after
> +                       setting the 'interleave' attribute.
> +               uuid    Optional. A unique identifier for the region. If none is
> +                       selected, the kernel will create one.

Let's drop the Mandatory / Optional distinction, or I am otherwise not
understanding what this is trying to document. For example 'uuid' is
"mandatory" for PMEM regions and "omitted" for volatile regions not
optional.

> +               ==      ========================================================
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index 1a448543db0d..3b48e0469fc7 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3,9 +3,12 @@
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/device.h>
>  #include <linux/module.h>
> +#include <linux/sizes.h>
>  #include <linux/slab.h>
> +#include <linux/uuid.h>
>  #include <linux/idr.h>
>  #include <region.h>
> +#include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
>
> @@ -18,11 +21,305 @@
>   * (programming the hardware) is handled by a separate region driver.
>   */
>
> +struct cxl_region *to_cxl_region(struct device *dev);
> +static const struct attribute_group region_interleave_group;
> +
> +static bool is_region_active(struct cxl_region *cxlr)
> +{
> +       /* TODO: Regions can't be activated yet. */
> +       return false;

This function seems redundant with just checking "cxlr->dev.driver !=
NULL"? The benefit of that is there is no need to carry a TODO in the
series.

> +}
> +
> +static void remove_target(struct cxl_region *cxlr, int target)
> +{
> +       struct cxl_memdev *cxlmd;
> +
> +       cxlmd = cxlr->config.targets[target];
> +       if (cxlmd)
> +               put_device(&cxlmd->dev);

A memdev can be a member of multiple regions at once, shouldn't this
be an endpoint decoder or similar, not the entire memdev?

Also, if memdevs autoremove themselves from regions at memdev
->remove() time then I don't think the region needs to hold references
on memdevs.

> +       cxlr->config.targets[target] = NULL;
> +}
> +
> +static ssize_t interleave_ways_show(struct device *dev,
> +                                   struct device_attribute *attr, char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_ways);
> +}
> +
> +static ssize_t interleave_ways_store(struct device *dev,
> +                                    struct device_attribute *attr,
> +                                    const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       int ret, prev_iw;
> +       int val;

I would expect:

if (dev->driver)
   return -EBUSY;

...to shutdown configuration writes once the region is active. Might
also need a region-wide seqlock like target_list_show. So that region
probe drains  all active sysfs writers before assuming the
configuration is stable.

> +
> +       prev_iw = cxlr->config.interleave_ways;
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +       if (ret < 0 || ret > CXL_DECODER_MAX_INTERLEAVE)
> +               return -EINVAL;
> +
> +       cxlr->config.interleave_ways = val;
> +
> +       ret = sysfs_update_group(&dev->kobj, &region_interleave_group);
> +       if (ret < 0)
> +               goto err;
> +
> +       sysfs_notify(&dev->kobj, NULL, "target_interleave");

Why?

> +
> +       while (prev_iw > cxlr->config.interleave_ways)
> +               remove_target(cxlr, --prev_iw);

To make the kernel side simpler this attribute could just require that
setting interleave ways is a one way street, if you want to change it
you need to delete the region and start over.

> +
> +       return len;
> +
> +err:
> +       cxlr->config.interleave_ways = prev_iw;
> +       return ret;
> +}
> +static DEVICE_ATTR_RW(interleave_ways);
> +
> +static ssize_t interleave_granularity_show(struct device *dev,
> +                                          struct device_attribute *attr,
> +                                          char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%d\n", cxlr->config.interleave_granularity);
> +}
> +
> +static ssize_t interleave_granularity_store(struct device *dev,
> +                                           struct device_attribute *attr,
> +                                           const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       int val, ret;
> +
> +       ret = kstrtoint(buf, 0, &val);
> +       if (ret)
> +               return ret;
> +       cxlr->config.interleave_granularity = val;

This wants minimum input validation and synchronization against an
active region.

> +
> +       return len;
> +}
> +static DEVICE_ATTR_RW(interleave_granularity);
> +
> +static ssize_t offset_show(struct device *dev, struct device_attribute *attr,
> +                          char *buf)
> +{
> +       struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       resource_size_t offset;
> +
> +       if (!cxlr->res)
> +               return sysfs_emit(buf, "\n");

Should be an error I would think. I.e. require size to be set before
s/offset/resource/ can be read.

> +
> +       offset = cxld->platform_res.start - cxlr->res->start;

Why make usersapce do the offset math?

> +
> +       return sysfs_emit(buf, "%pa\n", &offset);
> +}
> +static DEVICE_ATTR_RO(offset);

This can be DEVICE_ATTR_ADMIN_RO() to hide physical address layout
information from non-root.

> +
> +static ssize_t size_show(struct device *dev, struct device_attribute *attr,
> +                        char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%llu\n", cxlr->config.size);

Perhaps no need to store size separately if this becomes:

sysfs_emit(buf, "%llu\n", (unsigned long long) resource_size(cxlr->res));

...?

> +}
> +
> +static ssize_t size_store(struct device *dev, struct device_attribute *attr,
> +                         const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       unsigned long long val;
> +       ssize_t rc;
> +
> +       rc = kstrtoull(buf, 0, &val);
> +       if (rc)
> +               return rc;
> +
> +       device_lock(&cxlr->dev);
> +       if (is_region_active(cxlr))
> +               rc = -EBUSY;
> +       else
> +               cxlr->config.size = val;
> +       device_unlock(&cxlr->dev);

I think lockdep will complain about device_lock() usage in an
attribute. Try changing this to cxl_device_lock() with
CONFIG_PROVE_CXL_LOCKING=y.

> +
> +       return rc ? rc : len;
> +}
> +static DEVICE_ATTR_RW(size);
> +
> +static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
> +                        char *buf)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       return sysfs_emit(buf, "%pUb\n", &cxlr->config.uuid);
> +}
> +
> +static ssize_t uuid_store(struct device *dev, struct device_attribute *attr,
> +                         const char *buf, size_t len)
> +{
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +       ssize_t rc;
> +
> +       if (len != UUID_STRING_LEN + 1)
> +               return -EINVAL;
> +
> +       device_lock(&cxlr->dev);
> +       if (is_region_active(cxlr))
> +               rc = -EBUSY;
> +       else
> +               rc = uuid_parse(buf, &cxlr->config.uuid);
> +       device_unlock(&cxlr->dev);
> +
> +       return rc ? rc : len;
> +}
> +static DEVICE_ATTR_RW(uuid);
> +
> +static struct attribute *region_attrs[] = {
> +       &dev_attr_interleave_ways.attr,
> +       &dev_attr_interleave_granularity.attr,
> +       &dev_attr_offset.attr,
> +       &dev_attr_size.attr,
> +       &dev_attr_uuid.attr,
> +       NULL,
> +};
> +
> +static const struct attribute_group region_group = {
> +       .attrs = region_attrs,
> +};
> +
> +static size_t show_targetN(struct cxl_region *cxlr, char *buf, int n)
> +{
> +       int ret;
> +
> +       device_lock(&cxlr->dev);
> +       if (!cxlr->config.targets[n])
> +               ret = sysfs_emit(buf, "\n");
> +       else
> +               ret = sysfs_emit(buf, "%s\n",
> +                                dev_name(&cxlr->config.targets[n]->dev));
> +       device_unlock(&cxlr->dev);

The component contribution of a memdev to a region is a DPA-span, not
the whole memdev. I would expect something like dax_mapping_attributes
or REGION_MAPPING() from drivers/nvdimm/region_devs.c. A tuple of
information about the component contribution of a memdev to a region.

> +
> +       return ret;
> +}
> +
> +static size_t set_targetN(struct cxl_region *cxlr, const char *buf, int n,
> +                         size_t len)
> +{
> +       struct device *memdev_dev;
> +       struct cxl_memdev *cxlmd;
> +
> +       device_lock(&cxlr->dev);
> +
> +       if (len == 1 || cxlr->config.targets[n])
> +               remove_target(cxlr, n);
> +
> +       /* Remove target special case */
> +       if (len == 1) {
> +               device_unlock(&cxlr->dev);
> +               return len;
> +       }
> +
> +       memdev_dev = bus_find_device_by_name(&cxl_bus_type, NULL, buf);

I think this wants to be an endpoint decoder, not a memdev. Because
it's the decoder that joins a memdev to a region, or at least a
decoder should be picked when the memdev is assigned so that the DPA
mapping can be registered. If all the decoders are allocated then fail
here.

> +       if (!memdev_dev) {
> +               device_unlock(&cxlr->dev);
> +               return -ENOENT;
> +       }
> +
> +       /* reference to memdev held until target is unset or region goes away */
> +
> +       cxlmd = to_cxl_memdev(memdev_dev);
> +       cxlr->config.targets[n] = cxlmd;
> +
> +       device_unlock(&cxlr->dev);
> +
> +       return len;
> +}
> +
> +#define TARGET_ATTR_RW(n)                                                      \
> +       static ssize_t target##n##_show(                                       \
> +               struct device *dev, struct device_attribute *attr, char *buf)  \
> +       {                                                                      \
> +               return show_targetN(to_cxl_region(dev), buf, (n));             \
> +       }                                                                      \
> +       static ssize_t target##n##_store(struct device *dev,                   \
> +                                        struct device_attribute *attr,        \
> +                                        const char *buf, size_t len)          \
> +       {                                                                      \
> +               return set_targetN(to_cxl_region(dev), buf, (n), len);         \
> +       }                                                                      \
> +       static DEVICE_ATTR_RW(target##n)
> +
> +TARGET_ATTR_RW(0);
> +TARGET_ATTR_RW(1);
> +TARGET_ATTR_RW(2);
> +TARGET_ATTR_RW(3);
> +TARGET_ATTR_RW(4);
> +TARGET_ATTR_RW(5);
> +TARGET_ATTR_RW(6);
> +TARGET_ATTR_RW(7);
> +TARGET_ATTR_RW(8);
> +TARGET_ATTR_RW(9);
> +TARGET_ATTR_RW(10);
> +TARGET_ATTR_RW(11);
> +TARGET_ATTR_RW(12);
> +TARGET_ATTR_RW(13);
> +TARGET_ATTR_RW(14);
> +TARGET_ATTR_RW(15);
> +
> +static struct attribute *interleave_attrs[] = {
> +       &dev_attr_target0.attr,
> +       &dev_attr_target1.attr,
> +       &dev_attr_target2.attr,
> +       &dev_attr_target3.attr,
> +       &dev_attr_target4.attr,
> +       &dev_attr_target5.attr,
> +       &dev_attr_target6.attr,
> +       &dev_attr_target7.attr,
> +       &dev_attr_target8.attr,
> +       &dev_attr_target9.attr,
> +       &dev_attr_target10.attr,
> +       &dev_attr_target11.attr,
> +       &dev_attr_target12.attr,
> +       &dev_attr_target13.attr,
> +       &dev_attr_target14.attr,
> +       &dev_attr_target15.attr,
> +       NULL,
> +};
> +
> +static umode_t visible_targets(struct kobject *kobj, struct attribute *a, int n)
> +{
> +       struct device *dev = container_of(kobj, struct device, kobj);
> +       struct cxl_region *cxlr = to_cxl_region(dev);
> +
> +       if (n < cxlr->config.interleave_ways)
> +               return a->mode;
> +       return 0;
> +}
> +
> +static const struct attribute_group region_interleave_group = {
> +       .attrs = interleave_attrs,
> +       .is_visible = visible_targets,
> +};
> +
> +static const struct attribute_group *region_groups[] = {
> +       &region_group,
> +       &region_interleave_group,
> +       NULL,
> +};
> +
>  static void cxl_region_release(struct device *dev);
>
>  static const struct device_type cxl_region_type = {
>         .name = "cxl_region",
>         .release = cxl_region_release,
> +       .groups = region_groups
>  };
>
>  static ssize_t create_region_show(struct device *dev,
> @@ -108,8 +405,11 @@ static void cxl_region_release(struct device *dev)
>  {
>         struct cxl_decoder *cxld = to_cxl_decoder(dev->parent);
>         struct cxl_region *cxlr = to_cxl_region(dev);
> +       int i;
>
>         ida_free(&cxld->region_ida, cxlr->id);
> +       for (i = 0; i < cxlr->config.interleave_ways; i++)
> +               remove_target(cxlr, i);

Like the last patch this feels too late. I expect whatever unregisters
the region should have already handled removing the targets.

>         kfree(cxlr);
>  }
>
> --
> 2.35.0
>