On 11/6/2024 2:38 AM, ira.weiny@xxxxxxxxx wrote:
> From: Navneet Singh <navneet.singh@xxxxxxxxx>
>
> A dynamic capacity device (DCD) sends events to signal the host about
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and metadata for memory
> to be added or removed. Events may be sent from the device at any time.
>
> Three types of events can be signaled: Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered. If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory. If no region exists the release can be sent
> immediately. The host may also release extents (or partial extents) at
> any time. Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
> Force removal is intended as a mechanism between the FM and the device,
> for use only when the host is unresponsive, out of sync, or otherwise
> broken. Purposely ignore force removal events.
>
> Regions are made up of one or more devices which may be surfacing memory
> to the host. Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving, a device extent forms a 1:1 relationship with the
> region extent. Immediately surface a region extent upon getting a
> device extent.
>
> Per the specification the device is allowed to offer or remove extents
> at any time. However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
>
> Simplify extent tracking with the following restrictions.
>
>     1) Flag for removal any extent which overlaps a requested
>        release range.
>     2) Refuse the offer of extents which overlap already accepted
>        memory ranges.
>     3) Accept again a range which has already been accepted by the
>        host. Eating duplicates serves three purposes. First, this
>        simplifies the code if the device should get out of sync with
>        the host, and it should be safe to acknowledge the extent
>        again. Second, this simplifies the code to process existing
>        extents if the extent list should change while the extent
>        list is being read. Third, duplicates for a given region
>        which are seen during a race between the hardware surfacing
>        an extent and the cxl dax driver scanning for existing
>        extents will be ignored.
>
> NOTE: Processing existing extents is done in a later patch.
>
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer. Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
>
> Tags are not yet supported within the DAX layer. To maintain
> compatibility with legacy DAX/region processing, only tags with a value
> of 0 are allowed. This defines existing DAX devices as having a 0 tag,
> which makes the most logical sense as a default.
>
> Process DCD events and create region devices.
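
Not an objection, just restating the accept rules to make sure I read them
right: restrictions 2) and 3) boil down to a single per-offer check against
what has already been accepted. A self-contained sketch of that check as I
understand it (plain C, illustration only; none of these names exist in the
patch, and restriction 1, flagging extents on release, is a separate path
not shown):

#include <stdbool.h>
#include <stdint.h>

struct dpa_range {
        uint64_t start;
        uint64_t len;
};

static bool ranges_equal(struct dpa_range a, struct dpa_range b)
{
        return a.start == b.start && a.len == b.len;
}

static bool ranges_overlap(struct dpa_range a, struct dpa_range b)
{
        return a.start < b.start + b.len && b.start < a.start + a.len;
}

/* true if an offered extent should be accepted per the cover letter */
static bool accept_offered(const struct dpa_range *accepted, int nr,
                           struct dpa_range offered)
{
        for (int i = 0; i < nr; i++) {
                if (ranges_equal(accepted[i], offered))
                        return true;    /* 3) exact duplicate: ack again */
                if (ranges_overlap(accepted[i], offered))
                        return false;   /* 2) partial overlap: refuse */
        }
        return true;
}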
>
> Signed-off-by: Navneet Singh <navneet.singh@xxxxxxxxx>
> Reviewed-by: Dave Jiang <dave.jiang@xxxxxxxxx>
> Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
>
> ---
> Changes:
> [Jonathan: include xarray headers as appropriate]
> [iweiny: Use UUID format specifier for tag values in debug messages]
> ---
>  drivers/cxl/core/Makefile |   2 +-
>  drivers/cxl/core/core.h   |  13 ++
>  drivers/cxl/core/extent.c | 371 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c   | 295 +++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c |   3 +
>  drivers/cxl/cxl.h         |  53 ++++++-
>  drivers/cxl/cxlmem.h      |  27 ++++
>  include/cxl/event.h       |  32 ++++
>  tools/testing/cxl/Kbuild  |   3 +-
>  9 files changed, 795 insertions(+), 4 deletions(-)
>
[snip]

> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +                                struct xarray *extent_array, int cnt)
> +{
> +        struct cxl_mailbox *cxl_mbox = &mds->cxlds.cxl_mbox;
> +        struct cxl_mbox_dc_response *p;
> +        struct cxl_extent *extent;
> +        unsigned long index;
> +        u32 pl_index;
> +
> +        size_t pl_size = struct_size(p, extent_list, cnt);
> +        u32 max_extents = cnt;
> +
> +        /* May have to use more bit on response. */
> +        if (pl_size > cxl_mbox->payload_size) {
> +                max_extents = (cxl_mbox->payload_size - sizeof(*p)) /
> +                              sizeof(struct updated_extent_list);
> +                pl_size = struct_size(p, extent_list, max_extents);
> +        }
> +
> +        struct cxl_mbox_dc_response *response __free(kfree) =
> +                kzalloc(pl_size, GFP_KERNEL);
> +        if (!response)
> +                return -ENOMEM;
> +
> +        if (cnt == 0)
> +                return send_one_response(cxl_mbox, response, opcode, 0, 0);
> +
> +        pl_index = 0;
> +        xa_for_each(extent_array, index, extent) {
> +                response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +                response->extent_list[pl_index].length = extent->length;
> +                pl_index++;
> +
> +                if (pl_index == max_extents) {
> +                        u8 flags = 0;
> +                        int rc;
> +
> +                        if (pl_index < cnt)
> +                                flags &= CXL_DCD_EVENT_MORE;

Should be 'flags |= CXL_DCD_EVENT_MORE' here; with '&=' on a just-zeroed
'flags' the MORE bit can never be set, so every payload of a multi-response
series would claim to be the final one. Everything else looks good to me.
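
For concreteness, the branch with that fix applied (only the operator
changes; the rest of the loop is as quoted):

                if (pl_index == max_extents) {
                        u8 flags = 0;
                        int rc;

                        /* extents remain beyond this payload: not final */
                        if (pl_index < cnt)
                                flags |= CXL_DCD_EVENT_MORE;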
Reviewed-by: Li Ming <ming4.li@xxxxxxxxx>

> +                        rc = send_one_response(cxl_mbox, response, opcode,
> +                                               pl_index, flags);
> +                        if (rc)
> +                                return rc;
> +                        cnt -= pl_index;
> +                        pl_index = 0;
> +                }
> +        }
> +
> +        if (!pl_index) /* nothing more to do */
> +                return 0;
> +        return send_one_response(cxl_mbox, response, opcode, pl_index, 0);
> +}
> +
> +void memdev_release_extent(struct cxl_memdev_state *mds, struct range *range)
> +{
> +        struct device *dev = mds->cxlds.dev;
> +        struct xarray extent_list;
> +
> +        struct cxl_extent extent = {
> +                .start_dpa = cpu_to_le64(range->start),
> +                .length = cpu_to_le64(range_len(range)),
> +        };
> +
> +        dev_dbg(dev, "Release response dpa [range 0x%016llx-0x%016llx]\n",
> +                range->start, range->end);
> +
> +        xa_init(&extent_list);
> +        if (xa_insert(&extent_list, 0, &extent, GFP_KERNEL)) {
> +                dev_dbg(dev, "Failed to release [range 0x%016llx-0x%016llx]\n",
> +                        range->start, range->end);
> +                goto destroy;
> +        }
> +
> +        if (cxl_send_dc_response(mds, CXL_MBOX_OP_RELEASE_DC, &extent_list, 1))
> +                dev_dbg(dev, "Failed to release [range 0x%016llx-0x%016llx]\n",
> +                        range->start, range->end);
> +
> +destroy:
> +        xa_destroy(&extent_list);
> +}
> +
> +static int validate_add_extent(struct cxl_memdev_state *mds,
> +                               struct cxl_extent *extent)
> +{
> +        int rc;
> +
> +        rc = cxl_validate_extent(mds, extent);
> +        if (rc)
> +                return rc;
> +
> +        return cxl_add_extent(mds, extent);
> +}
> +
> +static int cxl_add_pending(struct cxl_memdev_state *mds)
> +{
> +        struct device *dev = mds->cxlds.dev;
> +        struct cxl_extent *extent;
> +        unsigned long cnt = 0;
> +        unsigned long index;
> +        int rc;
> +
> +        xa_for_each(&mds->pending_extents, index, extent) {
> +                if (validate_add_extent(mds, extent)) {
> +                        /*
> +                         * Any extents which are to be rejected are omitted from
> +                         * the response. An empty response means all are
> +                         * rejected.
> +                         */
> +                        dev_dbg(dev, "unconsumed DC extent DPA:%#llx LEN:%#llx\n",
> +                                le64_to_cpu(extent->start_dpa),
> +                                le64_to_cpu(extent->length));
> +                        xa_erase(&mds->pending_extents, index);
> +                        kfree(extent);
> +                        continue;
> +                }
> +                cnt++;
> +        }
> +        rc = cxl_send_dc_response(mds, CXL_MBOX_OP_ADD_DC_RESPONSE,
> +                                  &mds->pending_extents, cnt);
> +        xa_for_each(&mds->pending_extents, index, extent) {
> +                xa_erase(&mds->pending_extents, index);
> +                kfree(extent);
> +        }
> +        return rc;
> +}
> +
> +static int handle_add_event(struct cxl_memdev_state *mds,
> +                            struct cxl_event_dcd *event)
> +{
> +        struct device *dev = mds->cxlds.dev;
> +        struct cxl_extent *extent;
> +
> +        extent = kmemdup(&event->extent, sizeof(*extent), GFP_KERNEL);
> +        if (!extent)
> +                return -ENOMEM;
> +
> +        if (xa_insert(&mds->pending_extents, (unsigned long)extent, extent,
> +                      GFP_KERNEL)) {
> +                kfree(extent);
> +                return -ENOMEM;
> +        }
> +
> +        if (event->flags & CXL_DCD_EVENT_MORE) {
> +                dev_dbg(dev, "more bit set; delay the surfacing of extent\n");
> +                return 0;
> +        }
> +
> +        /* extents are removed and free'ed in cxl_add_pending() */
> +        return cxl_add_pending(mds);
> +}
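
One non-blocking note for future readers: handle_add_event() uses the
extent's own kernel address as its xarray index. Since the pointer is
unique for the object's lifetime, no separate key needs to be allocated,
and the same cast finds or erases the entry later. A minimal sketch of the
idiom (illustration only, not from this patch):

#include <linux/xarray.h>
#include <linux/slab.h>

struct item {
        int payload;
};

/* returns 0, or -EBUSY if this same pointer was already inserted */
static int track(struct xarray *xa, struct item *it)
{
        return xa_insert(xa, (unsigned long)it, it, GFP_KERNEL);
}

/* mirrors the cleanup in cxl_add_pending()/clear_pending_extents() */
static void untrack_all(struct xarray *xa)
{
        struct item *it;
        unsigned long index;

        xa_for_each(xa, index, it) {
                xa_erase(xa, index);
                kfree(it);
        }
}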
> +
> +static char *cxl_dcd_evt_type_str(u8 type)
> +{
> +        switch (type) {
> +        case DCD_ADD_CAPACITY:
> +                return "add";
> +        case DCD_RELEASE_CAPACITY:
> +                return "release";
> +        case DCD_FORCED_CAPACITY_RELEASE:
> +                return "force release";
> +        default:
> +                break;
> +        }
> +
> +        return "<unknown>";
> +}
> +
> +static void cxl_handle_dcd_event_records(struct cxl_memdev_state *mds,
> +                                         struct cxl_event_record_raw *raw_rec)
> +{
> +        struct cxl_event_dcd *event = &raw_rec->event.dcd;
> +        struct cxl_extent *extent = &event->extent;
> +        struct device *dev = mds->cxlds.dev;
> +        uuid_t *id = &raw_rec->id;
> +        int rc;
> +
> +        if (!uuid_equal(id, &CXL_EVENT_DC_EVENT_UUID))
> +                return;
> +
> +        dev_dbg(dev, "DCD event %s : DPA:%#llx LEN:%#llx\n",
> +                cxl_dcd_evt_type_str(event->event_type),
> +                le64_to_cpu(extent->start_dpa), le64_to_cpu(extent->length));
> +
> +        switch (event->event_type) {
> +        case DCD_ADD_CAPACITY:
> +                rc = handle_add_event(mds, event);
> +                break;
> +        case DCD_RELEASE_CAPACITY:
> +                rc = cxl_rm_extent(mds, &event->extent);
> +                break;
> +        case DCD_FORCED_CAPACITY_RELEASE:
> +                dev_err_ratelimited(dev, "Forced release event ignored.\n");
> +                rc = 0;
> +                break;
> +        default:
> +                rc = -EINVAL;
> +                break;
> +        }
> +
> +        if (rc)
> +                dev_err_ratelimited(dev, "dcd event failed: %d\n", rc);
> +}
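
To double check my understanding of the MORE bit handling on the add path,
the flow for a three-extent group would be:

  device event log            host action (this patch)
  ------------------          ------------------------------------------
  Add extent A, flags=MORE    kmemdup() A; stash in pending_extents
  Add extent B, flags=MORE    kmemdup() B; stash in pending_extents
  Add extent C, flags=0       stash C, then cxl_add_pending():
                                validate A, B, C; drop any rejects;
                                send one Add-DC-Response listing only
                                the accepted extents; empty the xarray

An empty response (all three rejected) then means "all rejected", per the
comment in cxl_add_pending(). That matches my reading of the code.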
> +
>  static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>                                      enum cxl_event_log_type type)
>  {
> @@ -1053,9 +1324,13 @@ static void cxl_mem_get_records_log(struct cxl_memdev_state *mds,
>                  if (!nr_rec)
>                          break;
>
> -                for (i = 0; i < nr_rec; i++)
> +                for (i = 0; i < nr_rec; i++) {
>                          __cxl_event_trace_record(cxlmd, type,
>                                                   &payload->records[i]);
> +                        if (type == CXL_EVENT_TYPE_DCD)
> +                                cxl_handle_dcd_event_records(mds,
> +                                                             &payload->records[i]);
> +                }
>
>                  if (payload->flags & CXL_GET_EVENT_FLAG_OVERFLOW)
>                          trace_cxl_overflow(cxlmd, type, payload);
> @@ -1087,6 +1362,8 @@ void cxl_mem_get_event_records(struct cxl_memdev_state *mds, u32 status)
>  {
>          dev_dbg(mds->cxlds.dev, "Reading event logs: %x\n", status);
>
> +        if (cxl_dcd_supported(mds) && (status & CXLDEV_EVENT_STATUS_DCD))
> +                cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_DCD);
>          if (status & CXLDEV_EVENT_STATUS_FATAL)
>                  cxl_mem_get_records_log(mds, CXL_EVENT_TYPE_FATAL);
>          if (status & CXLDEV_EVENT_STATUS_FAIL)
> @@ -1632,9 +1909,21 @@ int cxl_mailbox_init(struct cxl_mailbox *cxl_mbox, struct device *host)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_mailbox_init, CXL);
>
> +static void clear_pending_extents(void *_mds)
> +{
> +        struct cxl_memdev_state *mds = _mds;
> +        struct cxl_extent *extent;
> +        unsigned long index;
> +
> +        xa_for_each(&mds->pending_extents, index, extent)
> +                kfree(extent);
> +        xa_destroy(&mds->pending_extents);
> +}
> +
>  struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>  {
>          struct cxl_memdev_state *mds;
> +        int rc;
>
>          mds = devm_kzalloc(dev, sizeof(*mds), GFP_KERNEL);
>          if (!mds) {
> @@ -1651,6 +1940,10 @@ struct cxl_memdev_state *cxl_memdev_state_create(struct device *dev)
>          mds->pmem_perf.qos_class = CXL_QOS_CLASS_INVALID;
>          for (int i = 0; i < CXL_MAX_DC_REGION; i++)
>                  mds->dc_perf[i].qos_class = CXL_QOS_CLASS_INVALID;
> +        xa_init(&mds->pending_extents);
> +        rc = devm_add_action_or_reset(dev, clear_pending_extents, mds);
> +        if (rc)
> +                return ERR_PTR(rc);
>
>          return mds;
>  }
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index a0c181cc33e4988e5c841d5b009d3d4aed5606c1..6ae51fc2bdae22fc25cc73773916714171512e92 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -3036,6 +3036,7 @@ static void cxl_dax_region_release(struct device *dev)
>  {
>          struct cxl_dax_region *cxlr_dax = to_cxl_dax_region(dev);
>
> +        ida_destroy(&cxlr_dax->extent_ida);
>          kfree(cxlr_dax);
>  }
>
> @@ -3089,6 +3090,8 @@ static struct cxl_dax_region *cxl_dax_region_alloc(struct cxl_region *cxlr)
>
>          dev = &cxlr_dax->dev;
>          cxlr_dax->cxlr = cxlr;
> +        cxlr->cxlr_dax = cxlr_dax;
> +        ida_init(&cxlr_dax->extent_ida);
>          device_initialize(dev);
>          lockdep_set_class(&dev->mutex, &cxl_dax_region_key);
>          device_set_pm_not_required(dev);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 486ceaafa85c3ac1efd438b6d6b9ccd0860dde45..990d0b2c5393fb2f81f36f928988412c48a17333 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,8 @@
>  #include <linux/log2.h>
>  #include <linux/node.h>
>  #include <linux/io.h>
> +#include <linux/xarray.h>
> +#include <cxl/event.h>
>
>  extern const struct nvdimm_security_ops *cxl_security_ops;
>
> @@ -169,11 +171,13 @@ static inline int ways_to_eiw(unsigned int ways, u8 *eiw)
>  #define CXLDEV_EVENT_STATUS_WARN        BIT(1)
>  #define CXLDEV_EVENT_STATUS_FAIL        BIT(2)
>  #define CXLDEV_EVENT_STATUS_FATAL       BIT(3)
> +#define CXLDEV_EVENT_STATUS_DCD         BIT(4)
>
>  #define CXLDEV_EVENT_STATUS_ALL (CXLDEV_EVENT_STATUS_INFO |    \
>                                   CXLDEV_EVENT_STATUS_WARN |    \
>                                   CXLDEV_EVENT_STATUS_FAIL |    \
> -                                 CXLDEV_EVENT_STATUS_FATAL)
> +                                 CXLDEV_EVENT_STATUS_FATAL |   \
> +                                 CXLDEV_EVENT_STATUS_DCD)
>
>  /* CXL rev 3.0 section 8.2.9.2.4; Table 8-52 */
>  #define CXLDEV_EVENT_INT_MODE_MASK      GENMASK(1, 0)
> @@ -442,6 +446,18 @@ enum cxl_decoder_state {
>          CXL_DECODER_STATE_AUTO,
>  };
>
> +/**
> + * struct cxled_extent - Extent within an endpoint decoder
> + * @cxled: Reference to the endpoint decoder
> + * @dpa_range: DPA range this extent covers within the decoder
> + * @tag: Tag from device for this extent
> + */
> +struct cxled_extent {
> +        struct cxl_endpoint_decoder *cxled;
> +        struct range dpa_range;
> +        u8 tag[CXL_EXTENT_TAG_LEN];
> +};
> +
>  /**
>   * struct cxl_endpoint_decoder - Endpoint / SPA to DPA decoder
>   * @cxld: base cxl_decoder_object
> @@ -567,6 +583,7 @@ struct cxl_region_params {
>   * @type: Endpoint decoder target type
>   * @cxl_nvb: nvdimm bridge for coordinating @cxlr_pmem setup / shutdown
>   * @cxlr_pmem: (for pmem regions) cached copy of the nvdimm bridge
> + * @cxlr_dax: (for DC regions) cached copy of CXL DAX bridge
>   * @flags: Region state flags
>   * @params: active + config params for the region
>   * @coord: QoS access coordinates for the region
> @@ -580,6 +597,7 @@ struct cxl_region {
>          enum cxl_decoder_type type;
>          struct cxl_nvdimm_bridge *cxl_nvb;
>          struct cxl_pmem_region *cxlr_pmem;
> +        struct cxl_dax_region *cxlr_dax;
>          unsigned long flags;
>          struct cxl_region_params params;
>          struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> @@ -620,12 +638,45 @@ struct cxl_pmem_region {
>          struct cxl_pmem_region_mapping mapping[];
>  };
>
> +/* See CXL 3.1 8.2.9.2.1.6 */
> +enum dc_event {
> +        DCD_ADD_CAPACITY,
> +        DCD_RELEASE_CAPACITY,
> +        DCD_FORCED_CAPACITY_RELEASE,
> +        DCD_REGION_CONFIGURATION_UPDATED,
> +};
> +
>  struct cxl_dax_region {
>          struct device dev;
>          struct cxl_region *cxlr;
>          struct range hpa_range;
> +        struct ida extent_ida;
>  };
>
> +/**
> + * struct region_extent - CXL DAX region extent
> + * @dev: device representing this extent
> + * @cxlr_dax: back reference to parent region device
> + * @hpa_range: HPA range of this extent
> + * @tag: tag of the extent
> + * @decoder_extents: Endpoint decoder extents which make up this region extent
> + */
> +struct region_extent {
> +        struct device dev;
> +        struct cxl_dax_region *cxlr_dax;
> +        struct range hpa_range;
> +        uuid_t tag;
> +        struct xarray decoder_extents;
> +};
> +
> +bool is_region_extent(struct device *dev);
> +static inline struct region_extent *to_region_extent(struct device *dev)
> +{
> +        if (!is_region_extent(dev))
> +                return NULL;
> +        return container_of(dev, struct region_extent, dev);
> +}
> +
>  /**
>   * struct cxl_port - logical collection of upstream port devices and
>   * downstream port devices to construct a CXL memory
> diff --git a/drivers/cxl/cxlmem.h b/drivers/cxl/cxlmem.h
> index 863899b295b719b57638ee060e494e5cf2d639fd..73dee28bbd803a8f78686e833f8ef3492ca94e66 100644
> --- a/drivers/cxl/cxlmem.h
> +++ b/drivers/cxl/cxlmem.h
> @@ -7,6 +7,7 @@
>  #include <linux/cdev.h>
>  #include <linux/uuid.h>
>  #include <linux/node.h>
> +#include <linux/xarray.h>
>  #include <cxl/event.h>
>  #include <cxl/mailbox.h>
>  #include "cxl.h"
> @@ -506,6 +507,7 @@ static inline struct cxl_dev_state *mbox_to_cxlds(struct cxl_mailbox *cxl_mbox)
>   * @pmem_perf: performance data entry matched to PMEM partition
>   * @nr_dc_region: number of DC regions implemented in the memory device
>   * @dc_region: array containing info about the DC regions
> + * @pending_extents: array of extents pending during more bit processing
>   * @event: event log driver state
>   * @poison: poison driver state info
>   * @security: security driver state info
> @@ -538,6 +540,7 @@ struct cxl_memdev_state {
>          u8 nr_dc_region;
>          struct cxl_dc_region_info dc_region[CXL_MAX_DC_REGION];
>          struct cxl_dpa_perf dc_perf[CXL_MAX_DC_REGION];
> +        struct xarray pending_extents;
>
>          struct cxl_event_state event;
>          struct cxl_poison_state poison;
> @@ -609,6 +612,21 @@ enum cxl_opcode {
>          UUID_INIT(0x5e1819d9, 0x11a9, 0x400c, 0x81, 0x1f, 0xd6, 0x07, 0x19, \
>                    0x40, 0x3d, 0x86)
>
> +/*
> + * Add Dynamic Capacity Response
> + * CXL rev 3.1 section 8.2.9.9.9.3; Table 8-168 & Table 8-169
> + */
> +struct cxl_mbox_dc_response {
> +        __le32 extent_list_size;
> +        u8 flags;
> +        u8 reserved[3];
> +        struct updated_extent_list {
> +                __le64 dpa_start;
> +                __le64 length;
> +                u8 reserved[8];
> +        } __packed extent_list[];
> +} __packed;
> +
>  struct cxl_mbox_get_supported_logs {
>          __le16 entries;
>          u8 rsvd[6];
> @@ -671,6 +689,14 @@ struct cxl_mbox_identify {
>          UUID_INIT(0xfe927475, 0xdd59, 0x4339, 0xa5, 0x86, 0x79, 0xba, 0xb1, \
>                    0x13, 0xb7, 0x74)
>
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1; Table 8-43
> + */
> +#define CXL_EVENT_DC_EVENT_UUID                                              \
> +        UUID_INIT(0xca95afa7, 0xf183, 0x4018, 0x8c, 0x2f, 0x95, 0x26, 0x8e, \
> +                  0x10, 0x1a, 0x2a)
> +
>  /*
>   * Get Event Records output payload
>   * CXL rev 3.0 section 8.2.9.2.2; Table 8-50
> @@ -696,6 +722,7 @@ enum cxl_event_log_type {
>          CXL_EVENT_TYPE_WARN,
>          CXL_EVENT_TYPE_FAIL,
>          CXL_EVENT_TYPE_FATAL,
> +        CXL_EVENT_TYPE_DCD,
>          CXL_EVENT_TYPE_MAX
>  };
>
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index 0bea1afbd747c4937b15703b581c569e7fa45ae4..eeda8059d81abef2fbf28cd3f3a6e516c9710229 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -96,11 +96,43 @@ struct cxl_event_mem_module {
>          u8 reserved[0x3d];
>  } __packed;
>
> +/*
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-51
> + */
> +#define CXL_EXTENT_TAG_LEN 0x10
> +struct cxl_extent {
> +        __le64 start_dpa;
> +        __le64 length;
> +        u8 tag[CXL_EXTENT_TAG_LEN];
> +        __le16 shared_extn_seq;
> +        u8 reserved[0x6];
> +} __packed;
> +
> +/*
> + * Dynamic Capacity Event Record
> + * CXL rev 3.1 section 8.2.9.2.1.6; Table 8-50
> + */
> +#define CXL_DCD_EVENT_MORE BIT(0)
> +struct cxl_event_dcd {
> +        struct cxl_event_record_hdr hdr;
> +        u8 event_type;
> +        u8 validity_flags;
> +        __le16 host_id;
> +        u8 region_index;
> +        u8 flags;
> +        u8 reserved1[0x2];
> +        struct cxl_extent extent;
> +        u8 reserved2[0x18];
> +        __le32 num_avail_extents;
> +        __le32 num_avail_tags;
> +} __packed;
> +
>  union cxl_event {
>          struct cxl_event_generic generic;
>          struct cxl_event_gen_media gen_media;
>          struct cxl_event_dram dram;
>          struct cxl_event_mem_module mem_module;
> +        struct cxl_event_dcd dcd;
>          /* dram & gen_media event header */
>          struct cxl_event_media_hdr media_hdr;
>  } __packed;
> diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
> index b1256fee3567fc7743812ee14bc46e09b7c8ba9b..bfa19587fd763ed552c2b9aa1a6e8981b6aa1c40 100644
> --- a/tools/testing/cxl/Kbuild
> +++ b/tools/testing/cxl/Kbuild
> @@ -62,7 +62,8 @@ cxl_core-y += $(CXL_CORE_SRC)/hdm.o
>  cxl_core-y += $(CXL_CORE_SRC)/pmu.o
>  cxl_core-y += $(CXL_CORE_SRC)/cdat.o
>  cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
> -cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
> +cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o \
> +                                 $(CXL_CORE_SRC)/extent.o
>  cxl_core-y += config_check.o
>  cxl_core-y += cxl_core_test.o
>  cxl_core-y += cxl_core_exports.o
>
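
One last informational note on the response sizing in
cxl_send_dc_response(): the max_extents computation is what forces the
multi-response (MORE bit) case that the fix above affects. Restating the
arithmetic as standalone C, with an assumed 256-byte mailbox payload purely
for illustration (the real payload_size is device dependent):

#include <stdint.h>
#include <stdio.h>

/* 24 bytes per entry, per Table 8-169 */
struct updated_extent_list {
        uint64_t dpa_start;
        uint64_t length;
        uint8_t reserved[8];
};

/* 8 bytes of header ahead of the extent list, per Table 8-168 */
struct dc_response_hdr {
        uint32_t extent_list_size;
        uint8_t flags;
        uint8_t reserved[3];
};

int main(void)
{
        size_t payload_size = 256;      /* assumed for illustration */
        size_t max_extents =
                (payload_size - sizeof(struct dc_response_hdr)) /
                sizeof(struct updated_extent_list);

        /* (256 - 8) / 24 = 10 extents per response payload */
        printf("max extents per response: %zu\n", max_extents);
        return 0;
}

So any accepted group larger than the per-payload limit gets split across
responses, which is exactly when the MORE bit matters.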