On 8/16/2024 10:44 PM, ira.weiny@xxxxxxxxx wrote:
> From: Navneet Singh <navneet.singh@xxxxxxxxx>
>
> A dynamic capacity device (DCD) sends events to signal the host for
> changes in the availability of Dynamic Capacity (DC) memory. These
> events contain extents describing a DPA range and meta data for memory
> to be added or removed. Events may be sent from the device at any time.
>
> Three types of events can be signaled, Add, Release, and Force Release.
>
> On add, the host may accept or reject the memory being offered. If no
> region exists, or the extent is invalid, the extent should be rejected.
> Add extent events may be grouped by a 'more' bit which indicates those
> extents should be processed as a group.
>
> On remove, the host can delay the response until the host is safely not
> using the memory. If no region exists the release can be sent
> immediately. The host may also release extents (or partial extents) at
> any time. Thus the 'more' bit grouping of release events is of less
> value and can be ignored in favor of sending multiple release capacity
> responses for groups of release events.
>
> Force removal is intended as a mechanism between the FM and the device
> and intended only when the host is unresponsive, out of sync, or
> otherwise broken. Purposely ignore force removal events.
>
> Regions are made up of one or more devices which may be surfacing memory
> to the host. Once all devices in a region have surfaced an extent the
> region can expose a corresponding extent for the user to consume.
> Without interleaving a device extent forms a 1:1 relationship with the
> region extent. Immediately surface a region extent upon getting a
> device extent.
>
> Per the specification the device is allowed to offer or remove extents
> at any time. However, anticipated use cases can expect extents to be
> offered, accepted, and removed in well defined chunks.
>
> Simplify extent tracking with the following restrictions.
>
> 1) Flag for removal any extent which overlaps a requested
>    release range.
> 2) Refuse the offer of extents which overlap already accepted
>    memory ranges.
> 3) Accept again a range which has already been accepted by the
>    host. (It is likely the device has an error because it
>    should already know that this range was accepted. But from
>    the host point of view it is safe to acknowledge that
>    acceptance again.)
>
> Management of the region extent devices must be synchronized with
> potential uses of the memory within the DAX layer. Create region extent
> devices as children of the cxl_dax_region device such that the DAX
> region driver can co-drive them and synchronize with the DAX layer.
> Synchronization and management is handled in a subsequent patch.
>
> Process DCD events and create region devices.
>
> Signed-off-by: Navneet Singh <navneet.singh@xxxxxxxxx>
> Co-developed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
>
> ---
> Changes:
> [iweiny: combine this with the extent surface patches to better show the
>          lifetime extent objects in review]
> [iweiny: clean up commit message.]
> [iweiny: move extent verification of the 'read extents on region
>          creation' to this patch]
> [iweiny: Provide for a common path for extent realization between an add
>          event and adding existing extents.]
> [iweiny: Persist a check that an extent is within an endpoint decoder]
> [iweiny: reduce exported and non-static calls]
> [iweiny: use %par]
>
> <Combined comments from the old patches which were addressed>
>
> [Jonathan: implement the more bit with a simple algorithm which accepts
>            all extents it can.
>            Also include the response more bit to prevent payload
>            overflow]
> [Fan: Do not error if a contained extent is added.]
> [Jonathan: allocate ida after kzalloc]
> [iweiny: fix ida resource leak]
> [fan/djiang: remove unneeded memset]
> [djiang: fix indentation]
> [Jonathan: Fix indentation]
> [Jonathan/djbw: make tag a uuid]
> [djbw: create helper calc_hpa_range() straight away]
> [djbw: Allow for multiple cxled_extents per region_extent]
> [djbw: s/cxl_ed/cxled]
> [djbw: s/cxl_release_ed_extent/cxled_release_extent/]
> [djbw: s/reg_ext/region_extent/]
> [djbw: s/dc_extent/extent/]
> [Gregory/djbw: reject shared extents]
> [iweiny: predicate extent.c compile on CONFIG_CXL_REGION]
> ---
>  drivers/cxl/core/Makefile |   2 +-
>  drivers/cxl/core/core.h   |  13 ++
>  drivers/cxl/core/extent.c | 345 ++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/mbox.c   | 268 ++++++++++++++++++++++++++++++++++-
>  drivers/cxl/core/region.c |   6 +
>  drivers/cxl/cxl.h         |  52 ++++++-
>  drivers/cxl/cxlmem.h      |  26 ++++
>  include/linux/cxl-event.h |  32 +++++
>  tools/testing/cxl/Kbuild  |   3 +-
>  9 files changed, 743 insertions(+), 4 deletions(-)

[...]

> +
> +static bool extents_contain(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_contains);

Is it better to use __free(put_device) here so the
'put_device(extent_device)' below can be dropped?

> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static int match_overlaps(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct match_data *md = data;
> +	struct cxled_extent *entry;
> +	unsigned long index;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	xa_for_each(&region_extent->decoder_extents, index, entry) {
> +		if (md->cxled == entry->cxled &&
> +		    range_overlaps(&entry->dpa_range, md->new_range))
> +			return true;
> +	}
> +
> +	return false;
> +}
> +
> +static bool extents_overlap(struct cxl_dax_region *cxlr_dax,
> +			    struct cxl_endpoint_decoder *cxled,
> +			    struct range *new_range)
> +{
> +	struct device *extent_device;
> +	struct match_data md = {
> +		.cxled = cxled,
> +		.new_range = new_range,
> +	};
> +
> +	extent_device = device_find_child(&cxlr_dax->dev, &md, match_overlaps);

Same as above.
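For example, both helpers could then just return the result of the
lookup. Untested sketch of the scope-based cleanup, shown for
extents_contain():

static bool extents_contain(struct cxl_dax_region *cxlr_dax,
			    struct cxl_endpoint_decoder *cxled,
			    struct range *new_range)
{
	struct match_data md = {
		.cxled = cxled,
		.new_range = new_range,
	};

	/* reference is dropped automatically when extent_device goes out of scope */
	struct device *extent_device __free(put_device) =
		device_find_child(&cxlr_dax->dev, &md, match_contains);

	return extent_device != NULL;
}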
> +	if (!extent_device)
> +		return false;
> +
> +	put_device(extent_device);
> +	return true;
> +}
> +
> +static void calc_hpa_range(struct cxl_endpoint_decoder *cxled,
> +			   struct cxl_dax_region *cxlr_dax,
> +			   struct range *dpa_range,
> +			   struct range *hpa_range)
> +{
> +	resource_size_t dpa_offset, hpa;
> +
> +	dpa_offset = dpa_range->start - cxled->dpa_res->start;
> +	hpa = cxled->cxld.hpa_range.start + dpa_offset;
> +
> +	hpa_range->start = hpa - cxlr_dax->hpa_range.start;
> +	hpa_range->end = hpa_range->start + range_len(dpa_range) - 1;
> +}
> +
> +static int cxlr_rm_extent(struct device *dev, void *data)
> +{
> +	struct region_extent *region_extent = to_region_extent(dev);
> +	struct range *region_hpa_range = data;
> +
> +	if (!region_extent)
> +		return 0;
> +
> +	/*
> +	 * Any extent which 'touches' the released range is removed.
> +	 */
> +	if (range_overlaps(region_hpa_range, &region_extent->hpa_range)) {
> +		dev_dbg(dev, "Remove region extent HPA %par\n",
> +			&region_extent->hpa_range);
> +		region_rm_extent(region_extent);
> +	}
> +	return 0;
> +}
> +
> +int cxl_rm_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range hpa_range, dpa_range;
> +	struct cxl_region *cxlr;
> +
> +	dpa_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr) {
> +		memdev_release_extent(mds, &dpa_range);
> +		return -ENXIO;
> +	}
> +
> +	calc_hpa_range(cxled, cxlr->cxlr_dax, &dpa_range, &hpa_range);
> +
> +	/* Remove region extents which overlap */
> +	return device_for_each_child(&cxlr->cxlr_dax->dev, &hpa_range,
> +				     cxlr_rm_extent);
> +}
> +
> +static int cxlr_add_extent(struct cxl_dax_region *cxlr_dax,
> +			   struct cxl_endpoint_decoder *cxled,
> +			   struct cxled_extent *ed_extent)
> +{
> +	struct region_extent *region_extent;
> +	struct range hpa_range;
> +	int rc;
> +
> +	calc_hpa_range(cxled, cxlr_dax, &ed_extent->dpa_range, &hpa_range);
> +
> +	region_extent = alloc_region_extent(cxlr_dax, &hpa_range, ed_extent->tag);
> +	if (IS_ERR(region_extent))
> +		return PTR_ERR(region_extent);
> +
> +	rc = xa_insert(&region_extent->decoder_extents, (unsigned long)ed_extent, ed_extent,
> +		       GFP_KERNEL);
> +	if (rc) {
> +		free_region_extent(region_extent);
> +		return rc;
> +	}
> +
> +	/* device model handles freeing region_extent */
> +	return online_region_extent(region_extent);
> +}
> +
> +/* Callers are expected to ensure cxled has been attached to a region */
> +int cxl_add_extent(struct cxl_memdev_state *mds, struct cxl_extent *extent)
> +{
> +	u64 start_dpa = le64_to_cpu(extent->start_dpa);
> +	struct cxl_memdev *cxlmd = mds->cxlds.cxlmd;
> +	struct cxl_endpoint_decoder *cxled;
> +	struct range ed_range, ext_range;
> +	struct cxl_dax_region *cxlr_dax;
> +	struct cxled_extent *ed_extent;
> +	struct cxl_region *cxlr;
> +	struct device *dev;
> +
> +	ext_range = (struct range) {
> +		.start = start_dpa,
> +		.end = start_dpa + le64_to_cpu(extent->length) - 1,
> +	};
> +
> +	guard(rwsem_read)(&cxl_region_rwsem);
> +	cxlr = cxl_dpa_to_region(cxlmd, start_dpa, &cxled);
> +	if (!cxlr)
> +		return -ENXIO;
> +
> +	cxlr_dax = cxled->cxld.region->cxlr_dax;
> +	dev = &cxled->cxld.dev;
> +	ed_range = (struct range) {
> +		.start = cxled->dpa_res->start,
> +		.end = cxled->dpa_res->end,
> +	};
> +
> +	dev_dbg(&cxled->cxld.dev, "Checking ED (%pr) for extent %par\n",
> +		cxled->dpa_res, &ext_range);
> +
> +	if (!range_contains(&ed_range, &ext_range)) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) is not fully in ED %par\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag, &ed_range);
> +		return -ENXIO;
> +	}
> +
> +	if (extents_contain(cxlr_dax, cxled, &ext_range))
> +		return 0;
> +
> +	if (extents_overlap(cxlr_dax, cxled, &ext_range))
> +		return -ENXIO;
> +
> +	ed_extent = kzalloc(sizeof(*ed_extent), GFP_KERNEL);
> +	if (!ed_extent)
> +		return -ENOMEM;
> +
> +	ed_extent->cxled = cxled;
> +	ed_extent->dpa_range = ext_range;
> +	memcpy(ed_extent->tag, extent->tag, CXL_EXTENT_TAG_LEN);
> +
> +	dev_dbg(dev, "Add extent %par (%*phC)\n", &ed_extent->dpa_range,
> +		CXL_EXTENT_TAG_LEN, ed_extent->tag);
> +
> +	return cxlr_add_extent(cxlr_dax, cxled, ed_extent);
> +}
> diff --git a/drivers/cxl/core/mbox.c b/drivers/cxl/core/mbox.c
> index 01a447aaa1b1..f629ad7488ac 100644
> --- a/drivers/cxl/core/mbox.c
> +++ b/drivers/cxl/core/mbox.c
> @@ -882,6 +882,48 @@ int cxl_enumerate_cmds(struct cxl_memdev_state *mds)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_enumerate_cmds, CXL);
>
> +static int cxl_validate_extent(struct cxl_memdev_state *mds,
> +			       struct cxl_extent *extent)
> +{
> +	u64 start = le64_to_cpu(extent->start_dpa);
> +	u64 length = le64_to_cpu(extent->length);
> +	struct device *dev = mds->cxlds.dev;
> +
> +	struct range ext_range = (struct range){
> +		.start = start,
> +		.end = start + length - 1,
> +	};
> +
> +	if (le16_to_cpu(extent->shared_extn_seq) != 0) {
> +		dev_err_ratelimited(dev,
> +				    "DC extent DPA %par (%*phC) can not be shared\n",
> +				    &ext_range.start, CXL_EXTENT_TAG_LEN,
> +				    extent->tag);
> +		return -ENXIO;
> +	}
> +
> +	/* Extents must not cross DC region boundary's */
> +	for (int i = 0; i < mds->nr_dc_region; i++) {
> +		struct cxl_dc_region_info *dcr = &mds->dc_region[i];
> +		struct range region_range = (struct range) {
> +			.start = dcr->base,
> +			.end = dcr->base + dcr->decode_len - 1,
> +		};
> +
> +		if (range_contains(&region_range, &ext_range)) {
> +			dev_dbg(dev, "DC extent DPA %par (DCR:%d:%#llx)(%*phC)\n",
> +				&ext_range, i, start - dcr->base,
> +				CXL_EXTENT_TAG_LEN, extent->tag);
> +			return 0;
> +		}
> +	}
> +
> +	dev_err_ratelimited(dev,
> +			    "DC extent DPA %par (%*phC) is not in any DC region\n",
> +			    &ext_range, CXL_EXTENT_TAG_LEN, extent->tag);
> +	return -ENXIO;
> +}
> +
> void cxl_event_trace_record(const struct cxl_memdev *cxlmd,
> 			    enum cxl_event_log_type type,
> 			    enum cxl_event_type event_type,
> @@ -1009,6 +1051,207 @@ static int cxl_clear_event_record(struct cxl_memdev_state *mds,
> 	return rc;
> }
>
> +static int cxl_send_dc_response(struct cxl_memdev_state *mds, int opcode,
> +				struct xarray *extent_array, int cnt)
> +{
> +	struct cxl_mbox_dc_response *p;
> +	struct cxl_mbox_cmd mbox_cmd;
> +	struct cxl_extent *extent;
> +	unsigned long index;
> +	u32 pl_index;
> +	int rc = 0;
> +
> +	size_t pl_size = struct_size(p, extent_list, cnt);
> +	u32 max_extents = cnt;
> +
> +	/* May have to use more bit on response. */
> +	if (pl_size > mds->payload_size) {
> +		max_extents = (mds->payload_size - sizeof(*p)) /
> +			      sizeof(struct updated_extent_list);
> +		pl_size = struct_size(p, extent_list, max_extents);
> +	}
> +
> +	struct cxl_mbox_dc_response *response __free(kfree) =
> +		kzalloc(pl_size, GFP_KERNEL);
> +	if (!response)
> +		return -ENOMEM;
> +
> +	pl_index = 0;
> +	xa_for_each(extent_array, index, extent) {
> +
> +		response->extent_list[pl_index].dpa_start = extent->start_dpa;
> +		response->extent_list[pl_index].length = extent->length;
> +		pl_index++;
> +		response->extent_list_size = cpu_to_le32(pl_index);
> +
> +		if (pl_index == max_extents) {
> +			mbox_cmd = (struct cxl_mbox_cmd) {
> +				.opcode = opcode,
> +				.size_in = struct_size(response, extent_list,
> +						       pl_index),
> +				.payload_in = response,
> +			};
> +
> +			response->flags = 0;
> +			if (pl_index < cnt)
> +				response->flags &= CXL_DCD_EVENT_MORE;

Should this be "response->flags |= CXL_DCD_EVENT_MORE"?

There also seems to be a bug when 'cnt' is double the value of
'max_extents': the response command is sent twice from within this
xa_for_each() loop, and CXL_DCD_EVENT_MORE is set both times because
'pl_index < cnt' is always true here, so even the final response is
flagged as having more extents to follow.

> +
> +			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +			if (rc)
> +				return rc;
> +			pl_index = 0;
> +		}
> +	}
> +
> +	if (pl_index) {
> +		mbox_cmd = (struct cxl_mbox_cmd) {
> +			.opcode = opcode,
> +			.size_in = struct_size(response, extent_list,
> +					       pl_index),
> +			.payload_in = response,
> +		};
> +
> +		response->flags = 0;
> +		rc = cxl_internal_send_cmd(mds, &mbox_cmd);
> +	}
> +
> +	return rc;
> +}
> +
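One way to close that hole might be to count how many extents have been
queued across batches (a new local counter, 'sent', below) and compare
that against 'cnt' before setting the flag. Untested sketch of the loop,
just to illustrate, reusing this patch's names otherwise:

	u32 sent = 0;

	pl_index = 0;
	xa_for_each(extent_array, index, extent) {
		response->extent_list[pl_index].dpa_start = extent->start_dpa;
		response->extent_list[pl_index].length = extent->length;
		pl_index++;

		if (pl_index == max_extents) {
			response->extent_list_size = cpu_to_le32(pl_index);
			sent += pl_index;

			/* flag 'more' only when extents remain after this batch */
			response->flags = 0;
			if (sent < cnt)
				response->flags |= CXL_DCD_EVENT_MORE;

			mbox_cmd = (struct cxl_mbox_cmd) {
				.opcode = opcode,
				.size_in = struct_size(response, extent_list,
						       pl_index),
				.payload_in = response,
			};

			rc = cxl_internal_send_cmd(mds, &mbox_cmd);
			if (rc)
				return rc;
			pl_index = 0;
		}
	}

That way the last full batch inside the loop is sent without
CXL_DCD_EVENT_MORE when nothing follows it, and a trailing partial
batch is still handled by the existing code after the loop.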