> On Aug 11, 2015, at 12:23 AM, Michael Neuling <mikey@xxxxxxxxxxx> wrote: > > Some comments inline Thanks for reviewing, responses below. > On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote: >> Add superpipe supporting infrastructure to device driver for the IBM CXL >> Flash adapter. This patch allows userspace applications to take advantage >> of the accelerated I/O features that this adapter provides and bypass the >> traditional filesystem stack. >> >> Signed-off-by: Matthew R. Ochs <mrochs@xxxxxxxxxxxxxxxxxx> >> Signed-off-by: Manoj N. Kumar <manoj@xxxxxxxxxxxxxxxxxx> >> --- >> Documentation/ioctl/ioctl-number.txt | 1 + >> Documentation/powerpc/cxlflash.txt | 297 +++++ >> drivers/scsi/cxlflash/Makefile | 2 +- >> drivers/scsi/cxlflash/common.h | 19 + >> drivers/scsi/cxlflash/main.c | 21 +- >> drivers/scsi/cxlflash/superpipe.c | 2206 ++++++++++++++++++++++++++++++++++ >> drivers/scsi/cxlflash/superpipe.h | 127 ++ >> include/uapi/scsi/Kbuild | 1 + >> include/uapi/scsi/cxlflash_ioctl.h | 139 +++ >> 9 files changed, 2810 insertions(+), 3 deletions(-) >> create mode 100644 Documentation/powerpc/cxlflash.txt >> create mode 100644 drivers/scsi/cxlflash/superpipe.c >> create mode 100644 drivers/scsi/cxlflash/superpipe.h >> create mode 100644 include/uapi/scsi/cxlflash_ioctl.h >> >> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt >> index fdd35bf..67273e1 100644 >> --- a/Documentation/ioctl/ioctl-number.txt >> +++ b/Documentation/ioctl/ioctl-number.txt >> @@ -315,6 +315,7 @@ Code Seq#(hex) Include File Comments >> 0xC0 00-0F linux/usb/iowarrior.h >> 0xC9 00-0F uapi/cxl-memcpy.h Reserved for non-upstream prototype > > This above doesn't exist upstream. Make sure your patch applies to a > clean tree. My mistake. Will fix in v5. >> + >> +DK_CXLFLASH_USER_VIRTUAL >> +------------------------ >> + This ioctl is responsible for transitioning the LUN to virtual mode >> + of access and configuring the AFU for virtual access from user space >> + on a per-context basis. Additionally, the block size and last logical >> + block address (LBA) are returned to the user. >> + >> + As mentioned previously, when operating in user space access mode, >> + LUNs may be accessed in whole or in part. Only one mode is allowed >> + at a time and if one mode is active (outstanding references exist), >> + requests to use the LUN in a different mode are denied. >> + >> + The AFU is configured for virtual access from user space by adding >> + an entry to the AFU's resource handle table. The index of the entry >> + is treated as a resource handle that is returned to the user. The >> + user is then able to use the handle to reference the LUN during I/O. >> + >> + By default, the virtual LUN is created with a size of 0. The user >> + would need to use the DK_CXLFLASH_VLUN_RESIZE ioctl to adjust the grow >> + the virtual LUN to a desired size. To avoid having to perform this >> + resize for the initial creation of the virtual LUN, the user has the >> + option of specifying a size as part of the DK_CXLFLASH_USER_VIRTUAL >> + ioctl, such that when success is returned to the user, the >> + resource handle that is provided is already referencing provisioned >> + storage. This is reflected by the last LBA being a non-zero value. > > > This should be in the vlun patch. Okay. >> +DK_CXLFLASH_VLUN_RESIZE >> +----------------------- >> + This ioctl is responsible for resizing a previously created virtual >> + LUN and will fail if invoked upon a LUN that is not in virtual >> + mode. Upon success, an updated last LBA is returned to the user >> + indicating the new size of the virtual LUN associated with the >> + resource handle. >> + >> + The partitioning of virtual LUNs is jointly mediated by the cxlflash >> + driver and the AFU. An allocation table is kept for each LUN that is >> + operating in the virtual mode and used to program a LUN translation >> + table that the AFU references when provided with a resource handle. > > All this vlun discussion would be in the next patch not this superpipe patch. Ditto, will do. >> +DK_CXLFLASH_RELEASE >> +------------------- >> + This ioctl is responsible for releasing a previously obtained >> + reference to either a physical or virtual LUN. This can be >> + thought of as the inverse of the DK_CXLFLASH_USER_DIRECT or >> + DK_CXLFLASH_USER_VIRTUAL ioctls. Upon success, the resource handle >> + is no longer valid and the entry in the resource handle table is >> + made available to be used again. >> + >> + As part of the release process for virtual LUNs, the virtual LUN >> + is first resized to 0 to clear out and free the translation tables >> + associated with the virtual LUN reference. > > Looks like file_ops release calls these functions anyway. So why do we need > this? This is required because a user may hold multiple resource handles per context and may wish to release some of those resource handles while not relinquishing all of them (which would occur [along with the context too] when using the file_ops release). In short, this release is the opposite of user_virtual/user_physical. > >> +DK_CXLFLASH_DETACH >> +------------------ >> + This ioctl is responsible for unregistering a context with the >> + cxlflash driver and release outstanding resources that were >> + not explicitly released via the DK_CXLFLASH_RELEASE ioctl. Upon >> + success, all "tokens" which had been provided to the user from the >> + DK_CXLFLASH_ATTACH onward are no longer valid. > > Why split this between detach and release? Can you reused a released context? They are for ‘undoing’ different things. The release is for freeing resource handles while detach is for freeing contexts. A user may wish to allocate a context only once but create/free resources associated with that context many times without ever freeing the context. >> +DK_CXLFLASH_CLONE >> +----------------- >> + This ioctl is responsible for cloning a previously created >> + context to a more recently created context. It exists solely to >> + support maintaining user space access to storage after a process >> + forks. Upon success, the child process (which invoked the ioctl) >> + will have access to the same LUNs via the same resource handle(s) >> + and fd2 as the parent, but under a different context. >> + >> + Context sharing across processes is not supported with CXL and >> + therefore each fork must be met with establishing a new context >> + for the child process. This ioctl simplifies the state management >> + and playback required by a user in such a scenario. When a process >> + forks, child process can clone the parents context by first creating >> + a context (via DK_CXLFLASH_ATTACH) and then using this ioctl to >> + perform the clone from the parent to the child. >> + >> + The clone itself is fairly simple. The resource handle and lun >> + translation tables are copied from the parent context to the child's >> + and then synced with the AFU. > > This should be in the vlun patch. Okay. > Also, should be called DK_CXLFLASH_VLUN_CLONE to be consisten with VLUN_RESIZE Sure, we can accommodate that. >> + >> +DK_CXLFLASH_VERIFY >> +------------------ >> + This ioctl is used to detect various changes such as the capacity of >> + the disk changing, the number of LUNs visible changing, etc. In cases >> + where the changes affect the application (such as a LUN resize), the >> + cxlflash driver will report the changed state to the application. >> > > This needs a broader description. Verify exactly what? The user calls in when they want to validate that a LUN hasn’t been changed in response to a check condition. As the user is operating out of band from the kernel, they will see these types of events without our knowledge. When encountered, the user’s architected behavior is to call in to this ioctl, indicating what they want to verify and passing along any appropriate information. For now, we only support verifying a LUN change (ie: size different) with sense data. Will include these types of details here in v5. >> +DK_CXLFLASH_RECOVER_AFU >> +----------------------- >> + This ioctl is used to drive recovery (if such an action is warranted) >> + of a specified user context. Any state associated with the user context >> + is re-established upon successful recovery. > > Why would I call this? What scenario? User contexts are put into an error condition when the device needs to be reset or is terminating. Users are notified of this error condition by seeing all 0xF’s on an MMIO read. Upon encountering this, the architected behavior for a user is to call into this ioctl to recover their context. A user may also call into this ioctl at any time to check if the device is operating normally. If a failure is returned from this ioctl, the user is expected to gracefully clean up their context via release/detach ioctls. Until they do, the context they hold is not relinquished. The user may also optionally exit the process at which time the context/resources they held will be freed as part of the release fop. Will include these types of details here in v5. >> + >> +DK_CXLFLASH_MANAGE_LUN >> +---------------------- >> + This ioctl is used to switch a LUN from a mode where it is available >> + for file-system access (legacy), to a mode where it is set aside for >> + exclusive user space access (superpipe). In case a LUN is visible >> + across multiple ports and adapters, this ioctl is used to uniquely >> + identify each LUN by its World Wide Node Name (WWNN). > > Should this be called something specific? DK_CXLFLASH_SUPERPIPE_MODE? We intentionally gave this ioctl a somewhat vague/broader name as we may use it to perform more actions than simply enabling/disabling superpipe mode in the future. >> + atomic_t recovery_threads; >> + struct mutex ctx_recovery_mutex; >> + struct mutex ctx_tbl_list_mutex; >> + struct ctx_info *ctx_tbl[MAX_CONTEXT]; > > MAX_CONTEXT=512. This is pretty big! It is. But then again, it’s much better than when we started and had an array of struct ctx_info. We can possibly look at coming up with a more dynamic solution in the future if memory footprint is of great concern. Right now I think our footprint is pretty good overall. >> +/** >> + * lookup_lun() - find or create a local LUN information structure >> + * @sdev: SCSI device associated with LUN. >> + * @wwid: WWID associated with LUN. >> + * >> + * When a local LUN is not found and a global LUN is also not found, both >> + * a global LUN and local LUN are created. The global LUN is added to the >> + * global list and the local LUN is returned. >> + * >> + * Return: Found/Allocated local lun_info structure on success, NULL on failure >> + */ >> +static struct llun_info *lookup_lun(struct scsi_device *sdev, u8 *wwid) > > Should this been lookup_and_create_lun()? lookup_lun() does something quite > different to lookup_local() despite being named simlarly. Ben Herrenschmidt had a similar comment. We’re going to look at renaming these. >> +/** >> + * cxlflash_term_luns() - Delete all entries from local lun list, free. >> + * @cfg: Internal structure associated with the host. >> + */ >> +void cxlflash_term_luns(struct cxlflash_cfg *cfg) > > This just does local luns? Will look at renaming this. >> +/** >> + * cxlflash_stop_term_user_contexts() - stops/terminates known user contexts >> + * @cfg: Internal structure associated with the host. >> + * >> + * When the host needs to go down, all users must be quiesced and their >> + * memory freed. This is accomplished by putting the contexts in error >> + * state which will notify the user and let them 'drive' the teardown. >> + * Meanwhile, this routine camps until all user contexts have been removed. >> + */ >> +void cxlflash_stop_term_user_contexts(struct cxlflash_cfg *cfg) >> +{ >> + int i, found; >> + >> + cxlflash_mark_contexts_error(cfg); >> + >> + while (true) { >> + found = false; >> + >> + for (i = 0; i < MAX_CONTEXT; i++) >> + if (cfg->ctx_tbl[i]) { >> + found = true; >> + break; >> + } >> + >> + if (!found && list_empty(&cfg->ctx_err_recovery)) >> + return; >> + >> + pr_debug("%s: Wait for user context to quiesce...\n", __func__); >> + wake_up_all(&cfg->limbo_waitq); >> + ssleep(1); > > Why 1 sec and why in a loop? Can’t you poll/wait for completion somewhere? This routine is called when the device is being removed and we need to stall until the user contexts are made aware of removal [by marking them in error] and have completely gone away [there are no longer any contexts in our table or list]. The 1 second sleep is simply to give the users time to cleanup once they see the notification. We can make it a larger/smaller value or remove it entirely, but I felt that since this is not a hot path there was no reason to perform a busy-wait style of loop here and yield to someone else. As for the completion/poll, I did consider that but couldn’t come up with a clean way of implementing given how we designed the notification/cleanup mechanism (really just a failed recovery). We can certainly look at doing that as an enhancement in the future. >> + if (likely(ctxid < MAX_CONTEXT)) { >> +retry: >> + rc = mutex_lock_interruptible(&cfg->ctx_tbl_list_mutex); >> + if (rc) >> + goto out; >> + >> + ctxi = cfg->ctx_tbl[ctxid]; >> + if (ctxi) >> + if ((file && (ctxi->file != file)) || >> + (!file && (ctxi->ctxid != rctxid))) >> + ctxi = NULL; >> + >> + if ((ctx_ctrl & CTX_CTRL_ERR) || >> + (!ctxi && (ctx_ctrl & CTX_CTRL_ERR_FALLBACK))) >> + ctxi = find_error_context(cfg, rctxid, file); >> + if (!ctxi) { >> + mutex_unlock(&cfg->ctx_tbl_list_mutex); >> + goto out; >> + } >> + >> + /* >> + * Need to acquire ownership of the context while still under >> + * the table/list lock to serialize with a remove thread. Use >> + * the 'try' to avoid stalling the table/list lock for a single >> + * context. >> + */ >> + rc = mutex_trylock(&ctxi->mutex); >> + mutex_unlock(&cfg->ctx_tbl_list_mutex); >> + if (!rc) >> + goto retry; > > Please just create a loop rather than this goto retry. Okay. >> + rhte_checkin(ctxi, rhte); >> + cxlflash_lun_detach(gli); >> + >> +out: >> + if (unlock_ctx) >> + mutex_unlock(&ctxi->mutex); > > Where is the matching lock for this? One of the main design points of our context serialization strategy is that any context returned from get_context() has been fully validated, will not be removed from under you, and _is holding the context mutex_. Thus for each of these mutex_unlock(ctxi) you see at the bottom of external entry points, the mutex was obtained in get_context(). >> + char *tmp = NULL; >> + size_t size; >> + struct afu *afu = cfg->afu; >> + struct ctx_info *ctxi = NULL; >> + struct sisl_rht_entry *rhte; >> + >> + size = (MAX_RHT_PER_CONTEXT * sizeof(*ctxi->rht_lun)); >> + size += sizeof(*ctxi); >> + >> + tmp = kzalloc(size, GFP_KERNEL); > > Just do two allocs. One for ctxi and one for rht_lun. This is overly > complicated. I disagree that it’s overly complicated. The intention here was to get this stuff together to avoid multiple function calls and possibly help out perf-wise via locality. That said, I’m not opposed to performing multiple allocations. Look for this in v5. > >> + if (unlikely(!tmp)) { >> + pr_err("%s: Unable to allocate context! (%ld)\n", >> + __func__, size); >> + goto out; >> + } >> + >> + rhte = (struct sisl_rht_entry *)get_zeroed_page(GFP_KERNEL); >> + if (unlikely(!rhte)) { >> + pr_err("%s: Unable to allocate RHT!\n", __func__); >> + goto err; >> + } >> + >> + ctxi = (struct ctx_info *)tmp; >> + tmp += sizeof(*ctxi); >> + ctxi->rht_lun = (struct llun_info **)tmp; > > Yuck… just do two allocs rather than this throbbing. You got it. >> +out: >> + if (unlock_ctx) >> + mutex_unlock(&ctxi->mutex); > > Where is the matching lock for this? See earlier comment about get_context(). >> + if (likely(ctxi)) >> + mutex_unlock(&ctxi->mutex); > > Where is the matching lock for this? Ditto. >> + file = cxl_get_fd(ctx, &cfg->cxl_fops, &fd); > > You should create a new fops for each call here. We write the fops to fill it > out. I think it’ll work as you have now but it's a bit dodgy. Are you saying that instead of having a single fops for the device, each of our context’s should have an fops that we pass here? >> + list_add(&lun_access->list, &ctxi->luns); >> + mutex_unlock(&ctxi->mutex); > > Where is the matching lock for this? Just like with get_context(), we leave create_context() with the mutex held. >> + rc = cxlflash_lun_attach(gli, MODE_PHYSICAL); >> + if (unlikely(rc)) { >> + dev_err(dev, "%s: Failed to attach to LUN! (PHYSICAL)\n", > > Is this going to spam the console from userspace? Same below. With a well-behaved application it shouldn’t We can certainly look at moving to a debug if you feel it’s likely that someone would use this to spam the console. >> + struct cxlflash_cfg *cfg = (struct cxlflash_cfg *)sdev->host->hostdata; >> + struct afu *afu = cfg->afu; >> + struct dk_cxlflash_hdr *hdr; >> + char buf[MAX_CXLFLASH_IOCTL_SZ]; > > why is buf not just a “union cxlflash_ioctls"? Because I wanted you to have to look up this define? ;) In all seriousness, we originally had this define as a stand-alone value (it wasn’t tied to the union). When I implemented the union, I figured the define was more self-descriptive in what we were trying to convey. I’ll change it over to the union in v5. >> + hdr = (struct dk_cxlflash_hdr *)&buf; >> + if (hdr->version != 0) { >> + pr_err("%s: Version %u not supported for %s\n", >> + __func__, hdr->version, decode_ioctl(cmd)); >> + rc = -EINVAL; >> + goto cxlflash_ioctl_exit; >> + } > > Do you advertise this version anywhere? Users just have to call it and fail? We don’t. We can add a version define to the exported ioctl header. > You should check hdr->flags are zero incase some new userspace tries to set > them. Same for hdr->rsvd. We can’t do that for the flags because those they are used for some of the ioctls. The reserved’s however _can_ be checked under the version 0 clause. > Also, can you do these checks earlier. It seems you've already done a bunch of > stuff before here. >From a runtime perspective, we haven’t actually done that much prior to the version check. I suppose we could put the check earlier but I don’t like the idea of touching data that hasn’t been copied in. So we would need to copy in just the header, then check the version. Then if it’s valid, copy in the rest. At some point later on if/when more versions are supported we might need to do something like this, but I think it seems a bit silly to do that now. > >> + >> + rc = do_ioctl(sdev, (void *)&buf); >> + if (likely(!rc)) >> + if (unlikely(copy_to_user(arg, &buf, size))) { >> + pr_err("%s: copy_to_user() fail! " >> + "size=%lu cmd=%d (%s) arg=%p\n", >> + __func__, size, cmd, decode_ioctl(cmd), arg); >> + rc = -EFAULT; >> + } >> + >> + /* fall through to exit */ >> + >> +cxlflash_ioctl_exit: >> + if (unlikely(rc && known_ioctl)) >> + pr_err("%s: ioctl %s (%08X) on dev(%d/%d/%d/%llu) " >> + "returned rc %d\n", __func__, >> + decode_ioctl(cmd), cmd, shost->host_no, >> + sdev->channel, sdev->id, sdev->lun, rc); >> + else >> + pr_debug("%s: ioctl %s (%08X) on dev(%d/%d/%d/%llu) " >> + "returned rc %d\n", __func__, decode_ioctl(cmd), >> + cmd, shost->host_no, sdev->channel, sdev->id, >> + sdev->lun, rc); >> + return rc; >> +} >> + > > git am complains about this trailing new line. Consider it removed then. >> +#define DK_CXLFLASH_ATTACH CXL_IOWR(0x80, dk_cxlflash_attach) >> +#define DK_CXLFLASH_USER_DIRECT CXL_IOWR(0x81, dk_cxlflash_udirect) >> +#define DK_CXLFLASH_RELEASE CXL_IOWR(0x84, dk_cxlflash_release) >> +#define DK_CXLFLASH_DETACH CXL_IOWR(0x85, dk_cxlflash_detach) >> +#define DK_CXLFLASH_VERIFY CXL_IOWR(0x86, dk_cxlflash_verify) >> +#define DK_CXLFLASH_RECOVER_AFU CXL_IOWR(0x88, dk_cxlflash_recover_afu) >> +#define DK_CXLFLASH_MANAGE_LUN CXL_IOWR(0x89, dk_cxlflash_manage_lun) > > I'm not sure I'd leave these sparse. What happens if the vlun patches don't > get in? I do agree with you. I had wanted to keep them like this as their ordering matched closer with how they are expected to be used. That said, I’m okay with moving them around to avoid the sparseness. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html