Re: [PATCH v4 2/3] cxlflash: Superpipe support

"Matthew R. Ochs" <mrochs@xxxxxxxxxxxxxxxxxx> · Tue, 11 Aug 2015 16:51:46 -0500

> On Aug 11, 2015, at 12:23 AM, Michael Neuling <mikey@xxxxxxxxxxx> wrote:
> 
> Some comments inline

Thanks for reviewing, responses below.

> On Mon, 2015-08-10 at 12:09 -0500, Matthew R. Ochs wrote:
>> Add superpipe supporting infrastructure to device driver for the IBM CXL
>> Flash adapter. This patch allows userspace applications to take advantage
>> of the accelerated I/O features that this adapter provides and bypass the
>> traditional filesystem stack.
>> 
>> Signed-off-by: Matthew R. Ochs <mrochs@xxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Manoj N. Kumar <manoj@xxxxxxxxxxxxxxxxxx>
>> ---
>> Documentation/ioctl/ioctl-number.txt |    1 +
>> Documentation/powerpc/cxlflash.txt   |  297 +++++
>> drivers/scsi/cxlflash/Makefile       |    2 +-
>> drivers/scsi/cxlflash/common.h       |   19 +
>> drivers/scsi/cxlflash/main.c         |   21 +-
>> drivers/scsi/cxlflash/superpipe.c    | 2206 ++++++++++++++++++++++++++++++++++
>> drivers/scsi/cxlflash/superpipe.h    |  127 ++
>> include/uapi/scsi/Kbuild             |    1 +
>> include/uapi/scsi/cxlflash_ioctl.h   |  139 +++
>> 9 files changed, 2810 insertions(+), 3 deletions(-)
>> create mode 100644 Documentation/powerpc/cxlflash.txt
>> create mode 100644 drivers/scsi/cxlflash/superpipe.c
>> create mode 100644 drivers/scsi/cxlflash/superpipe.h
>> create mode 100644 include/uapi/scsi/cxlflash_ioctl.h
>> 
>> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
>> index fdd35bf..67273e1 100644
>> --- a/Documentation/ioctl/ioctl-number.txt
>> +++ b/Documentation/ioctl/ioctl-number.txt
>> @@ -315,6 +315,7 @@ Code  Seq#(hex)	Include File		Comments
>> 0xC0	00-0F	linux/usb/iowarrior.h
>> 0xC9	00-0F	uapi/cxl-memcpy.h       Reserved for non-upstream prototype
> 
> This above doesn't exist upstream.  Make sure your patch applies to a
> clean tree.

My mistake. Will fix in v5.

>> +
>> +DK_CXLFLASH_USER_VIRTUAL
>> +------------------------
>> +    This ioctl is responsible for transitioning the LUN to virtual mode
>> +    of access and configuring the AFU for virtual access from user space
>> +    on a per-context basis. Additionally, the block size and last logical
>> +    block address (LBA) are returned to the user.
>> +
>> +    As mentioned previously, when operating in user space access mode,
>> +    LUNs may be accessed in whole or in part. Only one mode is allowed
>> +    at a time and if one mode is active (outstanding references exist),
>> +    requests to use the LUN in a different mode are denied.
>> +
>> +    The AFU is configured for virtual access from user space by adding
>> +    an entry to the AFU's resource handle table. The index of the entry
>> +    is treated as a resource handle that is returned to the user. The
>> +    user is then able to use the handle to reference the LUN during I/O.
>> +
>> +    By default, the virtual LUN is created with a size of 0. The user
>> +    would need to use the DK_CXLFLASH_VLUN_RESIZE ioctl to adjust the grow
>> +    the virtual LUN to a desired size. To avoid having to perform this
>> +    resize for the initial creation of the virtual LUN, the user has the
>> +    option of specifying a size as part of the DK_CXLFLASH_USER_VIRTUAL
>> +    ioctl, such that when success is returned to the user, the
>> +    resource handle that is provided is already referencing provisioned
>> +    storage. This is reflected by the last LBA being a non-zero value.
> 
> 
> This should be in the vlun patch.

Okay.

>> +DK_CXLFLASH_VLUN_RESIZE
>> +-----------------------
>> +    This ioctl is responsible for resizing a previously created virtual
>> +    LUN and will fail if invoked upon a LUN that is not in virtual
>> +    mode. Upon success, an updated last LBA is returned to the user
>> +    indicating the new size of the virtual LUN associated with the
>> +    resource handle.
>> +
>> +    The partitioning of virtual LUNs is jointly mediated by the cxlflash
>> +    driver and the AFU. An allocation table is kept for each LUN that is
>> +    operating in the virtual mode and used to program a LUN translation
>> +    table that the AFU references when provided with a resource handle.
> 
> All this vlun discussion would be in the next patch not this superpipe patch.

Ditto, will do.

>> +DK_CXLFLASH_RELEASE
>> +-------------------
>> +    This ioctl is responsible for releasing a previously obtained
>> +    reference to either a physical or virtual LUN. This can be
>> +    thought of as the inverse of the DK_CXLFLASH_USER_DIRECT or
>> +    DK_CXLFLASH_USER_VIRTUAL ioctls. Upon success, the resource handle
>> +    is no longer valid and the entry in the resource handle table is
>> +    made available to be used again.
>> +
>> +    As part of the release process for virtual LUNs, the virtual LUN
>> +    is first resized to 0 to clear out and free the translation tables
>> +    associated with the virtual LUN reference.
> 
> Looks like file_ops release calls these functions anyway.  So why do we need
> this?

This is required because a user may hold multiple resource handles
per context and may wish to release some of those resource handles
while not relinquishing all of them (which would occur [along with the
context too] when using the file_ops release). In short, this release
is the opposite of user_virtual/user_physical.

> 
>> +DK_CXLFLASH_DETACH
>> +------------------
>> +    This ioctl is responsible for unregistering a context with the
>> +    cxlflash driver and release outstanding resources that were
>> +    not explicitly released via the DK_CXLFLASH_RELEASE ioctl. Upon
>> +    success, all "tokens" which had been provided to the user from the
>> +    DK_CXLFLASH_ATTACH onward are no longer valid.
> 
> Why split this between detach and release?  Can you reused a released context?

They are for ‘undoing’ different things. The release is for freeing resource
handles while detach is for freeing contexts. A user may wish to allocate
a context only once but create/free resources associated with that context
many times without ever freeing the context.

>> +DK_CXLFLASH_CLONE
>> +-----------------
>> +    This ioctl is responsible for cloning a previously created
>> +    context to a more recently created context. It exists solely to
>> +    support maintaining user space access to storage after a process
>> +    forks. Upon success, the child process (which invoked the ioctl)
>> +    will have access to the same LUNs via the same resource handle(s)
>> +    and fd2 as the parent, but under a different context.
>> +
>> +    Context sharing across processes is not supported with CXL and
>> +    therefore each fork must be met with establishing a new context
>> +    for the child process. This ioctl simplifies the state management
>> +    and playback required by a user in such a scenario. When a process
>> +    forks, child process can clone the parents context by first creating
>> +    a context (via DK_CXLFLASH_ATTACH) and then using this ioctl to
>> +    perform the clone from the parent to the child.
>> +
>> +    The clone itself is fairly simple. The resource handle and lun
>> +    translation tables are copied from the parent context to the child's
>> +    and then synced with the AFU.
> 
> This should be in the vlun patch.

Okay.

> Also, should be called DK_CXLFLASH_VLUN_CLONE to be consisten with VLUN_RESIZE

Sure, we can accommodate that.

>> +
>> +DK_CXLFLASH_VERIFY
>> +------------------
>> +    This ioctl is used to detect various changes such as the capacity of
>> +    the disk changing, the number of LUNs visible changing, etc. In cases
>> +    where the changes affect the application (such as a LUN resize), the
>> +    cxlflash driver will report the changed state to the application.
>> 
> 
> This needs a broader description.  Verify exactly what?

The user calls in when they want to validate that a LUN hasn’t been changed
in response to a check condition. As the user is operating out of band from
the kernel, they will see these types of events without our knowledge. When
encountered, the user’s architected behavior is to call in to this ioctl, indicating
what they want to verify and passing along any appropriate information. For
now, we only support verifying a LUN change (ie: size different) with sense data.

Will include these types of details here in v5.

>> +DK_CXLFLASH_RECOVER_AFU
>> +-----------------------
>> +    This ioctl is used to drive recovery (if such an action is warranted)
>> +    of a specified user context. Any state associated with the user context
>> +    is re-established upon successful recovery.
> 
> Why would I call this?  What scenario?

User contexts are put into an error condition when the device needs to be
reset or is terminating. Users are notified of this error condition by seeing
all 0xF’s on an MMIO read. Upon encountering this, the architected behavior
for a user is to call into this ioctl to recover their context. A user may also
call into this ioctl at any time to check if the device is operating normally.
If a failure is returned from this ioctl, the user is expected to gracefully
clean up their context via release/detach ioctls. Until they do, the context
they hold is not relinquished. The user may also optionally exit the process
at which time the context/resources they held will be freed as part of the
release fop.

Will include these types of details here in v5.

>> +
>> +DK_CXLFLASH_MANAGE_LUN
>> +----------------------
>> +    This ioctl is used to switch a LUN from a mode where it is available
>> +    for file-system access (legacy), to a mode where it is set aside for
>> +    exclusive user space access (superpipe). In case a LUN is visible
>> +    across multiple ports and adapters, this ioctl is used to uniquely
>> +    identify each LUN by its World Wide Node Name (WWNN).
> 
> Should this be called something specific?  DK_CXLFLASH_SUPERPIPE_MODE?

We intentionally gave this ioctl a somewhat vague/broader name as we may use it
to perform more actions than simply enabling/disabling superpipe mode in the future.

>> +	atomic_t recovery_threads;
>> +	struct mutex ctx_recovery_mutex;
>> +	struct mutex ctx_tbl_list_mutex;
>> +	struct ctx_info *ctx_tbl[MAX_CONTEXT];
> 
> MAX_CONTEXT=512.  This is pretty big!

It is. But then again, it’s much better than when we started and had an array
of struct ctx_info. We can possibly look at coming up with a more dynamic
solution in the future if memory footprint is of great concern. Right now I
think our footprint is pretty good overall.

>> +/**
>> + * lookup_lun() - find or create a local LUN information structure
>> + * @sdev:	SCSI device associated with LUN.
>> + * @wwid:	WWID associated with LUN.
>> + *
>> + * When a local LUN is not found and a global LUN is also not found, both
>> + * a global LUN and local LUN are created. The global LUN is added to the
>> + * global list and the local LUN is returned.
>> + *
>> + * Return: Found/Allocated local lun_info structure on success, NULL on failure
>> + */
>> +static struct llun_info *lookup_lun(struct scsi_device *sdev, u8 *wwid)
> 
> Should this been lookup_and_create_lun()?  lookup_lun() does something quite
> different to lookup_local() despite being named simlarly.

Ben Herrenschmidt had a similar comment. We’re going to look at renaming these.

>> +/**
>> + * cxlflash_term_luns() - Delete all entries from local lun list, free.
>> + * @cfg:	Internal structure associated with the host.
>> + */
>> +void cxlflash_term_luns(struct cxlflash_cfg *cfg)
> 
> This just does local luns?

Will look at renaming this.

>> +/**
>> + * cxlflash_stop_term_user_contexts() - stops/terminates known user contexts
>> + * @cfg:	Internal structure associated with the host.
>> + *
>> + * When the host needs to go down, all users must be quiesced and their
>> + * memory freed. This is accomplished by putting the contexts in error
>> + * state which will notify the user and let them 'drive' the teardown.
>> + * Meanwhile, this routine camps until all user contexts have been removed.
>> + */
>> +void cxlflash_stop_term_user_contexts(struct cxlflash_cfg *cfg)
>> +{
>> +	int i, found;
>> +
>> +	cxlflash_mark_contexts_error(cfg);
>> +
>> +	while (true) {
>> +		found = false;
>> +
>> +		for (i = 0; i < MAX_CONTEXT; i++)
>> +			if (cfg->ctx_tbl[i]) {
>> +				found = true;
>> +				break;
>> +			}
>> +
>> +		if (!found && list_empty(&cfg->ctx_err_recovery))
>> +			return;
>> +
>> +		pr_debug("%s: Wait for user context to quiesce...\n", __func__);
>> +		wake_up_all(&cfg->limbo_waitq);
>> +		ssleep(1);
> 
> Why 1 sec and why in a loop?  Can’t you poll/wait for completion somewhere?

This routine is called when the device is being removed and we need to stall until
the user contexts are made aware of removal [by marking them in error] and have
completely gone away [there are no longer any contexts in our table or list].

The 1 second sleep is simply to give the users time to cleanup once they see
the notification. We can make it a larger/smaller value or remove it entirely, but
I felt that since this is not a hot path there was no reason to perform a busy-wait
style of loop here and yield to someone else.

As for the completion/poll, I did consider that but couldn’t come up with a clean
way of implementing given how we designed the notification/cleanup mechanism
(really just a failed recovery). We can certainly look at doing that as an
enhancement in the future.

>> +	if (likely(ctxid < MAX_CONTEXT)) {
>> +retry:
>> +		rc = mutex_lock_interruptible(&cfg->ctx_tbl_list_mutex);
>> +		if (rc)
>> +			goto out;
>> +
>> +		ctxi = cfg->ctx_tbl[ctxid];
>> +		if (ctxi)
>> +			if ((file && (ctxi->file != file)) ||
>> +			    (!file && (ctxi->ctxid != rctxid)))
>> +				ctxi = NULL;
>> +
>> +		if ((ctx_ctrl & CTX_CTRL_ERR) ||
>> +		    (!ctxi && (ctx_ctrl & CTX_CTRL_ERR_FALLBACK)))
>> +			ctxi = find_error_context(cfg, rctxid, file);
>> +		if (!ctxi) {
>> +			mutex_unlock(&cfg->ctx_tbl_list_mutex);
>> +			goto out;
>> +		}
>> +
>> +		/*
>> +		 * Need to acquire ownership of the context while still under
>> +		 * the table/list lock to serialize with a remove thread. Use
>> +		 * the 'try' to avoid stalling the table/list lock for a single
>> +		 * context.
>> +		 */
>> +		rc = mutex_trylock(&ctxi->mutex);
>> +		mutex_unlock(&cfg->ctx_tbl_list_mutex);
>> +		if (!rc)
>> +			goto retry;
> 
> Please just create a loop rather than this goto retry.

Okay.

>> +	rhte_checkin(ctxi, rhte);
>> +	cxlflash_lun_detach(gli);
>> +
>> +out:
>> +	if (unlock_ctx)
>> +		mutex_unlock(&ctxi->mutex);
> 
> Where is the matching lock for this?

One of the main design points of our context serialization strategy is that
any context returned from get_context() has been fully validated, will not
be removed from under you, and _is holding the context mutex_. Thus
for each of these mutex_unlock(ctxi) you see at the bottom of external
entry points, the mutex was obtained in get_context().

>> +	char *tmp = NULL;
>> +	size_t size;
>> +	struct afu *afu = cfg->afu;
>> +	struct ctx_info *ctxi = NULL;
>> +	struct sisl_rht_entry *rhte;
>> +
>> +	size = (MAX_RHT_PER_CONTEXT * sizeof(*ctxi->rht_lun));
>> +	size += sizeof(*ctxi);
>> +
>> +	tmp = kzalloc(size, GFP_KERNEL);
> 
> Just do two allocs. One for ctxi and one for rht_lun.  This is overly
> complicated.

I disagree that it’s overly complicated. The intention here was to get this
stuff together to avoid multiple function calls and possibly help out perf-wise
via locality. That said, I’m not opposed to performing multiple allocations.

Look for this in v5.

> 
>> +	if (unlikely(!tmp)) {
>> +		pr_err("%s: Unable to allocate context! (%ld)\n",
>> +		       __func__, size);
>> +		goto out;
>> +	}
>> +
>> +	rhte = (struct sisl_rht_entry *)get_zeroed_page(GFP_KERNEL);
>> +	if (unlikely(!rhte)) {
>> +		pr_err("%s: Unable to allocate RHT!\n", __func__);
>> +		goto err;
>> +	}
>> +
>> +	ctxi = (struct ctx_info *)tmp;
>> +	tmp += sizeof(*ctxi);
>> +	ctxi->rht_lun = (struct llun_info **)tmp;
> 
> Yuck… just do two allocs rather than this throbbing.

You got it.

>> +out:
>> +	if (unlock_ctx)
>> +		mutex_unlock(&ctxi->mutex);
> 
> Where is the matching lock for this?

See earlier comment about get_context().

>> +	if (likely(ctxi))
>> +		mutex_unlock(&ctxi->mutex);
> 
> Where is the matching lock for this?

Ditto.

>> +	file = cxl_get_fd(ctx, &cfg->cxl_fops, &fd);
> 
> You should create a new fops for each call here.  We write the fops to fill it
> out.  I think it’ll work as you have now but it's a bit dodgy.

Are you saying that instead of having a single fops for the device, each of our
context’s should have an fops that we pass here?

>> +	list_add(&lun_access->list, &ctxi->luns);
>> +	mutex_unlock(&ctxi->mutex);
> 
> Where is the matching lock for this?

Just like with get_context(), we leave create_context() with the mutex held.

>> +	rc = cxlflash_lun_attach(gli, MODE_PHYSICAL);
>> +	if (unlikely(rc)) {
>> +		dev_err(dev, "%s: Failed to attach to LUN! (PHYSICAL)\n",
> 
> Is this going to spam the console from userspace?  Same below.

With a well-behaved application it shouldn’t We can certainly look at moving
to a debug if you feel it’s likely that someone would use this to spam the
console.

>> +	struct cxlflash_cfg *cfg = (struct cxlflash_cfg *)sdev->host->hostdata;
>> +	struct afu *afu = cfg->afu;
>> +	struct dk_cxlflash_hdr *hdr;
>> +	char buf[MAX_CXLFLASH_IOCTL_SZ];
> 
> why is buf not just a “union cxlflash_ioctls"?

Because I wanted you to have to look up this define? ;)

In all seriousness, we originally had this define as a stand-alone value (it
wasn’t tied to the union). When I implemented the union, I figured the define
was more self-descriptive in what we were trying to convey. I’ll change it
over to the union in v5.

>> +	hdr = (struct dk_cxlflash_hdr *)&buf;
>> +	if (hdr->version != 0) {
>> +		pr_err("%s: Version %u not supported for %s\n",
>> +		       __func__, hdr->version, decode_ioctl(cmd));
>> +		rc = -EINVAL;
>> +		goto cxlflash_ioctl_exit;
>> +	}
> 
> Do you advertise this version anywhere?  Users just have to call it and fail?

We don’t. We can add a version define to the exported ioctl header.

> You should check hdr->flags are zero incase some new userspace tries to set
> them.  Same for hdr->rsvd.

We can’t do that for the flags because those they are used for some of the ioctls.
The reserved’s however _can_ be checked under the version 0 clause.

> Also, can you do these checks earlier.  It seems you've already done a bunch of
> stuff before here.

>From a runtime perspective, we haven’t actually done that much prior to the version
check. I suppose we could put the check earlier but I don’t like the idea of touching
data that hasn’t been copied in. So we would need to copy in just the header, then
check the version. Then if it’s valid, copy in the rest. At some point later on if/when more
versions are supported we might need to do something like this, but I think it seems a
bit silly to do that now.

> 
>> +
>> +	rc = do_ioctl(sdev, (void *)&buf);
>> +	if (likely(!rc))
>> +		if (unlikely(copy_to_user(arg, &buf, size))) {
>> +			pr_err("%s: copy_to_user() fail! "
>> +			       "size=%lu cmd=%d (%s) arg=%p\n",
>> +			       __func__, size, cmd, decode_ioctl(cmd), arg);
>> +			rc = -EFAULT;
>> +		}
>> +
>> +	/* fall through to exit */
>> +
>> +cxlflash_ioctl_exit:
>> +	if (unlikely(rc && known_ioctl))
>> +		pr_err("%s: ioctl %s (%08X) on dev(%d/%d/%d/%llu) "
>> +		       "returned rc %d\n", __func__,
>> +		       decode_ioctl(cmd), cmd, shost->host_no,
>> +		       sdev->channel, sdev->id, sdev->lun, rc);
>> +	else
>> +		pr_debug("%s: ioctl %s (%08X) on dev(%d/%d/%d/%llu) "
>> +			 "returned rc %d\n", __func__, decode_ioctl(cmd),
>> +			 cmd, shost->host_no, sdev->channel, sdev->id,
>> +			 sdev->lun, rc);
>> +	return rc;
>> +}
>> +
> 
> git am complains about this trailing new line.

Consider it removed then.

>> +#define DK_CXLFLASH_ATTACH		CXL_IOWR(0x80, dk_cxlflash_attach)
>> +#define DK_CXLFLASH_USER_DIRECT		CXL_IOWR(0x81, dk_cxlflash_udirect)
>> +#define DK_CXLFLASH_RELEASE		CXL_IOWR(0x84, dk_cxlflash_release)
>> +#define DK_CXLFLASH_DETACH		CXL_IOWR(0x85, dk_cxlflash_detach)
>> +#define DK_CXLFLASH_VERIFY		CXL_IOWR(0x86, dk_cxlflash_verify)
>> +#define DK_CXLFLASH_RECOVER_AFU		CXL_IOWR(0x88, dk_cxlflash_recover_afu)
>> +#define DK_CXLFLASH_MANAGE_LUN		CXL_IOWR(0x89, dk_cxlflash_manage_lun)
> 
> I'm not sure I'd leave these sparse.  What happens if the vlun patches don't
> get in?

I do agree with you. I had wanted to keep them like this as their ordering matched
closer with how they are expected to be used. That said, I’m okay with moving
them around to avoid the sparseness.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html