Re: [PATCH v6 2/3] cxlflash: Superpipe support

"Matthew R. Ochs" <mrochs@xxxxxxxxxxxxxxxxxx> · Tue, 8 Sep 2015 11:48:52 -0500

Brian,

Thanks for reviewing. Comments inline below. We’ll try to address most
of these in the 4.3 rc phase.  

-matt

> On Aug 26, 2015, at 5:26 PM, Brian King <brking@xxxxxxxxxxxxxxxxxx> wrote:
> 
> Hi Matt,
> 
> Some comments below. I'm happy to see this one get merged as-is, as long as
> these issues can get addressed in a follow on patch.
> 
> Reviewed-by: Brian King <brking@xxxxxxxxxxxxxxxxxx>
> 
> On 08/13/2015 09:47 PM, Matthew R. Ochs wrote:
>> +/**
>> + * cxlflash_term_global_luns() - frees resources associated with global LUN list
>> + */
>> +void cxlflash_term_global_luns(void)
>> +{
>> +	struct glun_info *gli, *temp;
>> +
>> +	mutex_lock(&global.mutex);
>> +	list_for_each_entry_safe(gli, temp, &global.gluns, list) {
>> +		list_del(&gli->list);
>> +		kfree(gli);
> 
> Comments below regarding refcounting this.

Okay.

>> +	}
>> +	mutex_unlock(&global.mutex);
>> +}
>> +
>> +/**
>> + * cxlflash_manage_lun() - handles LUN management activities
>> + * @sdev:	SCSI device associated with LUN.
>> + * @manage:	Manage ioctl data structure.
>> + *
>> + * This routine is used to notify the driver about a LUN's WWID and associate
>> + * SCSI devices (sdev) with a global LUN instance. Additionally it serves to
>> + * change a LUN's operating mode: legacy or superpipe.
>> + *
>> + * Return: 0 on success, -errno on failure
>> + */
>> +int cxlflash_manage_lun(struct scsi_device *sdev,
>> +			struct dk_cxlflash_manage_lun *manage)
>> +{
>> +	int rc = 0;
>> +	struct llun_info *lli = NULL;
>> +	u64 flags = manage->hdr.flags;
>> +	u32 chan = sdev->channel;
>> +
>> +	lli = find_and_create_lun(sdev, manage->wwid);
>> +	pr_debug("%s: ENTER: WWID = %016llX%016llX, flags = %016llX li = %p\n",
>> +		 __func__, get_unaligned_le64(&manage->wwid[0]),
>> +		 get_unaligned_le64(&manage->wwid[8]),
>> +		 manage->hdr.flags, lli);
>> +	if (unlikely(!lli)) {
>> +		rc = -ENOMEM;
>> +		goto out;
>> +	}
>> +
>> +	if (flags & DK_CXLFLASH_MANAGE_LUN_ENABLE_SUPERPIPE) {
>> +		if (lli->newly_created)
>> +			lli->port_sel = CHAN2PORT(chan);
>> +		else
>> +			lli->port_sel = BOTH_PORTS;
>> +		/* Store off lun in unpacked, AFU-friendly format */
>> +		lli->lun_id[chan] = lun_to_lunid(sdev->lun);
>> +		sdev->hostdata = lli;
>> +	} else if (flags & DK_CXLFLASH_MANAGE_LUN_DISABLE_SUPERPIPE) {
>> +		if (lli->parent->mode != MODE_NONE)
>> +			rc = -EBUSY;
>> +		else
>> +			sdev->hostdata = NULL;
> 
> I don't see any locking in this function. What if you had two processes calling
> this ioctl at the same time with different parameters? Could we get into a strange
> state?

There is a lock taken inside the find_and_create_lun() routine, so we’re protected there
in terms of deriving a local lun reference. However we should extend that outside as well
to protect the local lun updates that are made after obtaining the reference.

>> @@ -2439,6 +2470,9 @@ static int __init init_cxlflash(void)
>>  */
>> static void __exit exit_cxlflash(void)
>> {
>> +	cxlflash_term_global_luns();
>> +	cxlflash_free_errpage();
> 
> Both these functions free up memory. I'm concerned that memory could still be in use.
> You probably need to add some refcounting to your global objects so you know when you
> can free them up. kref is your friend.

I don’t think there’s an issue if we put these after the call to pci_unregister_driver() as that
will tear down any remaining users of this global memory. That said, I do agree with the value
of kref and we will look into using it for the global luns and other objects. The error page I view
as more of a ‘allocate once and leave it’ type of object due to the fact that it is relied upon in
error paths.

> 
>> +
>> 	pci_unregister_driver(&cxlflash_driver);
>> }
>> 
>> diff --git a/drivers/scsi/cxlflash/sislite.h b/drivers/scsi/cxlflash/sislite.h
>> index bf5d399..66b8891 100644
>> --- a/drivers/scsi/cxlflash/sislite.h
>> +++ b/drivers/scsi/cxlflash/sislite.h
>> @@ -409,7 +409,10 @@ struct sisl_lxt_entry {
>> 
>> };
>> 
>> -/* Per the SISlite spec, RHT entries are to be 16-byte aligned */
>> +/*
>> + * RHT - Resource Handle Table
>> + * Per the SISlite spec, RHT entries are to be 16-byte aligned
>> + */
>> struct sisl_rht_entry {
>> 	struct sisl_lxt_entry *lxt_start;
>> 	u32 lxt_cnt;
> 
>> +/**
>> + * cxlflash_stop_term_user_contexts() - stops/terminates known user contexts
>> + * @cfg:	Internal structure associated with the host.
>> + *
>> + * When the host needs to go down, all users must be quiesced and their
>> + * memory freed. This is accomplished by putting the contexts in error
>> + * state which will notify the user and let them 'drive' the tear-down.
>> + * Meanwhile, this routine camps until all user contexts have been removed.
> 
> Can you clarify this a bit more? What if the user ignores this notification?
> Perhaps the process has been sent a SIGSTOP or the process is misbehaved.
> Will we ever exit this loop? Are we actually dependent on the user process to
> do something to exit this loop?

The ’notification’ the user receives will be their inability to function. So the user
could ignore but they won’t be able to make any forward progress until they
recover.

If the process goes away we are notified (release() called on exit) and will take
the appropriate cleanup actions. In such a case the loop would exit.

We have discussed a timeout approach in the past for the case where the user
never comes back but were unable to arrive at a value or the best manner in
which to tell the user they’re being permanently removed. We can have those
discussions again and look at arriving at something safer.

>> +static int read_cap16(struct scsi_device *sdev, struct llun_info *lli)
>> +{
>> +	struct cxlflash_cfg *cfg = (struct cxlflash_cfg *)sdev->host->hostdata;
>> +	struct device *dev = &cfg->dev->dev;
>> +	struct glun_info *gli = lli->parent;
>> +	u8 *cmd_buf = NULL;
>> +	u8 *scsi_cmd = NULL;
>> +	u8 *sense_buf = NULL;
>> +	int rc = 0;
>> +	int result = 0;
>> +	int retry_cnt = 0;
>> +	u32 tout = (MC_DISCOVERY_TIMEOUT * HZ);
> 
> Looks like you are using only a 5 second timeout for this. sd.c uses 30 seconds. 
> 5 might be a bit on the short side.

Okay, will look at using a larger value.

> 
>> +
>> +/**
>> + * cxlflash_lun_detach() - detaches a user from a LUN and resets the LUN's mode
>> + * @gli:	LUN to detach.
>> + */
>> +void cxlflash_lun_detach(struct glun_info *gli)
>> +{
>> +	mutex_lock(&gli->mutex);
>> +	WARN_ON(gli->mode == MODE_NONE);
>> +	if (--gli->users == 0)
>> +		gli->mode = MODE_NONE;
> 
> It looks like the only place you free this struct glun_info object is when your
> module gets unloaded. Would be nice if it got freed up when it was no longer in
> use, but to do this will require some re-work on how you manage reference counting
> in this object. You probably also want to use a kref to manage the object lifetime
> rather than inventing your own.

I think you’re misinterpreting this code a bit.

This routine (and its counterpart - ‘attach’) keep track of how a known, existing
LUN is being used: either directly (user-mediated physical access) or virtually
(kernel-mediated virtual access). One of the key design points required to support
the virtual LUN feature is that a LUN cannot be accessed simultaneously by users
with differing usage mode requirements. A user with direct access could simply
overwrite storage provisioned for a user with virtual access. Thus, the ‘user’ count
going to 0 is not indicative of the LUN going away, but rather returning to a neutral
state. We’ll take a look at how we could migrate this logic to a kref.

With regard to actually freeing the global lun object when it’s no longer in use, we
will investigate how to best implement this.

>> +	/* Reset the file descriptor to indicate we're on a close() thread */
>> +	ctxi->lfd = -1;
>> +	detach.context_id = ctxi->ctxid;
>> +	list_for_each_entry_safe(lun_access, t, &ctxi->luns, list)
>> +		_cxlflash_disk_detach(lun_access->sdev, ctxi, &detach);
> 
> I think you have a potential oops here. If you have a user context created for
> an scsi device, you then delete the scsi device via sysfs, then, if there is any
> way to go down this path, which I would hope there is, otherwise you have a
> different issue, then you may pass an sdev pointer that now points to freed memory.

Yes, agree there is an issue here.

> 
> There are a couple possible ways you could solve this.
> 
> 1. It might work to use scsi_get_device / scsi_put_device to get / put a reference to the scsi device
>   for each user context associated with it.
> 2. The other option would be to implement a slave_destroy function and have it kill off
>   any open user contexts.

Will likely go with option 1 (prototype looks promising).

>> +		list_for_each_entry(lun_access, &ctxi->luns, list)
>> +			if (lun_access->lli == lli) {
>> +				dev_dbg(dev, "%s: Already attached!\n",
>> +					__func__);
>> +				rc = -EINVAL;
>> +				goto out;
>> +			}
>> +	}
>> +
>> +	lun_access = kzalloc(sizeof(*lun_access), GFP_KERNEL);
>> +	if (unlikely(!lun_access)) {
> 
> You can ditch the unlikely throughout all your ioctls. ioctls are not a performance path.

We’ll look at cutting some of these back, although I think many of them make sense
where they’re at and aren’t hurting anything.

>> +out_attach:
>> +	attach->hdr.return_flags = 0;
>> +	attach->context_id = ctxi->ctxid;
>> +	attach->block_size = gli->blk_len;
>> +	attach->mmio_size = sizeof(afu->afu_map->hosts[0].harea);
>> +	attach->last_lba = gli->max_lba;
>> +	attach->max_xfer = (sdev->host->max_sectors * 512) / gli->blk_len;
> 
> Would be better to use a literal instead of a magic number here.

Agreed, will change.

>> +	/* Test if in error state */
>> +	reg = readq_be(&afu->ctrl_map->mbox_r);
> 
> By the time this read completes, pdev->error_state is already set to pci_channel_io_frozen. If that
> is not the case here, we need to fix that. You should be able to just use pci_channel_offline(pdev) here
> and not need any delay.

The delay is in place because we’re waiting for our own internal state machine
to catch up (flipped to RESET in the EEH error_detected() callback. If we can
find a way to integrate the pci_channel_offline() check to avoid the slight delay
without over complicating things I’d be open to adopting that, however I don’t
see much of an issue with taking a short break here.

I’ll also add that when we adopted this, there was a bug in the CXL vphb
services where they are not propagating the error_state. I believe that was
fixed in the upstreamed CXL EEH code.

> 
>> +	if (reg == -1) {
>> +		dev_dbg(dev, "%s: MMIO read fail! Wait for recovery...\n",
>> +			__func__);
>> +		mutex_unlock(&ctxi->mutex);
>> +		ctxi = NULL;
>> +		ssleep(1);
>> +		rc = check_state(cfg);
>> +		if (unlikely(rc))
>> +			goto out;
>> +		goto retry;
>> +	}
>> +
>> +	dev_dbg(dev, "%s: MMIO working, no recovery required!\n", __func__);
>> +out:

>> +/* Chunk size parms: note sislite minimum chunk size is
>> +   0x10000 LBAs corresponding to a NMASK or 16.
>> +*/
>> +#define MC_CHUNK_SIZE     (1 << MC_RHT_NMASK)	/* in LBAs */
>> +
>> +#define MC_DISCOVERY_TIMEOUT 5  /* 5 secs */
> 
> Seems like the only place you use this is for the read capacity you issue,
> so MC_READ_CAP timeout might be more accurate.

This name was a holdover from when the timeout was used in
various places. Will change it now that its scope is well-defined.

--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html