Re: iSCSI Multipath (Load Balancing) vs RBD Exclusive Lock

On 03/14/2018 04:28 PM, Maxim Patlasov wrote:
> On Wed, Mar 14, 2018 at 12:05 PM, Michael Christie <mchristi@xxxxxxxxxx> wrote:
> 
>         On 03/14/2018 01:27 PM, Michael Christie wrote:
>         > On 03/14/2018 01:24 PM, Maxim Patlasov wrote:
>         >> On Wed, Mar 14, 2018 at 11:13 AM, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>         >>
>         >>     Maxim, can you provide steps for a reproducer?
>         >>
>         >>
>         >> Yes, but it involves adding two artificial delays: one in tcmu-runner
>         >> and another in the kernel iSCSI target. If you're willing to take the pains of
>         >
>         > Send the patches for the changes.
>         >
>         >> ...
>         When you send the patches that add your delays, could you also send the
>         target-side /var/log/tcmu-runner.log with log_level = 4.
>         ...
> 
> 
> Mike, please see the patches and /var/log/tcmu-runner.log attached.
> 
> The timeline was like this:
> 
> 1) 13:56:31 the client (iSCSI initiator) sends a request leading to
> "Acquired exclusive lock." on the gateway.
> 2) 13:56:49 tcmu-runner is suspended by SIGSTOP
> 3) 13:56:50 the client executes:
> 
> dd of=/dev/mapper/mpatha if=/dev/zero oflag=direct bs=1536 count=1 seek=10 &
> dd of=/dev/mapper/mpatha if=/dev/zero oflag=direct bs=2560 count=1 seek=10 &
> dd of=/dev/mapper/mpatha if=/dev/zero oflag=direct bs=3584 count=1 seek=10
> 
> 4) 13:56:51 the gateway is cut off from the client (and from the
> neighboring gateways) by "iptables ... -j DROP"
> 5) 13:57:06 the client switches to another path and completes the three
> requests above
> 6) 13:57:07 the client executes (the other path is still active):
> 
> dd of=/dev/mapper/mpatha if=/dev/urandom oflag=direct bs=3584 count=1 seek=10
> 
> 7) 13:57:09 tcmu-runner is resumed by SIGCONT
> 8) 13:57:15 tcmu-runner successfully processes the third request (zeroes,
> bs=3584), overwriting the newer data.
> 
> 9) verify that the newer data was really overwritten:
> 
> # dd if=/dev/mapper/mpatha iflag=direct bs=3584 count=1 skip=10 |od -x
> 1+0 records in
> 1+0 records out
> 3584 bytes (3.6 kB) copied, 0.00232227 s, 1.5 MB/s
> 0000000 0000 0000 0000 0000 0000 0000 0000 0000
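> 
> (For anyone repeating this, the gateway-side half of the sequence is
> roughly the sketch below. The iptables rule, interface name, and sleep
> are placeholders; the exact rule was trimmed above.)
> 
> # on the stale gateway -- placeholder rule/interface, adjust to your setup
> kill -STOP "$(pidof tcmu-runner)"     # step 2: suspend tcmu-runner
> iptables -A INPUT -i eth0 -j DROP     # step 4: isolate this gateway
> sleep 20                              # wait for the initiator to fail over
> kill -CONT "$(pidof tcmu-runner)"     # step 7: resume tcmu-runner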
> 
> Thanks,
> Maxim
> 
> 
> delay-request-in-kernel.diff
> 
> 
> diff --git a/drivers/target/iscsi/iscsi_target.c b/drivers/target/iscsi/iscsi_target.c
> index 9eb10d3..f48ee2c 100644
> --- a/drivers/target/iscsi/iscsi_target.c
> +++ b/drivers/target/iscsi/iscsi_target.c
> @@ -1291,6 +1291,13 @@ int iscsit_process_scsi_cmd(struct iscsi_conn *conn, struct iscsi_cmd *cmd,
>  
>  	immed_ret = iscsit_handle_immediate_data(cmd, hdr,
>  					cmd->first_burst_len);
> +
> +	if (be32_to_cpu(hdr->data_length) == 3584) {
> +		u64 end_time = ktime_get_ns() + 25ULL * 1000 * 1000 * 1000;
> +		while (ktime_get_ns() < end_time)
> +			schedule_timeout_uninterruptible(HZ);
> +	}
> +


It looks like there is a bug.

1. I introduced a regression when I stopped killing the iscsi connection
when the lock is taken away from us; that change worked around a failback
bug where the connection killing caused ping-ponging. Combined with #2
below, it causes the bug you hit.

2. I did not anticipate sleeps like the ones above, injected at an
arbitrary point in the kernel. If a command had really gotten stuck on the
network, the nop timer would fire, forcing the iscsi thread's recv() to
fail and the submitting thread to exit. We should also handle the
delay-request-in-tcmu-runner.diff case OK, because we wait for those
commands. However, a thread could simply hit a preemption point and not be
rescheduled for longer than the failover timeout. For example, some buggy
code could hog all the CPUs for more than the failover timeout and then
recover, and we would hit exactly the bug your patch above simulates.
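For context, the nop timer and the failover window above map to
initiator-side open-iscsi settings on a Linux client (an assumption about
the test setup; the values below are only illustrative, not
recommendations):

# /etc/iscsi/iscsid.conf -- illustrative values
#   node.conn[0].timeo.noop_out_interval = 5     # ping the target every 5s
#   node.conn[0].timeo.noop_out_timeout = 5      # fail the conn if no nop reply in 5s
#   node.session.timeo.replacement_timeout = 15  # queue I/O this long before failing paths
# the same knobs can be set on existing node records with iscsiadm, e.g.:
iscsiadm -m node -o update -n node.conn[0].timeo.noop_out_timeout -v 5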

The two attached patches fix the issues for me on Linux. Note that the fix
currently only works on Linux and only with two nodes. It probably also
works for ESX/Windows, but I need to reconfigure some timers first.

Apply ceph-iscsi-config-explicit-standby.patch to ceph-iscsi-config and
tcmu-runner-use-explicit.patch to tcmu-runner.
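Applying them is the usual routine; a sketch, assuming git checkouts of
both repositories (paths are examples):

cd ~/src/ceph-iscsi-config
git apply /path/to/ceph-iscsi-config-explicit-standby.patch
cd ~/src/tcmu-runner
git apply /path/to/tcmu-runner-use-explicit.patch
# rebuild/reinstall tcmu-runner and restart the gateway services afterwards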


diff --git a/alua.c b/alua.c
index 9b36e9f..20e01ef 100644
--- a/alua.c
+++ b/alua.c
@@ -56,6 +56,17 @@ static int tcmu_get_alua_int_setting(struct alua_grp *group,
 	return tcmu_get_cfgfs_int(path);
 }
 
+static int tcmu_set_alua_int_setting(struct alua_grp *group,
+				     const char *setting, int val)
+{
+	char path[PATH_MAX];
+
+	snprintf(path, sizeof(path), CFGFS_CORE"/%s/%s/alua/%s/%s",
+		 group->dev->tcm_hba_name, group->dev->tcm_dev_name,
+		 group->name, setting);
+	return tcmu_set_cfgfs_ul(path, val);
+}
+
 static void tcmu_release_tgt_ports(struct alua_grp *group)
 {
 	struct tgt_port *port, *port_next;
@@ -205,10 +216,28 @@ tcmu_get_alua_grp(struct tcmu_device *dev, const char *name)
 		rdev->failover_type = TMCUR_DEV_FAILOVER_IMPLICIT;
 
 		group->tpgs = TPGS_ALUA_IMPLICIT;
-	} else if (!strcmp(str_val, "Explicit") ||
-		   !strcmp(str_val, "Implicit and Explicit")) {
-		tcmu_dev_warn(dev, "Unsupported alua_access_type: Explicit failover not supported.\n");
+	} else if (!strcmp(str_val, "Explicit")) {
+		/*
+		 * kernel requires both implicit and explicit so we can
+		 * update the state via configfs.
+		 */
+		tcmu_dev_warn(dev, "Unsupported alua_access_type: Explicit only failover not supported.\n");
+
 		goto free_str_val;
+	} else if (!strcmp(str_val, "Implicit and Explicit")) {
+		if (!failover_is_supported(dev)) {
+			tcmu_dev_err(dev, "device failover is not supported with the alua access type: Implicit and Explicit\n");
+			goto free_str_val;
+		}
+
+		/*
+		 * Only report explicit so initiator always sends STPG.
+		 * We only need implicit enabled in the kernel so we can
+		 * interact with the alua configfs interface.
+		 */
+		rdev->failover_type = TMCUR_DEV_FAILOVER_EXPLICIT;
+
+		group->tpgs = TPGS_ALUA_EXPLICIT;
 	} else {
 		tcmu_dev_err(dev, "Invalid ALUA type %s", str_val);
 		goto free_str_val;
@@ -344,22 +373,60 @@ struct tgt_port *tcmu_get_enabled_port(struct list_head *group_list)
 	return NULL;
 }
 
+static uint8_t lock_state_to_alua_state(struct tcmu_device *dev,
+					struct alua_grp *group,
+					struct tgt_port *enabled_port,
+					int lock_state)
+{
+	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
+
+	if (rdev->failover_type != TMCUR_DEV_FAILOVER_EXPLICIT)
+		return group->state;
+
+	/* we only support standby and AO for now */
+	switch (lock_state) {
+	case TCMUR_DEV_LOCK_NO_HOLDERS:
+		return ALUA_ACCESS_STATE_STANDBY;
+	case TCMUR_DEV_LOCK_LOCKED:
+		if (enabled_port->grp == group)
+			return ALUA_ACCESS_STATE_OPTIMIZED;
+		return ALUA_ACCESS_STATE_STANDBY;
+	case TCMUR_DEV_LOCK_UNLOCKED:
+		if (enabled_port->grp == group)
+			return ALUA_ACCESS_STATE_STANDBY;
+		/*
+		 * This only works for 2 nodes:
+		 * Someone has the lock. It is not the local node and the group
+		 * is for the remote node, so it must be AO.
+		 *
+		 * TODO: Add interface to allow the cluster to tell us
+		 * which backend client matches which alua group.
+		 */
+		return ALUA_ACCESS_STATE_OPTIMIZED;
+	case TCMUR_DEV_LOCK_UNKNOWN:
+		return ALUA_ACCESS_STATE_STANDBY;
+	}
+
+	return ALUA_ACCESS_STATE_STANDBY;
+}
+
 int tcmu_emulate_report_tgt_port_grps(struct tcmu_device *dev,
 				      struct list_head *group_list,
 				      struct tcmulib_cmd *cmd)
 {
 	struct alua_grp *group;
-	struct tgt_port *port;
+	struct tgt_port *port, *enabled_port;
 	int ext_hdr = cmd->cdb[1] & 0x20;
 	uint32_t off = 4, ret_data_len = 0, ret32;
 	uint32_t alloc_len = tcmu_get_xfer_length(cmd->cdb);
-	uint8_t *buf;
+	uint8_t *buf, state;
+	int lock_state;
 
-	if (!tcmu_get_enabled_port(group_list))
+	enabled_port = tcmu_get_enabled_port(group_list);
+	if (!enabled_port)
+		/* unsupported config */
 		return TCMU_NOT_HANDLED;
 
-	tcmu_update_dev_lock_state(dev);
-
 	if (alloc_len < 4)
 		return tcmu_set_sense_data(cmd->sense_buf, ILLEGAL_REQUEST,
 					   ASC_INVALID_FIELD_IN_CDB, NULL);
@@ -369,6 +436,8 @@ int tcmu_emulate_report_tgt_port_grps(struct tcmu_device *dev,
 		return tcmu_set_sense_data(cmd->sense_buf, HARDWARE_ERROR,
 					   ASC_INTERNAL_TARGET_FAILURE, NULL);
 
+	lock_state = tcmu_update_dev_lock_state(dev);
+
 	if (ext_hdr && alloc_len > 5) {
 		buf[4] = 0x10;
 		/*
@@ -392,6 +461,27 @@ int tcmu_emulate_report_tgt_port_grps(struct tcmu_device *dev,
 		if (group->pref)
 			buf[off] = 0x80;
 
+		state = lock_state_to_alua_state(dev, group, enabled_port,
+						 lock_state);
+		/*
+		 * Some handlers are not able to async update state during STPG
+		 * so update it now.
+		 */
+		if (state != group->state) {
+			group->state = state;
+			if (tcmu_set_alua_int_setting(group,
+						      "alua_access_state",
+						      state)) {
+				/*
+				 * This should never happen, so just log it.
+				 * If it does, we catch it in the state check
+				 * or via blacklisting.
+				 */
+				tcmu_dev_err(dev, "Could not change kernel state to %u\n",
+					     state);
+			}
+		}
+
 		buf[off++] |= group->state;
 		buf[off++] |= group->supported_states;
 		buf[off++] = (group->id >> 8) & 0xff;
@@ -433,7 +523,7 @@ bool failover_is_supported(struct tcmu_device *dev)
 static void *alua_lock_thread_fn(void *arg)
 {
 	/* TODO: set UA based on bgly's patches */
-	tcmu_acquire_dev_lock(arg);
+	tcmu_acquire_dev_lock(arg, false);
 	return NULL;
 }
 
@@ -481,3 +571,195 @@ done:
 	pthread_mutex_unlock(&rdev->state_lock);
 	return ret;
 }
+
+static bool alua_check_sup_state(uint8_t state, uint8_t sup)
+{
+	switch (state) {
+	case ALUA_ACCESS_STATE_OPTIMIZED:
+		if (sup & ALUA_SUP_OPTIMIZED)
+			return true;
+		return false;
+	case ALUA_ACCESS_STATE_NON_OPTIMIZED:
+		if (sup & ALUA_SUP_NON_OPTIMIZED)
+			return true;
+		return false;
+	case ALUA_ACCESS_STATE_STANDBY:
+		if (sup & ALUA_SUP_STANDBY)
+			return true;
+		return false;
+	case ALUA_ACCESS_STATE_UNAVAILABLE:
+		if (sup & ALUA_SUP_UNAVAILABLE)
+			return true;
+		return false;
+	case ALUA_ACCESS_STATE_OFFLINE:
+		/*
+		 * TODO: support secondary states
+		 */
+		return false;
+	}
+
+	return false;
+}
+
+static int tcmu_explicit_transition(struct alua_grp *group,
+				    uint8_t new_state, uint8_t alua_status,
+				    uint8_t *sense)
+{
+	struct tcmu_device *dev = group->dev;
+	int ret;
+
+	tcmu_dev_dbg(dev, "transition group %u new state %u old state %u sup 0x%x\n",
+		     group->id, new_state, group->state, group->supported_states);
+
+	if (!alua_check_sup_state(new_state, group->supported_states))
+		return tcmu_set_sense_data(sense, ILLEGAL_REQUEST,
+					   ASC_INVALID_FIELD_IN_PARAMETER_LIST,
+					   NULL);
+
+	switch (new_state) {
+	case ALUA_ACCESS_STATE_OPTIMIZED:
+		if (failover_is_supported(dev) &&
+		    tcmu_acquire_dev_lock(dev, true)) {
+			return tcmu_set_sense_data(sense, HARDWARE_ERROR,
+						   ASC_STPG_CMD_FAILED, NULL);
+		}
+		break;
+	case ALUA_ACCESS_STATE_NON_OPTIMIZED:
+	case ALUA_ACCESS_STATE_UNAVAILABLE:
+	case ALUA_ACCESS_STATE_OFFLINE:
+		/* TODO we only support standby and AO */
+		tcmu_dev_err(dev, "Ignoring ANO/unavail/offline\n");
+		return tcmu_set_sense_data(sense, ILLEGAL_REQUEST,
+					   ASC_INVALID_FIELD_IN_PARAMETER_LIST,
+					   NULL);
+	case ALUA_ACCESS_STATE_STANDBY:
+		/*
+		 * TODO: we only see this in verification tests.
+		 * Add back unlock in final commit.
+		 */
+		tcmu_dev_err(dev, "Ignoring standby\n");
+		return tcmu_set_sense_data(sense, ILLEGAL_REQUEST,
+					   ASC_INVALID_FIELD_IN_PARAMETER_LIST,
+					   NULL);
+	default:
+		return tcmu_set_sense_data(sense, ILLEGAL_REQUEST,
+					   ASC_INVALID_FIELD_IN_PARAMETER_LIST,
+					   NULL);
+	}
+
+	ret = tcmu_set_alua_int_setting(group, "alua_access_state", new_state);
+	if (ret) {
+		tcmu_dev_err(dev, "Could not change kernel state to %u\n",
+			     new_state);
+		/*
+		 * TODO drop the lock
+		 */
+		return tcmu_set_sense_data(sense, HARDWARE_ERROR,
+					   ASC_STPG_CMD_FAILED, NULL);
+	}
+
+	ret = tcmu_set_alua_int_setting(group, "alua_access_status", alua_status);
+	if (ret)
+		/* Only the RTPG info will be wrong, so just log an error. */
+		tcmu_dev_err(dev, "Could not set alua_access_status for group %s:%d\n",
+			     group->name, group->id);
+
+	group->state = new_state;
+	group->status = alua_status;
+	return SAM_STAT_GOOD;
+}
+
+int tcmu_emulate_set_tgt_port_grps(struct tcmu_device *dev,
+				   struct list_head *group_list,
+				   struct tcmulib_cmd *cmd)
+{
+	struct alua_grp *group;
+	uint32_t off = 4, param_list_len = tcmu_get_xfer_length(cmd->cdb);
+	uint16_t id, tmp_id;
+	char *buf, new_state;
+	int found, ret = SAM_STAT_GOOD;
+
+	if (!tcmu_get_enabled_port(group_list))
+		return TCMU_NOT_HANDLED;
+
+	if (!param_list_len)
+		return SAM_STAT_GOOD;
+
+	buf = calloc(1, param_list_len);
+	if (!buf)
+		return tcmu_set_sense_data(cmd->sense_buf, HARDWARE_ERROR,
+					   ASC_INTERNAL_TARGET_FAILURE, NULL);
+
+	if (tcmu_memcpy_from_iovec(buf, param_list_len, cmd->iovec,
+				   cmd->iov_cnt) != param_list_len) {
+		ret = tcmu_set_sense_data(cmd->sense_buf, ILLEGAL_REQUEST,
+					  ASC_PARAMETER_LIST_LENGTH_ERROR,
+					  NULL);
+		goto free_buf;
+	}
+
+	while (off < param_list_len) {
+		new_state = buf[off++] & 0x0f;
+		/* reserved */
+		off++;
+		memcpy(&tmp_id, &buf[off], sizeof(tmp_id));
+		id = be16toh(tmp_id);
+		off += 2;
+
+		found = 0;
+		list_for_each(group_list, group, entry) {
+			if (group->id != id)
+				continue;
+
+			tcmu_dev_dbg(dev, "Got STPG for group %u\n", id);
+			ret = tcmu_explicit_transition(group, new_state,
+					ALUA_STAT_ALTERED_BY_EXPLICIT_STPG,
+					cmd->sense_buf);
+			if (ret) {
+				tcmu_dev_err(dev, "Failing STPG for group %d\n",
+					      id);
+				goto free_buf;
+			}
+			found = 1;
+			break;
+		}
+
+		if (!found) {
+			/*
+			 * Could not find what error code to return in SCSI
+			 * spec.
+			 */
+			tcmu_dev_err(dev, "Could not find group for %u for STPG\n",
+				      id);
+			ret = tcmu_set_sense_data(cmd->sense_buf,
+						  HARDWARE_ERROR,
+						  ASC_STPG_CMD_FAILED, NULL);
+			break;
+		}
+	}
+
+free_buf:
+	free(buf);
+	return ret;
+}
+
+int alua_check_state(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
+{
+	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
+
+	if (!failover_is_supported(dev))
+		return 0;
+
+	if (rdev->failover_type == TMCUR_DEV_FAILOVER_EXPLICIT) {
+		if (rdev->lock_state != TCMUR_DEV_LOCK_LOCKED) {
+			tcmu_dev_dbg(dev, "device lock not held.\n");
+			return tcmu_set_sense_data(cmd->sense_buf, NOT_READY,
+						   ASC_PORT_IN_STANDBY,
+						   NULL);
+		}
+	} else if (rdev->failover_type == TMCUR_DEV_FAILOVER_IMPLICIT) {
+		return alua_implicit_transition(dev, cmd);
+	}
+
+	return 0;
+}
diff --git a/alua.h b/alua.h
index 9d80573..dd136c2 100644
--- a/alua.h
+++ b/alua.h
@@ -47,10 +47,14 @@ struct alua_grp {
 int tcmu_emulate_report_tgt_port_grps(struct tcmu_device *dev,
 				      struct list_head *group_list,
 				      struct tcmulib_cmd *cmd);
+int tcmu_emulate_set_tgt_port_grps(struct tcmu_device *dev,
+				   struct list_head *group_list,
+				   struct tcmulib_cmd *cmd);
 struct tgt_port *tcmu_get_enabled_port(struct list_head *);
 int tcmu_get_alua_grps(struct tcmu_device *, struct list_head *);
 void tcmu_release_alua_grps(struct list_head *);
 int alua_implicit_transition(struct tcmu_device *dev, struct tcmulib_cmd *cmd);
 bool failover_is_supported(struct tcmu_device *dev);
+int alua_check_state(struct tcmu_device *dev, struct tcmulib_cmd *cmd);
 
 #endif
diff --git a/rbd.c b/rbd.c
index a0bfef4..a170e6d 100644
--- a/rbd.c
+++ b/rbd.c
@@ -513,15 +513,30 @@ static int tcmu_rbd_has_lock(struct tcmu_device *dev)
 
 static int tcmu_rbd_get_lock_state(struct tcmu_device *dev)
 {
+	struct tcmu_rbd_state *state = tcmu_get_dev_private(dev);
+	rbd_lock_mode_t lock_mode;
+	char *owners[1];
+	size_t num_owners = 1;
 	int ret;
 
+	ret = rbd_lock_get_owners(state->image, &lock_mode, owners,
+				  &num_owners);
+	if (ret == -ENOENT || (!ret && !num_owners)) {
+		tcmu_dev_dbg(dev, "no holders %d\n", ret);
+		return TCMUR_DEV_LOCK_NO_HOLDERS;
+	}
+	if (!ret && num_owners)
+		rbd_lock_get_owners_cleanup(owners, num_owners);
+
 	ret = tcmu_rbd_has_lock(dev);
-	if (ret == 1)
+	if (ret == 1) {
 		return TCMUR_DEV_LOCK_LOCKED;
-	else if (ret == 0 || ret == -ESHUTDOWN)
+	} else if (ret == 0 || ret == -ESHUTDOWN) {
 		return TCMUR_DEV_LOCK_UNLOCKED;
-	else
+	} else {
+		tcmu_notify_conn_lost(dev);
 		return TCMUR_DEV_LOCK_UNKNOWN;
+	}
 }
 
 /**
@@ -604,7 +619,7 @@ static int tcmu_rbd_lock(struct tcmu_device *dev)
 	 * TODO: Add retry/timeout settings to handle windows/ESX.
 	 * Or, set to transitioning and grab the lock in the background.
 	 */
-	while (attempts++ < 5) {
+	while (attempts++ < 2) {
 		ret = tcmu_rbd_has_lock(dev);
 		if (ret == 1) {
 			ret = 0;
diff --git a/tcmur_cmd_handler.c b/tcmur_cmd_handler.c
index 46c1dd7..d41c23b 100644
--- a/tcmur_cmd_handler.c
+++ b/tcmur_cmd_handler.c
@@ -825,31 +825,13 @@ static int tcmur_writesame_work_fn(struct tcmu_device *dev,
 	return write_same_fn(dev, cmd, off, len, cmd->iovec, cmd->iov_cnt);
 }
 
-static inline int tcmur_alua_implicit_transition(struct tcmu_device *dev,
-					  struct tcmulib_cmd *cmd)
-{
-	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
-	int ret;
-
-	if (!failover_is_supported(dev))
-		return 0;
-
-	if (rdev->failover_type == TMCUR_DEV_FAILOVER_IMPLICIT) {
-		ret = alua_implicit_transition(dev, cmd);
-		if (ret)
-			return ret;
-	}
-
-	return 0;
-}
-
 int tcmur_handle_writesame(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
 			   tcmur_writesame_fn_t write_same_fn)
 {
 	struct tcmur_handler *rhandler = tcmu_get_runner_handler(dev);
 	int ret;
 
-	ret = tcmur_alua_implicit_transition(dev, cmd);
+	ret = alua_check_state(dev, cmd);
 	if (ret)
 		return ret;
 
@@ -1886,7 +1868,7 @@ int tcmur_handle_caw(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
 {
         int ret;
 
-        ret = tcmur_alua_implicit_transition(dev, cmd);
+        ret = alua_check_state(dev, cmd);
         if (ret)
                 return ret;
 
@@ -2214,6 +2196,23 @@ clear_format:
 }
 
 /* ALUA */
+static int handle_stpg(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
+{
+	struct list_head group_list;
+	int ret;
+
+	list_head_init(&group_list);
+
+	ret = tcmu_get_alua_grps(dev, &group_list);
+	if (ret)
+		return tcmu_set_sense_data(cmd->sense_buf, HARDWARE_ERROR,
+					   ASC_INTERNAL_TARGET_FAILURE, NULL);
+
+	ret = tcmu_emulate_set_tgt_port_grps(dev, &group_list, cmd);
+	tcmu_release_alua_grps(&group_list);
+	return ret;
+}
+
 static int handle_rtpg(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
 {
 	struct list_head group_list;
@@ -2299,7 +2298,7 @@ static int tcmur_cmd_handler(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
 		goto untrack;
 	}
 
-	ret = tcmur_alua_implicit_transition(dev, cmd);
+	ret = alua_check_state(dev, cmd);
 	if (ret)
 		goto untrack;
 
@@ -2422,6 +2421,10 @@ static int handle_sync_cmd(struct tcmu_device *dev, struct tcmulib_cmd *cmd)
 		if ((cdb[1] & 0x1f) == RCR_SA_OPERATING_PARAMETERS)
 			return handle_recv_copy_result(dev, cmd);
 		return TCMU_NOT_HANDLED;
+	case MAINTENANCE_OUT:
+		if (cdb[1] == MO_SET_TARGET_PGS)
+			return handle_stpg(dev, cmd);
+		return TCMU_NOT_HANDLED;
 	case MAINTENANCE_IN:
 		if ((cdb[1] & 0x1f) == MI_REPORT_TARGET_PGS)
 			return handle_rtpg(dev, cmd);
diff --git a/tcmur_device.c b/tcmur_device.c
index 5175924..e672858 100644
--- a/tcmur_device.c
+++ b/tcmur_device.c
@@ -235,14 +235,14 @@ int tcmu_cancel_lock_thread(struct tcmu_device *dev)
  * lock. Update lock state now to avoid firing the error
  * handler later.
  */
-void tcmu_update_dev_lock_state(struct tcmu_device *dev)
+int tcmu_update_dev_lock_state(struct tcmu_device *dev)
 {
 	struct tcmur_handler *rhandler = tcmu_get_runner_handler(dev);
 	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
 	int state;
 
 	if (!rhandler->get_lock_state)
-		return;
+		return -1;
 
 	state = rhandler->get_lock_state(dev);
 	pthread_mutex_lock(&rdev->state_lock);
@@ -252,9 +252,10 @@ void tcmu_update_dev_lock_state(struct tcmu_device *dev)
 		rdev->lock_state = TCMUR_DEV_LOCK_UNLOCKED;
 	}
 	pthread_mutex_unlock(&rdev->state_lock);
+	return state;
 }
 
-int tcmu_acquire_dev_lock(struct tcmu_device *dev)
+int tcmu_acquire_dev_lock(struct tcmu_device *dev, bool skip_flush)
 {
 	struct tcmur_handler *rhandler = tcmu_get_runner_handler(dev);
 	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
@@ -272,9 +273,12 @@ int tcmu_acquire_dev_lock(struct tcmu_device *dev)
 
 	/*
 	 * Handle race where cmd could be in tcmur_generic_handle_cmd before
-	 * the aio handler.
+	 * the aio handler. For explicit ALUA, the multipath layer should
+	 * only have the STPG and commands like RTPG in flight so skip,
+	 * because we cannot wait on ourself.
 	 */
-	tcmu_flush_device(dev);
+	if (!skip_flush)
+		tcmu_flush_device(dev);
 
 retry:
 	tcmu_dev_dbg(dev, "lock call state %d retries %d\n",
diff --git a/tcmur_device.h b/tcmur_device.h
index f831a61..a30a123 100644
--- a/tcmur_device.h
+++ b/tcmur_device.h
@@ -34,12 +34,14 @@
 enum {
 	TMCUR_DEV_FAILOVER_ALL_ACTIVE,
 	TMCUR_DEV_FAILOVER_IMPLICIT,
+	TMCUR_DEV_FAILOVER_EXPLICIT,
 };
 
 enum {
 	TCMUR_DEV_LOCK_UNLOCKED,
 	TCMUR_DEV_LOCK_LOCKED,
 	TCMUR_DEV_LOCK_LOCKING,
+	TCMUR_DEV_LOCK_NO_HOLDERS,
 	TCMUR_DEV_LOCK_UNKNOWN,
 };
 
@@ -88,7 +90,7 @@ void tcmu_notify_lock_lost(struct tcmu_device *dev);
 int __tcmu_reopen_dev(struct tcmu_device *dev, bool in_lock_thread);
 int tcmu_reopen_dev(struct tcmu_device *dev);
 
-int tcmu_acquire_dev_lock(struct tcmu_device *dev);
-void tcmu_update_dev_lock_state(struct tcmu_device *dev);
+int tcmu_acquire_dev_lock(struct tcmu_device *dev, bool skip_ring_flush);
+int tcmu_update_dev_lock_state(struct tcmu_device *dev);
 
 #endif
diff --git a/ceph_iscsi_config/gateway.py b/ceph_iscsi_config/gateway.py
index 68f7416..ca7fd1a 100644
--- a/ceph_iscsi_config/gateway.py
+++ b/ceph_iscsi_config/gateway.py
@@ -313,13 +313,12 @@ class GWTarget(object):
                                  "group id {}".format(stg_object.name, tpg.tag))
                 group_name = "ao"
                 alua_tpg = ALUATargetPortGroup(stg_object, group_name, tpg.tag)
-                alua_tpg.alua_access_state = 0
+                alua_tpg.preferred = 1
             else:
-                self.logger.info("setting {} to ALUA/ActiveNONOptimised "
+                self.logger.info("setting {} to ALUA/Standby "
                                  "group id {}".format(stg_object.name, tpg.tag))
-                group_name = "ano{}".format(tpg.tag)
+                group_name = "standby{}".format(tpg.tag)
                 alua_tpg = ALUATargetPortGroup(stg_object, group_name, tpg.tag)
-                alua_tpg.alua_access_state = 1
         except RTSLibError as err:
                 self.logger.info("ALUA group id {} for stg obj {} lun {} "
                                  "already made".format(tpg.tag, stg_object, lun))
@@ -334,12 +333,16 @@ class GWTarget(object):
                 # were not able to bind to a lun last time.
 
         self.logger.debug("ALUA defined, updating state")
-        # Use implicit failover
-        alua_tpg.alua_access_type = 1
+        # Use Explicit but also set the Implicit bit so we can
+        # update the kernel from configfs.
+        alua_tpg.alua_access_type = 3
+        # start ports in Standby, and let the initiator drive the initial
+        # transition to AO.
+        alua_tpg.alua_access_state = 2
 
         alua_tpg.alua_support_offline = 0
-        alua_tpg.alua_support_unavailable = 1
-        alua_tpg.alua_support_standby = 0
+        alua_tpg.alua_support_unavailable = 0
+        alua_tpg.alua_support_standby = 1
         alua_tpg.alua_support_transitioning = 1
         alua_tpg.implicit_trans_secs = 60
         alua_tpg.nonop_delay_msecs = 0
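
(For readers decoding the numbers above: per SPC-4, alua_access_type 1 =
implicit, 2 = explicit, 3 = implicit+explicit, and alua_access_state 0 =
active/optimized, 1 = active/non-optimized, 2 = standby. The resulting
state can be watched from configfs on a gateway, following the CFGFS_CORE
path pattern used in alua.c above; the HBA, device, and group names below
are examples:)

# example only -- substitute your HBA/device/group names
cat /sys/kernel/config/target/core/user_0/disk_1/alua/standby1/alua_access_state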
diff --git a/rbd-target-gw.py b/rbd-target-gw.py
index 1daee92..17143b2 100755
--- a/rbd-target-gw.py
+++ b/rbd-target-gw.py
@@ -263,6 +263,30 @@ def define_gateway():
     return gateway
 
 
+def rbd_lock_cleanup(local_ips, rbd_image):
+    """
+    cleanup locks left if this node crashed and was not able to release them
+    :param local_ips: list of local ip addresses.
+    :param rbd_image: rbd image to clean up locking for
+    :return: None
+    """
+
+    lock_info = rbd_image.list_lockers()
+    if not lock_info:
+        return
+
+    lockers = lock_info.get("lockers")
+    for holder in lockers:
+        for ip in local_ips:
+            if ip in holder[2]:
+                logger.info("Cleaning up stale local lock for {} {}".format(
+                            holder[0], holder[1]))
+                try:
+                    rbd_image.break_lock(holder[0], holder[1])
+                except:
+                    halt("Error cleaning up rbd image {}".format(rbd_image))
+
+
 def define_luns(gateway):
     """
     define the disks in the config to LIO
@@ -272,6 +296,11 @@ def define_luns(gateway):
 
     local_gw = this_host()
 
+    ipv4_list = []
+    for iface in netifaces.interfaces():
+        dev_info = netifaces.ifaddresses(iface).get(netifaces.AF_INET, [])
+        ipv4_list += [dev['addr'] for dev in dev_info]
+
     # sort the disks dict keys, so the disks are registered in a specific
     # sequence
     disks = config.config['disks']
@@ -298,6 +327,8 @@ def define_luns(gateway):
                                 image_bytes = rbd_image.size()
                                 image_size_h = human_size(image_bytes)
 
+                                rbd_lock_cleanup(ipv4_list, rbd_image)
+
                                 lun = LUN(logger, pool, image_name,
                                           image_size_h, local_gw)
                                 if lun.error:
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
