On 03/15/2018 02:32 PM, Maxim Patlasov wrote:
> On Thu, Mar 15, 2018 at 12:48 AM, Mike Christie <mchristi@xxxxxxxxxx
> <mailto:mchristi@xxxxxxxxxx>> wrote:
>
> > ...
> >
> > It looks like there is a bug.
> >
> > 1. A regression was added when I stopped killing the iscsi connection
> > when the lock is taken away from us, to handle a failback bug where it
> > was causing ping-ponging. That, combined with #2, will cause the bug.
> >
> > 2. I did not anticipate the type of sleeps above, where they are
> > injected any old place in the kernel. If a command had really gotten
> > stuck on the network, then the nop timer would fire, which forces the
> > iscsi thread's recv() to fail and that submitting thread to exit. We
> > should also handle the delay-request-in-tcmu-runner.diff issue ok,
> > because we wait for those commands. However, we could just get
> > rescheduled due to hitting a preemption point, and we might not be
> > rescheduled for longer than the failover timeout. It could also just be
> > some buggy code that runs on all the cpus for more than the failover
> > timeout and then recovers; we would hit the bug in your patch above in
> > that case too.
> >
> > The 2 attached patches fix the issues for me on linux. Note that it
> > only works on linux right now and it only works with 2 nodes. It
> > probably also works for ESX/windows, but I need to reconfig some
> > timers.
> >
> > Apply ceph-iscsi-config-explicit-standby.patch to ceph-iscsi-config
> > and tcmu-runner-use-explicit.patch to tcmu-runner.
>
> Mike, thank you for the patches; they seem to work. There is an issue,
> but not related to data corruption: if the second path (gateway) is not
> available and I restart tcmu-runner on the first gateway, all subsequent
> i/o hangs for a long time because tcmu-runner is in the UNLOCKED state
> and the initiator doesn't resend the explicit ALUA activation request
> for a long while (190s).

Yeah, I should have a fix for that. We are returning the wrong error code
for explicit alua.
I needed to change it to a value that indicates we are in a state where we
do not have the lock (we are in alua standby), so the initiator does not
keep retrying until the scsi command's max_retries check fires in the
linux scsi layer. Jason suggested how to properly support vmware/windows
and more than 2 nodes. That fix will let me properly track lock states and
return the proper error codes for this error. I am hoping to have rough
code done tomorrow.

> Can you please also clarify how explicit ALUA (with these patches
> applied) is immune to a situation when there are some stale requests
> sitting in kernel queues by the moment tcmu-runner handles
> tcmu_explicit_transition() --> tcmu_acquire_dev_lock(). Does it mean
> that all requests are strictly ordered and the initiator will never send
> us read/write requests until we complete that explicit ALUA activation
> request?

Basically yes. Here is some extra info, plus what I wrote on github for
people that do not like GH:

- There is only one cmd submitting thread per iscsi session, so commands
are put in the tcmu queue in order, and tcmu-runner has only the one
thread per device that initially checks the commands and decides if we
need to do a failover or dispatch to the handler.

For your test it would work like this:

1. STPG successfully executed on node1.

2. WRITEs sent and get stuck on node1.

3. Failover to node2. WRITEs execute ok on this node.

4. If the WRITEs are unjammed at this time, they are just failed, because
we will hit the blacklist checks or unlocked checks.

5. If node2 were to fail while the commands were still stuck, then node1's
iscsi session would normally have dropped, and lio would not be allowing
new logins due to the stuck WRITEs on node1. (This is when you commonly
see the ABORT messages and stuck logins that are reported on the list
every once in a while.)
If the initiator did not escalate to session-level recovery, then before
doing new IO the initiator would send a STPG, and that would be stuck
behind the stuck WRITEs from step 2. Before we can dequeue the STPG in
runner, we have to wait for the stuck WRITEs. Note that the runner
STPG/lock code will also wait for commands that have been sent to the
handler module or are stuck in a runner thread before starting the lock
acquire call, so if a WRITE got stuck there we will be ok.

6. Once the WRITEs unjam and are failed, the STPG is executed. If the
STPG is successful, that is reported to the initiator and it will start
sending IO.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com