Re: iSCSI active/active stale io guard

--------------------------------------------------
From: "Gregory Farnum" <gfarnum@xxxxxxxxxx>
Sent: Tuesday, April 03, 2018 2:34 AM
To: "Maged Mokhtar" <mmokhtar@xxxxxxxxxxx>
Cc: "David Disseldorp" <ddiss@xxxxxxx>; "ceph-devel" <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: iSCSI active/active stale io guard

On Mon, Apr 2, 2018 at 1:22 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
On 2018-04-02 20:00, Gregory Farnum wrote:

On Fri, Mar 23, 2018 at 8:02 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
On 2018-03-23 15:22, David Disseldorp wrote:

Hi Maged,

On Mon, 19 Mar 2018 01:43:38 +0200, Maged Mokhtar wrote:

2) Guarded OSD Write Operations
We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME..
ops, or support hints with the existing versions, to pass a source operation
time. A configurable timeout on the OSD server (osd_stale_op_timeout?) will
be used to reject stale write operations; a suggested default value is 10 sec.
Operations that are falsely rejected, i.e. not actually stale but merely late,
will fail but will be retried by the client. Any operation time received from
the future (greater than the NTP max skew) should be rejected. These new
operations are generic enough that they could also be used outside of iSCSI.
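
To make the intent concrete, here is a minimal sketch of the check the OSD
side would run; osd_stale_op_timeout and the skew limit are placeholders from
this proposal, not existing Ceph code:

    import time

    # Illustrative guard for the proposed op: reject a write whose source
    # operation time is either too old (stale) or too far in the future
    # (beyond the allowed clock skew).
    OSD_STALE_OP_TIMEOUT = 10.0   # seconds, the suggested default above
    MAX_CLOCK_SKEW = 0.05         # seconds, stands in for the NTP max skew

    def guard_write(source_op_time, now=None):
        """Return True if the guarded write should be accepted."""
        if now is None:
            now = time.time()
        age = now - source_op_time
        if age > OSD_STALE_OP_TIMEOUT:
            return False          # stale: newer writes may already have landed
        if age < -MAX_CLOCK_SKEW:
            return False          # timestamp from the future: clock problem
        return True

A write rejected this way but not actually stale simply fails back to the
client and is retried, as described above.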

I think a RADOS class function that offers request expiration on the OSD
would be helpful. However, aside from concerns around the client time
synchronisation dependence, I'm a little unsure how this should be
handled on the RBD client / iSCSI gw side. Prepending the expiry
operation to the OSD request prior to a write op would only catch stale
requests while they are queued at the OSD; the subsequent write operation
could still be handled well after expiry. Ideally the expiration check
would be performed after the write, with rollback occurring on expiry.

Cheers, David


Hi David,

The iSCSI gateway will record the time at which the initiator built the iSCSI
CDB header at the start of the write operation, and will propagate this time
down to krbd/tcmu-librbd, which in turn will send it with all the OSD requests
making up this write. The method outlined uses TCP timestamps (RFC 7323) plus
a simple scheme to establish a time sync between the client initiator and the
OSD server; it does not depend on gateway delays and does not require NTP
running on the client.

If the write request arrives at the OSD within the allowed time, for example
within 10 sec, the new guard condition will let it proceed. This is fine even
if there is high queue delay/commit latency at the OSD, since our primary
concern is the edge condition where a stale write arrives at the OSD after the
initiator has already issued newer writes, so that stale data could be written
on top of newer data. By making sure the initiator is configured to take
longer than 10 sec before it aborts the task and retries it on a different
path, we are sure we will not have the case of old data over-writing new data.
I don't think timeouts are a good path forward for this. Among other
things, I'm concerned about how they'd interact with OSD recovery and
peering. Not to mention, what happens when an OSD and a client
disagree about whether an op was discarded due to timeout?

In the iSCSI case, if a 10-second wait period is acceptable to begin
with then it seems much simpler and less complicated for the
"failover" initiator to blacklist the failed one and force all the
OSDs to assimilate that blacklist before processing ops?
-Greg



Hi Greg

You are right, 10 sec is probably not a practical value once OSD failover is
taken into account; I think 25-30 sec would be more reasonable. This OSD
failure-detection window can be tuned via osd_heartbeat_grace +
osd_heartbeat_interval.
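
For illustration, a ceph.conf fragment along these lines (the values are only
placeholders; the stock defaults are, from memory, 6 and 20 seconds):

    [osd]
    # how often an OSD pings its peers
    osd heartbeat interval = 6
    # how long a peer may go unanswered before it is reported down
    osd heartbeat grace = 20
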
The real reason for the proposed solution is to handle cases where a path
failover occurs yet the original target node is not dead and could have
inflight io stuck anywhere in its stack, due to congestion or flaky network
connections, as well as the (more common) OSD-down case. In some extreme
conditions such cases can lead to stale io overwriting newer io after a path
failover; these cases were very well described by Mike in:
https://www.spinics.net/lists/ceph-users/msg43402.html

The reason we do not perform initiator-side blacklisting is to support
non-Linux clients such as VMware ESX and Windows. It would be possible to
write custom client code on such platforms, but it is much simpler and more
generic to do it via timeouts, even if that is not the most elegant solution.
iSCSI MPIO does not provide any means for a target to detect whether a command
it received is a retry from another, failed path. iSCSI MCS (Multiple
Connections per Session) does support this, but it is supported neither by the
Linux iSCSI target nor by the VMware initiator.

Ah sorry, I got my language wrong. I meant blacklisting the iSCSI
target; the initiator doesn't exist as far as Ceph is concerned. :)

With osd_stale_op_timeout being the maximum allowed inflight time between
client and OSD, the timeouts need to be ordered as follows (see the sketch
below):

  osd_heartbeat_grace + osd_heartbeat_interval < osd_stale_op_timeout
  osd_stale_op_timeout < replacement_timeout   (Linux open-iscsi initiator)
  osd_stale_op_timeout < RecoveryTimeout       (VMware/ESX initiator)
  osd_stale_op_timeout < LinkDownTime          (Windows initiator)
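
As an illustration only, a small self-check of that ordering; the names mirror
the options above, nothing here is an actual Ceph or initiator API:

    # Sketch: sanity-check the timeout ordering described above.
    # All values are in seconds and purely illustrative.
    def check_timeout_ordering(osd_heartbeat_grace,
                               osd_heartbeat_interval,
                               osd_stale_op_timeout,
                               initiator_path_timeout):
        """initiator_path_timeout stands for replacement_timeout,
        RecoveryTimeout or LinkDownTime, whichever initiator is in use."""
        return (osd_heartbeat_grace + osd_heartbeat_interval
                < osd_stale_op_timeout
                < initiator_path_timeout)

    # Example: 20 + 6 = 26 < 30 < 40, so the ordering holds.
    assert check_timeout_ordering(20, 6, 30, 40)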

If an OSD rejects a valid io that, due to high latency, was received after
osd_stale_op_timeout, this is a false-positive rejection; the command will
fail back to the initiator, which will retry it (depending on the client this
could happen at the upper SCSI or multipath layers). A stale io will not be
retried.

Anyway, I don't mean the timeouts are a problem because peering takes
time. I mean defining and understanding how to handle them during
transitions is hard, verging on impossible.

First of all, once an op is submitted to the OSD, you can't really
undo it. There is no "max allowed inflight time"; people go to a great
deal of trouble trying to simulate having that property, or write code
that pretends one exists and then just ends the world if the network
somehow exceeds that time (real-world networks exceed any given time
you want to propose. They suck. It's impossible to believe how long
packets can spend in transit.). This is a fundamental property of
switched-network systems: maybe our prior average latencies were 100
microseconds, but something just happened and this one packet took 10
seconds, or a central router died and we're now suddenly trying to
route 10 GB/s of traffic through 3 GB/s of capacity around in a ring.
So the client needs to be able to deal with the OSD completing an op
that the client thinks it shouldn't have — that means a simple timeout
is just not sufficient to assume the op is no longer going to happen.
I think this already scuttles your plan.
(You can sort of work around this one, if you're very ambitious. The
server could make its best guess at the "real" timeout and reject ops
in a best-effort way, but the client wouldn't be able to start its
"definitely-canceled" timer until it got a *response* from the OSD
indicating the message came in. Given that we don't control the real
client's timeouts directly, I think that basically means this is
already impossible.)

Next, imagine a 10-second time-limited op came in with an acting set
of OSD 1, OSD 2, and it was replicated from OSD 1 to OSD 2. But then
OSD 1 (the primary) died. OSD 3 gets added to the PG after 15 seconds
(either it was down and came up, or it just got picked as the next
replica and is empty; doesn't matter). OSD 2 has *no idea* whether the
operation was ack'ed to the client. What is the correct behavior?
-Greg

Hi Greg,

I think I need to clarify what this solution solves. Its purpose is to
deal with a very specific corner case which can occur if you set up iSCSI
gateways with Ceph in a scale-out active/active configuration where all
gateways serve the same iSCSI disk/LUN. This edge case was discussed
earlier on the mailing list, as per the link in my previous post.
I will summarize it here as follows:
1) The client sends a write op to iSCSI gateway A but does not get a reply.
Gateway A is not responding but is not down, and has stale io inflight.
2) The client times out and resends the op to gateway B, which returns quickly.
3) The client sends a new write op to gateway C; the new op overlaps the
earlier write's sector/extent, and the new data is written quickly.
4) Gateway A becomes active again and its old data overwrites the new data.
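
Purely to illustrate the race, here is a toy timeline of those four steps with
the gateways and OSD reduced to timestamps and the proposed
osd_stale_op_timeout check (none of this is Ceph code):

    # Times are in seconds; the OSD-side guard rejects any write whose
    # initiator-side timestamp is older than osd_stale_op_timeout on arrival.
    OSD_STALE_OP_TIMEOUT = 10.0

    def osd_accepts(initiator_ts, arrival_ts):
        return arrival_ts - initiator_ts <= OSD_STALE_OP_TIMEOUT

    # 1) write issued via gateway A at t=0, gets stuck in A's stack
    # 2) initiator gives up on A and retries via gateway B at t=30
    # 3) overlapping new write via gateway C at t=31
    # 4) A's stale copy finally reaches the OSD at t=45
    print(osd_accepts(initiator_ts=30.0, arrival_ts=30.5))  # True: retry via B lands
    print(osd_accepts(initiator_ts=31.0, arrival_ts=31.5))  # True: new write via C lands
    print(osd_accepts(initiator_ts=0.0, arrival_ts=45.0))   # False: stale write from A rejected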

The proposed solution is specific to this case: it stops old data overwriting
new data. It is not meant to be a generic method for aborting inflight OSD or
iSCSI operations.

It relies on the fact that in step 2) the client initiator will not fail the
op(s) over to gateway B before closing the connection on A and trying to
re-establish/recover it on A. This recovery time is configurable on
Linux/ESX/Windows. We want to make sure that any inflight io from A reaches
the acting primary OSD before the client fails the io over to B, i.e. within
the recovery time; this guarantees that no "old data over-writing new data"
will happen, even if there is congestion at the primary or secondary OSDs and
even if the acting set later changes. The time guard check is done by the
primary OSD upon reception of the guarded write op/class method; it is not
used on replicas.

Another feature of the solution is that it does not rely on the iSCSI gateway
measuring timeouts itself, since the gateway could be in a bad/unknown shape;
instead we record the TCP timestamps of the packets sent by the client.

Blacklisting/fencing the gateway would make for a cleaner solution, I agree,
but the question is who will detect the problem and blacklist the gateway?
With active/active MPIO, gateway B has no idea whether a write op is a retry
of an op that failed on gateway A; MPIO does not provide this. In
active/passive, the standby node can easily deduce the failover condition and
blacklist the first node, but this is not the case in active/active. Of course
the client initiator does know when gateway A times out; in the case of Linux
it is quite easy to add code to issue the blacklisting action, but for
ESX/Windows clients such custom initiator code would not be as easy to develop
and deploy.
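
For reference, the Ceph side of the fencing being discussed is just something
along these lines (address and expiry are purely illustrative; the open
question above is who issues it and when):

    ceph osd blacklist add 192.168.1.20:0/0 3600   # fence the gateway's client address for an hour
    ceph osd blacklist ls                          # confirm the entry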

As a side note, though I am not sure it directly relates to the solution
above: yes, it is not possible to predict how long packets will take in
transit within a network, but that does not prevent us from imposing allowed
timeout limits. We can add timeouts at the TCP level or at higher protocols
such as iSCSI if the application requires it.

Cheers Maged


