Re: iSCSI active/active stale io guard

On Mon, Apr 2, 2018 at 1:22 PM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
> On 2018-04-02 20:00, Gregory Farnum wrote:
>
>> On Fri, Mar 23, 2018 at 8:02 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>>
>> On 2018-03-23 15:22, David Disseldorp wrote:
>>
>> Hi Maged,
>>
>> On Mon, 19 Mar 2018 01:43:38 +0200, Maged Mokhtar wrote:
>>
>> 2) Guarded OSD Write Operations
>> We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME..
>> ops, or support hints with the existing versions, to pass a source operation
>> time. A configurable timeout on the OSD server (osd_stale_op_timeout ?)
>> will be used to reject stale write operations, with a suggested default
>> value of 10 sec. False-positive rejections of io that is not actually stale
>> will fail but will be retried by the client. Any operation time received
>> from the future (greater than the ntp max skew) should be rejected. These
>> new operations are generic enough that they may be used outside of iSCSI.
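
To make the guard concrete, a minimal sketch of the OSD-side check might
look like the following (all names here are hypothetical; the proposal
does not specify an implementation):

    #include <chrono>

    // Hypothetical guard evaluated when the OSD dequeues a guarded write.
    // op_stamp is the operation time the client attached to the request;
    // stale_op_timeout corresponds to the proposed osd_stale_op_timeout,
    // max_skew to the allowed client/OSD clock skew.
    enum class GuardResult { Accept, RejectStale, RejectFuture };

    GuardResult check_op_time(std::chrono::system_clock::time_point op_stamp,
                              std::chrono::seconds stale_op_timeout, // e.g. 10s
                              std::chrono::seconds max_skew)
    {
      auto now = std::chrono::system_clock::now();
      if (op_stamp > now + max_skew)
        return GuardResult::RejectFuture; // timestamp from the future
      if (now - op_stamp > stale_op_timeout)
        return GuardResult::RejectStale;  // op spent too long in flight
      return GuardResult::Accept;         // fresh enough, let the write run
    }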
>>
>> I think a RADOS class function that offers request expiration on the OSD
>> would be helpful. However, aside from concerns around the client time
>> synchronisation dependence, I'm a little unsure how this should be
>> handled on the RBD client / iSCSI gw side. Prepending the expiry
>> operation to the OSD request prior to a write op would only catch stale
>> requests while being queued at the OSD, the subsequent write operation
>> could still be handled well after expiry. Ideally the expiration check
>> would be performed after the write, with rollback occurring on expiry.
>>
>> Cheers, David
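
As a rough illustration of such a class function (sketched against
Luminous-era cls conventions; the class name, method name, and input
encoding are all made up), note that returning an error from a cls method
aborts the whole transaction, which would give exactly the
check-after-write-with-rollback behaviour David asks for:

    #include "objclass/objclass.h"
    #include "common/Clock.h"    // ceph_clock_now()

    CLS_VER(1, 0)
    CLS_NAME(stale_guard)        // hypothetical object class

    cls_handle_t h_class;
    cls_method_handle_t h_expiring_write;

    // Input: client-supplied expiry stamp, offset, payload. The write is
    // queued first; if the op is being handled after the expiry, the error
    // return discards the queued write along with the rest of the txn.
    static int expiring_write(cls_method_context_t hctx, bufferlist *in,
                              bufferlist *out)
    {
      using ceph::decode;      // plain ::decode on older releases
      utime_t expiry;
      uint64_t off;
      bufferlist data;
      try {
        auto it = in->cbegin();  // in->begin() on older Ceph releases
        decode(expiry, it);
        decode(off, it);
        decode(data, it);
      } catch (const ceph::buffer::error &) {
        return -EINVAL;
      }

      int r = cls_cxx_write(hctx, off, data.length(), &data);
      if (r < 0)
        return r;

      if (ceph_clock_now() > expiry)
        return -ETIMEDOUT;  // too late: abort the txn, write is rolled back

      return 0;
    }

    CLS_INIT(stale_guard)
    {
      cls_register("stale_guard", &h_class);
      cls_register_cxx_method(h_class, "expiring_write",
                              CLS_METHOD_RD | CLS_METHOD_WR,
                              expiring_write, &h_expiring_write);
    }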
>
>
> Hi David,
>
> The iSCSI gateway will record the time the initiator built the iSCSI cdb
> header packet at the start of the write operation, then propagate this time
> down to krbd/tcmu-librbd, which in turn will send it with all the OSD
> requests making up this write request. The method outlined uses TCP
> timestamps (RFC 7323) plus a simple method to create a time sync between the
> client initiator and the OSD server that is not dependent on gateway delays
> and does not require ntp running on the client.
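
The sync step isn't spelled out above, but the usual request/response
offset estimate (the same arithmetic NTP uses, shown here only as a
sketch) bounds the error by half the round-trip time:

    #include <cstdint>

    // Four-timestamp clock-offset estimate.
    // t1: client send, t2: server receive, t3: server send,
    // t4: client receive (all in the same tick units, e.g. ms).
    // Returns (server clock - client clock); under the symmetric-delay
    // assumption the error is at most half the round-trip time.
    int64_t estimate_offset(int64_t t1, int64_t t2, int64_t t3, int64_t t4)
    {
      return ((t2 - t1) + (t3 - t4)) / 2;
    }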
>
> If the write request arrives at the OSD within the allowed time, for example
> within 10 sec, the new guard condition will allow it to proceed. This is OK
> even if there is high queue delay/commit latency at the OSD, since our
> primary concern is solving the edge condition where a stale write arrives at
> the OSD after newer writes were subsequently issued by the initiator, so
> that potentially stale writes could land on top of newer data. By making
> sure the initiator is configured to take longer than 10 sec to abort the
> task and retry it on a different path, we are sure we will not have the case
> of old data over-writing new data.
>
> I don't think timeouts are a good path forward for this. Among other
> things, I'm concerned about how they'd interact with OSD recovery and
> peering. Not to mention, what happens when an OSD and a client
> disagree about whether an op was discarded due to a timeout?
>
> In the iSCSI case, if a 10-second wait period is acceptable to begin
> with, then it seems much simpler for the "failover" initiator to
> blacklist the failed one and force all the OSDs to assimilate that
> blacklist before processing ops?
> -Greg
>
> Hi Greg,
>
> You are right, the 10 sec is probably not a practical value to account for
> OSD failover; I think 25-30 sec would be more reasonable. This OSD failover
> timeout can be adjusted via osd_heartbeat_grace + osd_heartbeat_interval.
> The real reason for the proposed solution is to handle cases where a path
> failover occurs yet the original target node is not dead and could have
> inflight io stuck anywhere in its stack due to congestion or flaky network
> connections, as well as the (more common) OSD down case. In some extreme
> conditions, such cases may lead to stale io overwriting newer io after a
> path failover; these cases were very well described by Mike in:
> https://www.spinics.net/lists/ceph-users/msg43402.html
>
> The reason we do not perform initiator-side blacklisting is to support
> non-Linux clients such as VMware ESX and Windows. It could be possible to
> write custom client code on such platforms, but it will be much simpler and
> more generic to do it via timeouts, although that may not be the most
> elegant solution.
> iSCSI MPIO does not provide any means for a target to detect whether a
> command it received was a retry from another, failed path. iSCSI MCS
> (Multiple Connections per Session) supports this, but it is supported
> neither by the Linux iSCSI target nor by the VMware initiator.

Ah sorry, I got my language wrong. I meant blacklisting the iSCSI
target; the initiator doesn't exist as far as Ceph is concerned. :)
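
For reference, blacklisting a gateway's RADOS client address might look
like this (the address and expiry here are illustrative):

    # blacklist the failed gateway's client address for an hour
    ceph osd blacklist add 192.168.0.10:0/3710147553 3600
    # confirm the OSD map carries the entry
    ceph osd blacklist ls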

> With osd_stale_op_timeout = the max allowed inflight time between client and
> OSD, the timeouts need to satisfy:
>
>   osd_heartbeat_grace + osd_heartbeat_interval < osd_stale_op_timeout
>   osd_stale_op_timeout < replacement_timeout  (Linux open-iscsi initiator)
>   osd_stale_op_timeout < RecoveryTimeout      (VMware/ESX initiator)
>   osd_stale_op_timeout < LinkDownTime         (Windows initiator)
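
As one illustrative set of values satisfying those inequalities (note
that osd_stale_op_timeout is the proposed option, not an existing one;
the heartbeat values shown are the current defaults):

    # ceph.conf
    [osd]
    osd_heartbeat_interval = 6
    osd_heartbeat_grace    = 20    # 6 + 20 = 26 < 30
    osd_stale_op_timeout   = 30    # proposed option

    # /etc/iscsi/iscsid.conf (Linux open-iscsi initiator)
    node.session.timeo.replacement_timeout = 40    # 30 < 40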
>
> If an OSD rejects a valid io that, due to high latency, was received after
> osd_stale_op_timeout, this would be a false-positive rejection; the command
> will fail back to the initiator, which will retry it (depending on the
> client, this could happen at the upper scsi or multipath layers). A stale
> io will not be retried.

Anyway, I don't mean the timeouts are a problem because peering takes
time. I mean defining and understanding how to handle them during
transitions is hard, verging on impossible.

First of all, once an op is submitted to the OSD, you can't really
undo it. There is no "max allowed inflight time"; people go to a great
deal of trouble trying to simulate that property, or write code that
pretends one exists and then just ends the world if the network somehow
exceeds that time. (Real-world networks exceed any given time you want
to propose. They suck. It's hard to believe how long packets can spend
in transit.) This is a fundamental property of switched-network
systems: maybe our prior average latencies were 100 microseconds, but
something just happened and this one packet took 10 seconds, or a
central router died and we're now suddenly trying to route 10 GB/s of
traffic through 3 GB/s of capacity around in a ring. So the client
needs to be able to deal with the OSD completing an op that the client
thinks it shouldn't have, which means a simple timeout is just not
sufficient to assume the op is no longer going to happen. I think this
already scuttles your plan.
(You can sort of work around this one, if you're very ambitious: the
server could make a best guess about the "real" timeout and reject the
op in a best-effort way, but the client wouldn't be able to start its
"definitely-canceled" timer until it got a *response* from the OSD
indicating the message came in. Given that we don't control the real
client's timeouts directly, I think that basically means this is
already impossible.)
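
In other words, the client-side state machine would have to look
something like this sketch (names invented): a local timer expiring
tells the client nothing, and only an explicit OSD reply can move an op
out of the ambiguous state:

    #include <chrono>

    enum class OpFate { Unknown, Applied, DefinitelyCanceled };

    struct InflightOp {
      std::chrono::steady_clock::time_point sent_at;
      OpFate fate = OpFate::Unknown;

      // Only an explicit reply from the OSD resolves the ambiguity.
      void on_osd_reply(int result) {
        fate = (result == 0) ? OpFate::Applied : OpFate::DefinitelyCanceled;
      }

      // A timeout alone never makes this true; the op could still land.
      bool safe_to_retry_on_other_path() const {
        return fate == OpFate::DefinitelyCanceled;
      }
    };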

Next, imagine a 10-second time-limited op came in with an acting set
of OSD 1, OSD 2, and it was replicated from OSD 1 to OSD 2. But then
OSD 1 (the primary) died. OSD 3 gets added to the PG after 15 seconds
(either it was down and came up, or it just got picked as the next
replica and is empty; doesn't matter). OSD 2 has *no idea* whether the
operation was ack'ed to the client. What is the correct behavior?
-Greg