Re: iSCSI active/active stale io guard

On Fri, Mar 23, 2018 at 8:02 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
> On 2018-03-23 15:22, David Disseldorp wrote:
>
>> Hi Maged,
>>
>> On Mon, 19 Mar 2018 01:43:38 +0200, Maged Mokhtar wrote:
>>
>>> 2) Guarded OSD Write Operations
>>> We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME..
>>> ops, or support hints with the existing versions, that carry a source
>>> operation time. A configurable timeout on the OSD server
>>> (osd_stale_op_timeout?) would be used to reject stale write operations,
>>> with a suggested default of 10 sec. Rejections that are not due to
>>> genuinely stale IO will fail the operation, but the client will retry it.
>>> Any operation time that lies in the future (by more than the ntp max skew)
>>> should be rejected. These new operations are generic enough that they may
>>> be used outside of iSCSI.
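A minimal sketch of the guard described above, for illustration only: the
10 sec default comes from the proposal, while the skew bound and all names
are assumed here rather than taken from any existing Ceph code.

import time

# Values for illustration: the 10 sec default is from the proposal above,
# the skew bound is an assumed placeholder.
OSD_STALE_OP_TIMEOUT = 10.0   # seconds (proposed default)
MAX_NTP_SKEW = 0.05           # seconds (illustrative)


def guard_write_op(op_time, now=None):
    """Return True if the guarded write may proceed, False if it is rejected.

    op_time is the source operation time carried by the new/hinted write op,
    already expressed on the OSD's clock by whatever sync scheme is in use.
    """
    if now is None:
        now = time.time()
    if op_time > now + MAX_NTP_SKEW:
        # Timestamp lies in the future beyond the allowed skew: reject.
        return False
    if now - op_time > OSD_STALE_OP_TIMEOUT:
        # Operation is stale; the client is expected to retry.
        return False
    return True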
>>
>>
>> I think a RADOS class function that offers request expiration on the OSD
>> would be helpful. However, aside from concerns around the client time
>> synchronisation dependence, I'm a little unsure how this should be
>> handled on the RBD client / iSCSI gw side. Prepending the expiry
>> operation to the OSD request, ahead of the write op, would only catch
>> requests that went stale while queued at the OSD; the subsequent write
>> operation could still be handled well after expiry. Ideally the expiration
>> check would be performed after the write, with rollback occurring on
>> expiry.
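The ordering concern can be made concrete with a toy timeline; all values
below are made up and only illustrate why a prepended check alone leaves a
window open.

# Toy timeline: the prepended guard passes while the request is queued,
# yet the write sub-op can still be applied well after the deadline.
OSD_STALE_OP_TIMEOUT = 10.0        # seconds, from the proposal above
enqueue_time = 0.0
deadline = enqueue_time + OSD_STALE_OP_TIMEOUT

t_guard_checked = 9.0              # guard evaluated here: still inside the window
t_write_applied = 30.0             # write applied much later (slow disk, recovery, ...)

assert t_guard_checked <= deadline   # guard passes
assert t_write_applied > deadline    # ...yet the data lands after expiry
# Re-checking after the write, with rollback on expiry, would close this gap.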
>>
>> Cheers, David
>
>
> Hi David,
>
> The iSCSI gateway will record the time at which the initiator built the
> iSCSI CDB header at the start of the write operation, and will propagate
> this time down to krbd/tcmu (librbd), which in turn will send it with every
> OSD request making up the write. The method outlined uses TCP timestamps
> (RFC 7323) plus a simple scheme to establish a time sync between the client
> initiator and the OSD server that does not depend on gateway delays and does
> not require ntp running on the client.
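The exact TCP-timestamp scheme is not spelled out in the thread; as one
possible illustration, the classic round-trip offset estimate could be used
to express the initiator's CDB build time on the OSD's clock. Everything
below is an assumption for illustration, not the method Maged describes.

def estimate_offset(t_client_send, t_server_recv, t_server_send, t_client_recv):
    """Estimate (server_clock - client_clock), assuming a symmetric path delay."""
    return ((t_server_recv - t_client_send) + (t_server_send - t_client_recv)) / 2.0


def to_osd_clock(initiator_time, offset):
    # Rewrite the initiator's CDB build time onto the OSD's clock before it
    # is attached to each OSD request issued for this write.
    return initiator_time + offset


# Example with made-up timestamps: the server clock is ~30 sec ahead.
offset = estimate_offset(100.0, 130.2, 130.3, 100.5)   # -> 30.0
osd_time = to_osd_clock(100.0, offset)                 # -> 130.0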
>
> If the write request arrives at the OSD within the allowed time, for example
> within 10 sec, the new guard condition will allow it to proceed. This is fine
> even if there is high queue delay or commit latency at the OSD, since our
> primary concern is the edge case where a stale write arrives at the OSD after
> the initiator has already issued newer writes, so the stale write could
> overwrite newer data. By making sure the initiator is configured to wait
> longer than 10 sec before aborting the task and retrying it on a different
> path, we guarantee that old data will never overwrite new data.
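The argument rests on a simple relationship between the initiator's
abort/retry timeout and the OSD-side window; a small sanity check of that
invariant, with all values illustrative, might look like this:

# The initiator must not abort and retry on another path before the OSD's
# stale-op window (plus any clock-sync error) has expired; otherwise a
# write left in flight on the old path could still be accepted.
OSD_STALE_OP_TIMEOUT = 10.0       # seconds, proposed OSD-side window
MAX_CLOCK_SYNC_ERROR = 0.5        # seconds, assumed bound on sync error
INITIATOR_ABORT_TIMEOUT = 15.0    # seconds before the initiator aborts/retries


def stale_write_cannot_win(abort_timeout, osd_window, sync_error):
    # By the time the initiator retries on another path, any write still in
    # flight on the old path is already past the OSD's rejection threshold.
    return abort_timeout > osd_window + sync_error


assert stale_write_cannot_win(INITIATOR_ABORT_TIMEOUT,
                              OSD_STALE_OP_TIMEOUT,
                              MAX_CLOCK_SYNC_ERROR)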

I don't think timeouts are a good path forward for this. Among other
things, I'm concerned about how they'd interact with OSD recovery and
peering. Not to mention, what happens when an OSD and a client
disagree about whether an op was discarded due to timeout?

In the iSCSI case, if a 10-second wait period is acceptable to begin
with, then it seems much simpler for the "failover" initiator to
blacklist the failed one and force all the OSDs to assimilate that
blacklist before processing ops?
-Greg
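For reference, the blacklist-based approach Greg suggests can be driven from
the surviving gateway with the existing "osd blacklist add" mon command; a
rough sketch using the Python rados binding follows. The address, conffile
path and error handling are placeholders, and the command JSON may differ
between releases.

import json
import rados

FAILED_GW_ADDR = "192.168.0.10:0/123456"   # entity addr of the failed gateway's client

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    cmd = json.dumps({
        "prefix": "osd blacklist",
        "blacklistop": "add",
        "addr": FAILED_GW_ADDR,
    })
    ret, outbuf, outs = cluster.mon_command(cmd, b"")
    if ret != 0:
        raise RuntimeError("osd blacklist add failed: %s" % outs)
    # The OSDs must pick up the new OSD map containing this blacklist entry
    # before the failover path resumes I/O; that is the "assimilate" step
    # mentioned above.
finally:
    cluster.shutdown()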