Re: iSCSI active/active stale io guard

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Fri, 23 Mar 2018 17:02:26 +0200

On 2018-03-23 15:22, David Disseldorp wrote:

Hi Maged,

On Mon, 19 Mar 2018 01:43:38 +0200, Maged Mokhtar wrote:

2) Guarded OSD Write Operations
We need to add new versions of the
CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME.. ops, or support hints with the
existing versions, to pass a source operation time. A configurable
timeout on the OSD server (osd_stale_op_timeout?) will be used to reject
stale write operations, with a suggested default value of 10 sec. Writes
that are rejected without actually being stale io (false rejections) will
fail but will be retried by the client. Any operation time received in
the future (greater than the ntp max skew) should be rejected. These new
operations are generic enough and may be used outside of iSCSI.

I think a RADOS class function that offers request expiration on the OSD
would be helpful. However, aside from concerns around the client time
synchronisation dependence, I'm a little unsure how this should be
handled on the RBD client / iSCSI gw side. Prepending the expiry
operation to the OSD request prior to a write op would only catch stale
requests while they are queued at the OSD; the subsequent write operation
could still be handled well after expiry. Ideally the expiration check
would be performed after the write, with rollback occurring on expiry.

Cheers, David

Hi David,

The iSCSI gateway will detect the time at which the initiator built the 
iSCSI CDB header packet at the start of the write operation. It will then 
propagate this time down to krbd/tcmu-librbd, which in turn will send it 
with all OSD requests making up this write request. The method outlined 
uses TCP timestamps (RFC 7323) plus a simple method to create a time sync 
between the client initiator and the OSD server. This sync is not 
dependent on gateway delays and does not require ntp running on the 
client.

If the write request arrives at the OSD within the allowed time, for 
example within 10 sec, it will be allowed to proceed by this new guard 
condition. This is OK even if there is high queue delay/commit latency 
at the OSD, since our primary concern is solving the edge condition 
where a stale write arrives at the OSD after newer writes were 
subsequently issued by the initiator, so that the stale write could 
overwrite newer data. By making sure the initiator is configured to take 
longer than 10 sec to abort the task and retry it on a different path, 
we are sure we will not have the case of old data over-writing new data.
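
As a rough illustration only, the guard condition described above could be sketched as below. This is not actual OSD code: the names GuardConfig, check_guarded_write and initiator_timeout_is_safe are hypothetical, the 0.5 sec skew allowance is an assumed value, and osd_stale_op_timeout is only the option name suggested earlier in this thread. Times are plain seconds on an already synchronised clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardConfig:
    # Hypothetical counterpart of the proposed osd_stale_op_timeout (seconds).
    stale_op_timeout: float = 10.0
    # Assumed allowance for clock skew between initiator and OSD (seconds).
    max_clock_skew: float = 0.5


def check_guarded_write(src_op_time: float, osd_now: float,
                        cfg: GuardConfig = GuardConfig()) -> str:
    """Decide whether a write stamped with src_op_time may proceed.

    The check is made on arrival at the OSD; queue or commit latency
    after this point does not affect the decision.
    """
    age = osd_now - src_op_time
    if age < -cfg.max_clock_skew:
        # Operation time lies further in the future than the skew allows.
        return "reject_future"
    if age > cfg.stale_op_timeout:
        # Older than the allowed window: treat as a stale write.
        return "reject_stale"
    return "accept"


def initiator_timeout_is_safe(abort_timeout: float,
                              cfg: GuardConfig = GuardConfig()) -> bool:
    """The initiator must wait longer than the stale window before it
    aborts the task and retries on a different path."""
    return abort_timeout > cfg.stale_op_timeout


if __name__ == "__main__":
    print(check_guarded_write(100.0, 105.0))   # arrives 5 sec after issue
    print(check_guarded_write(100.0, 111.0))   # arrives 11 sec after issue
    print(initiator_timeout_is_safe(15.0))
```

With a 10 sec window and an initiator abort timeout of, say, 15 sec, any write still in flight when the initiator retries on another path is already older than the window and would be rejected on arrival.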

/Maged