iSCSI active/active stale io guard

Maged Mokhtar <mmokhtar@xxxxxxxxxxx> · Mon, 19 Mar 2018 01:43:38 +0200

The purpose of this write-up is to outline a method to handle stale 
writes that could occur when using iSCSI gateways with Ceph in an 
active/active setup. Though it is possible to use timers such as 
osd_request_timeout to expire stale io in most cases, in some edge cases 
the io could be congested in the target gateway either before hitting 
krbd/tcmu or within the network layer. Increasing the path failover timeout 
at the client initiator (such as 
replacement_timeout/TimeOutValue/LinkDownTime/DefaultTimeToWait) is not 
guaranteed to solve the edge cases where in-flight io could be stuck 
anywhere within the target stack.

The method outlined consists of 2 main ideas:
1) Calculate the source time at which a write operation was sent from the 
client, as clocked by the client
2) Add/Enhance OSD write ops to accept a source operation time
In effect we try to remove the gateway's role in determining timeouts or 
attempting to guard against stale ios itself. The setup requires time sync 
between the iSCSI gateway and the OSD server; there are no restrictions 
on the client initiator. Most gateway servers will be part of a Ceph cluster 
anyway, so the ntp requirement is already met in most cases.

1) Source operation time
The idea is to make use of TCP Timestamps (RFC 7323) sent from the 
initiator to deduce the time at which it stamped the Ethernet frame/packet 
containing the iSCSI header with the write opcode. TCP Timestamps are 
already used for round trip time estimates by the tcp stack, and there are 
several fields in the tcp_sock struct for this. We need to add a new 
field

struct tcp_sock {
	/* ... existing fields ... */
	u32 recvmsg_tsval;	/* client TSval at current/last read position */
};

to hold the client timestamp at the current/last read stream position on 
the socket. The read function tcp_recvmsg() in tcp.c needs to update the 
recvmsg_tsval value as it iterates through the socket buffers, reading 
the timestamp from each packet.
In iscsi_target.c we need to record this timestamp in 
iscsi_target_rx_opcode() since this is the stream position at the start 
of a new op.
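
For reference, the TSval the text refers to travels in the TCP Timestamps option (kind 8, length 10, followed by a 4-byte TSval and a 4-byte TSecr, per RFC 7323). The following is a minimal userspace sketch of pulling TSval out of a raw TCP options area. It is only an illustration of the wire format: the function name tcp_opts_find_tsval() and the enum names are hypothetical, and the kernel already parses this option itself, so this is not the proposed tcp_recvmsg() change.

```c
#include <stddef.h>
#include <stdint.h>

/* TCP option kinds/lengths used below (RFC 793, RFC 7323). */
enum {
	OPT_KIND_EOL       = 0,
	OPT_KIND_NOP       = 1,
	OPT_KIND_TIMESTAMP = 8,
	OPT_LEN_TIMESTAMP  = 10
};

/*
 * Hypothetical helper: scan a TCP options area for the Timestamps option.
 * Returns 1 and stores TSval (host byte order) in *tsval if found,
 * 0 if the option is absent or the options area is malformed.
 */
static int tcp_opts_find_tsval(const uint8_t *opts, size_t len, uint32_t *tsval)
{
	size_t i = 0;

	while (i < len) {
		uint8_t kind = opts[i];
		uint8_t olen;

		if (kind == OPT_KIND_EOL)
			break;
		if (kind == OPT_KIND_NOP) {	/* single-byte padding option */
			i++;
			continue;
		}
		if (i + 1 >= len)		/* no room for a length byte */
			break;
		olen = opts[i + 1];
		if (olen < 2 || i + olen > len)	/* malformed or truncated */
			break;
		if (kind == OPT_KIND_TIMESTAMP && olen == OPT_LEN_TIMESTAMP) {
			/* TSval is the first 4 bytes after kind/length, big endian */
			*tsval = ((uint32_t)opts[i + 2] << 24) |
				 ((uint32_t)opts[i + 3] << 16) |
				 ((uint32_t)opts[i + 4] << 8) |
				  (uint32_t)opts[i + 5];
			return 1;
		}
		i += olen;
	}
	return 0;
}
```

A typical options area from an established connection is two NOPs followed by the Timestamps option, which is the layout this helper expects to walk through.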

The source time at the client can be deduced from the tsval as follows:
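/* freq scales client tsval ticks to gateway time units */
source_time = tsval * freq + offset;

if (now < source_time) {
  /* min latency encountered: the op cannot have been sent later than
     now, so adjust/decrease offset */
  source_time = now;
  offset = now - tsval * freq;
}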

The freq/tick rate of the timestamping clock at the client can quickly 
be detected after a few seconds of io. Some observations using different 
client initiators:
Initiator  freq (jiffies)
Linux      1.0
Windows    0.25
ESX        2.5
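
The post does not say how the tick rate is detected. One simple possibility, shown here purely as an assumption, is to divide the gateway time elapsed between two observed segments by the number of TSval ticks elapsed between them. The struct ts_sample type, the function name and the use of microseconds (rather than jiffies) are all hypothetical choices for this sketch.

```c
#include <stdint.h>

/* Hypothetical sample: a client TSval and the gateway time it was seen at. */
struct ts_sample {
	uint32_t tsval;		/* TSval from the initiator */
	uint64_t gw_time_us;	/* gateway receive time, microseconds */
};

/*
 * Estimate the gateway time (in microseconds) per client TSval tick from
 * two samples taken some seconds apart. Returns 0.0 if the samples are
 * unusable. The 32-bit unsigned subtraction gives the right tick count
 * across a single wrap of tsval.
 */
static double estimate_tick_period_us(struct ts_sample first,
				      struct ts_sample last)
{
	uint32_t ticks = last.tsval - first.tsval;

	if (ticks == 0 || last.gw_time_us <= first.gw_time_us)
		return 0.0;
	return (double)(last.gw_time_us - first.gw_time_us) / (double)ticks;
}
```

Network latency differs between the two samples, so the estimate carries some jitter. Spacing the samples a few seconds apart keeps that jitter small relative to the elapsed time, and the result could then be snapped to the nearest plausible tick period.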

Once the freq is detected, we use a calculated offset to compute the 
client source operation time. Initially the computed time will equal the 
gateway time, but we keep lowering the offset each time we hit a min 
latency condition.
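
The offset tracking described above can be sketched as a small piece of per-connection state. This is only an illustration under stated assumptions: struct src_clock and src_clock_source_time() are hypothetical names, times are in microseconds, the tick period is assumed to have been detected already, and the 32-bit TSval is extended to 64 bits so that wraparound does not disturb the offset.

```c
#include <stdint.h>

/* Hypothetical per-connection state; zero-initialise, then set period_us. */
struct src_clock {
	int64_t  period_us;	/* detected gateway time per TSval tick */
	int64_t  offset_us;	/* current offset estimate */
	int64_t  ticks;		/* TSval extended to 64 bits */
	uint32_t last_tsval;	/* previous raw TSval, for wrap handling */
	int      initialized;
};

/*
 * Return the deduced client source time for a segment carrying tsval that
 * the gateway sees at now_us. The offset is lowered whenever the computed
 * source time would lie in the future (a new minimum latency was observed).
 */
static int64_t src_clock_source_time(struct src_clock *c, uint32_t tsval,
				     int64_t now_us)
{
	int64_t source_time;

	if (!c->initialized) {
		/* First sample: source time initially equals gateway time. */
		c->ticks = tsval;
		c->last_tsval = tsval;
		c->offset_us = now_us - c->ticks * c->period_us;
		c->initialized = 1;
		return now_us;
	}

	/* Signed 32-bit delta tolerates wraparound of the raw TSval. */
	c->ticks += (int32_t)(tsval - c->last_tsval);
	c->last_tsval = tsval;

	source_time = c->ticks * c->period_us + c->offset_us;
	if (now_us < source_time) {
		/* Min latency encountered: adjust/decrease the offset. */
		source_time = now_us;
		c->offset_us = now_us - c->ticks * c->period_us;
	}
	return source_time;
}
```

With a 1000 us tick period, a first segment (tsval 100) seen at 1,000,000 us sets the offset to 900,000 us. A later segment (tsval 300) seen at 1,180,000 us would compute to 1,200,000 us, which is in the future, so the offset drops to 880,000 us.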
The accuracy error in deducing the source operation time relative to the 
OSD server is:
timestamp tick period + min latency encountered between client and 
gateway + ntp clock skew between gateway and OSD server.
This may not yield high precision timing, but is more than adequate for 
timeout calculations between client and OSD server.

2) Guarded OSD Write Operations
We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME.. 
ops, or support hints with the existing versions, to pass a source operation 
time. A configurable timeout on the OSD server (osd_stale_op_timeout ?) 
will be used to reject stale write operations, with a suggested default 
value of 10 sec. Writes that are rejected even though they are not actually 
stale (false positives) will fail but will be retried by the client. Any 
operation time received in the future (by more than the ntp max skew) 
should be rejected. These new operations are generic enough that they may 
be used outside of iSCSI.
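
The check the OSD would apply to a guarded write can be sketched as below. This is not existing Ceph code or API: the enum, the function name guard_check() and the parameter names are hypothetical, times are in microseconds on the OSD clock, and the max skew is assumed to be a separately configured value.

```c
#include <stdint.h>

/* Hypothetical outcome of the guard check. */
enum guard_result {
	GUARD_ACCEPT,
	GUARD_REJECT_STALE,	/* older than the stale op timeout */
	GUARD_REJECT_FUTURE	/* ahead of the OSD clock by more than max skew */
};

/*
 * Decide whether a write carrying source_us (the deduced client source
 * operation time) should be applied when the OSD clock reads osd_now_us.
 */
static enum guard_result guard_check(int64_t source_us, int64_t osd_now_us,
				     int64_t stale_timeout_us,
				     int64_t max_skew_us)
{
	if (source_us > osd_now_us + max_skew_us)
		return GUARD_REJECT_FUTURE;
	if (osd_now_us - source_us > stale_timeout_us)
		return GUARD_REJECT_STALE;
	return GUARD_ACCEPT;
}
```

With the suggested 10 sec timeout, a write stamped 5 sec before the OSD's current time is accepted, while one stamped 15 sec before it is rejected as stale.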

Any comments/suggestions highly welcome.

Thanks /Maged