The purpose of this write-up is to outline a method for handling stale
writes that can occur when using iSCSI gateways with Ceph in an
active/active setup. Although timers such as osd_request_timeout can
expire stale io in most cases, in some edge cases the io can be held up
in the target gateway, either before it reaches krbd/tcmu or within the
network layer. Increasing the path failover timeout at the client
initiator (such as
replacement_timeout/TimeOutValue/LinkDownTime/DefaultTimeToWait) is not
guaranteed to solve the edge cases where inflight io is stuck somewhere
within the target stack.
The method outlined comprises 2 main ideas:
1) Calculate the source time at which a write operation was sent from
the client, as clocked by the client.
2) Add/enhance OSD write ops to accept a source operation time.
In effect we try to remove the gateway's role in determining timeouts or
attempting to guard against stale ios itself. The setup requires time
sync between the iSCSI gateway and the OSD server; there is no
restriction on the client initiator. Most gateway servers will be part
of a Ceph cluster anyway, so the ntp requirement is already met in most
cases.
1) Source operation time
The idea is to make use of TCP Timestamps (RFC 7323) sent from the
initiator to deduce the time at which it stamped the ethernet
frame/packet containing the iSCSI header with the write opcode. TCP
Timestamps are already used for round trip time estimates by the tcp
stack; there are several fields in the tcp_sock struct for this. We need
to add a new field
    struct tcp_sock {
            /* ... existing fields ... */
            u32 recvmsg_tsval; /* peer TSval at current/last read position */
    };
to hold the client timestamp at the current/last read stream position on
the socket. The read function tcp_recvmsg() in tcp.c needs to update the
recvmsg_tsval value as it iterates through the socket buffers, reading
the timestamp from each packet.
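The bookkeeping described above can be sketched in user space. This is only a model, not kernel code: struct seg, struct rx_state and rx_read() are hypothetical stand-ins for the socket buffers and tcp_recvmsg(), and only the name recvmsg_tsval comes from the proposal. It shows the stamp being taken from whichever segment supplies the bytes being read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one queued TCP segment (not struct sk_buff). */
struct seg {
    const uint8_t *data;
    size_t len;
    uint32_t tsval;          /* TSval carried by this segment */
};

/* Hypothetical stand-in for the receive side of a socket. */
struct rx_state {
    const struct seg *segs;
    size_t nsegs;
    size_t cur;              /* index of the segment being read */
    size_t off;              /* bytes already consumed from segs[cur] */
    uint32_t recvmsg_tsval;  /* TSval at the current/last read position */
};

/* Copy up to len bytes, recording the TSval of each segment read from. */
static size_t rx_read(struct rx_state *rx, uint8_t *buf, size_t len)
{
    size_t copied = 0;

    while (copied < len && rx->cur < rx->nsegs) {
        const struct seg *s = &rx->segs[rx->cur];
        size_t n = s->len - rx->off;

        if (n > len - copied)
            n = len - copied;
        /* The stamp follows the segment that supplies these bytes. */
        rx->recvmsg_tsval = s->tsval;
        memcpy(buf + copied, s->data + rx->off, n);
        copied += n;
        rx->off += n;
        if (rx->off == s->len) {
            /* Segment fully consumed, move on to the next one. */
            rx->cur++;
            rx->off = 0;
        }
    }
    return copied;
}
```

In this model a read that spans two segments ends up with the stamp of the later segment, which matches the "current/last read stream position" wording.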
In iscsi_target.c we need to record this timestamp in
iscsi_target_rx_opcode() since this is the stream position at the start
of a new op.
The source time at the client can be deduced from the tsval as follows:

    source_time = tsval * freq + offset;
    if (now < source_time) {
            /* min latency encountered, adjust/decrease offset */
            source_time = now;
            offset = now - tsval * freq;
    }

The freq/tick rate of the timestamping clock at the client (used above
as the time per timestamp tick) can be detected quickly, after a few
seconds of io. Some observations using different client initiators:
    Initiator   Tick (jiffies)
    Linux       1.0
    Win         0.25
    ESX         2.5
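One way the detection step could look is sketched below. It is a minimal sketch under assumptions: struct ts_sample and tick_period_ms() are hypothetical names, gateway time is taken in milliseconds, and two samples a few seconds apart are compared. Network jitter lands directly in the estimate, so a longer baseline or several sample pairs would be needed in practice.

```c
#include <stdint.h>

/* One observation: peer TSval and the gateway time (ms) when it was read. */
struct ts_sample {
    uint32_t tsval;
    uint64_t gw_ms;
};

/*
 * Estimate the time per TSval tick, in ms, from two samples.
 * Returns 0.0 when the samples cannot give an estimate.
 */
static double tick_period_ms(struct ts_sample first, struct ts_sample last)
{
    /* Unsigned subtraction copes with 32-bit TSval wraparound. */
    uint32_t ticks = last.tsval - first.tsval;

    if (ticks == 0 || last.gw_ms <= first.gw_ms)
        return 0.0;
    return (double)(last.gw_ms - first.gw_ms) / (double)ticks;
}
```

For example, a TSval that advances by 2000 ticks over 5000 ms of gateway time gives 2.5 ms per tick.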
Once the freq is detected, we use a calculated offset to compute the
client source operation time. Initially this will equal the gateway
time, but we keep lowering the offset each time we hit a min latency
condition.
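The offset tracking can be sketched as a small stateful helper. This is a sketch under assumptions rather than proposed kernel code: struct src_clock and source_time_ms() are hypothetical names, times are in milliseconds, period_ms is the already detected time per tick, and TSval wraparound is ignored for brevity.

```c
#include <stdint.h>

/* Per-connection state for mapping client TSvals to gateway time. */
struct src_clock {
    double period_ms;  /* detected time per TSval tick ("freq" in the text) */
    double offset_ms;  /* gateway time that corresponds to TSval 0 */
    int primed;        /* 0 until the first sample has set the offset */
};

/* Return the estimated client send time, expressed in gateway time (ms). */
static double source_time_ms(struct src_clock *c, uint32_t tsval, double now_ms)
{
    double ticks_ms = (double)tsval * c->period_ms;
    double src;

    if (!c->primed) {
        /* First sample: source time initially equals gateway time. */
        c->offset_ms = now_ms - ticks_ms;
        c->primed = 1;
        return now_ms;
    }
    src = ticks_ms + c->offset_ms;
    if (now_ms < src) {
        /* Min latency encountered: lower the offset. */
        src = now_ms;
        c->offset_ms = now_ms - ticks_ms;
    }
    return src;
}
```

With a 1 ms tick, a first sample (tsval 1000, now 50010) sets the offset to 49010. A faster packet (tsval 2000, now 51005) lowers it to 49005, and a later packet that was held up (tsval 3000, now 52500) then maps to source time 52005, so it is seen as 495 ms old.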
The accuracy error in deducing the source operation time relative to the
OSD server is:

      timestamp tick period
    + min latency encountered between client and gateway
    + ntp clock skew between gateway and OSD server

This may not yield high precision timing, but is more than adequate for
timeout calculations between client and OSD server.
2) Guarded OSD Write Operations
We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME
etc. ops, or support hints with the existing versions, to pass a source
operation time. A configurable timeout on the OSD server
(osd_stale_op_timeout ?) will be used to reject stale write operations,
with a suggested default value of 10 sec. Operations that are rejected
as stale without actually being stale io will fail, but will be retried
by the client. Any operation whose source time lies in the future by
more than the max ntp skew should be rejected. These new operations are
generic enough and may be used outside of iSCSI.
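The check on the OSD side could look roughly like the following. It is a minimal sketch with hypothetical names (guard_write(), enum guard_verdict): nothing here is an existing Ceph API, and all times are assumed to be milliseconds on the ntp-synced clock.

```c
#include <stdint.h>

enum guard_verdict {
    GUARD_ACCEPT,
    GUARD_REJECT_STALE,   /* older than the stale op timeout */
    GUARD_REJECT_FUTURE   /* ahead of OSD time by more than max skew */
};

/*
 * Decide whether a write carrying a source operation time may proceed.
 * stale_timeout_ms plays the role of the suggested osd_stale_op_timeout.
 */
static enum guard_verdict guard_write(int64_t source_ms, int64_t osd_now_ms,
                                      int64_t stale_timeout_ms,
                                      int64_t max_skew_ms)
{
    if (source_ms > osd_now_ms + max_skew_ms)
        return GUARD_REJECT_FUTURE;
    if (osd_now_ms - source_ms > stale_timeout_ms)
        return GUARD_REJECT_STALE;
    return GUARD_ACCEPT;
}
```

A source time slightly ahead of the OSD clock but within the skew allowance is accepted, since that much disagreement is expected between ntp-synced hosts.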
Any comments/suggestions are highly welcome.
Thanks /Maged