iSCSI active/active stale io guard


 



The purpose of this write-up is to outline a method for handling stale writes that can occur when using iSCSI gateways with Ceph in an active/active setup. Timers such as osd_request_timeout can expire stale io in most cases, but in some edge cases the io can be stuck in the target gateway, either before reaching krbd/tcmu or within the network layer. Increasing the path failover timeout at the client initiator (such as replacement_timeout/TimeOutValue/LinkDownTime/DefaultTimeToWait) is not guaranteed to solve these edge cases, since inflight io could be stuck anywhere within the target stack.

The method consists of 2 main ideas:
1) Calculate the source time a write operation was sent from the client as clocked by the client
2) Add/Enhance OSD write ops to accept a source operation time
In effect we try to remove the gateway's role in determining timeouts or in guarding against stale ios itself. The setup requires time sync between the iSCSI gateway and the OSD server; there is no restriction on the client initiator. Most gateway servers will be part of a Ceph cluster anyway, so the ntp requirement is already met in most cases.

1) Source operation time
The idea is to make use of TCP Timestamps (RFC 7323) sent from the initiator to deduce the time at which it stamped the packet containing the iSCSI header with the write opcode. TCP Timestamps are already used by the tcp stack for round trip time estimates, and there are several fields in the tcp_sock struct for this. We need to add a new field

struct tcp_sock {
	...
	u32 recvmsg_tsval;
	...
};

to hold the client timestamp at the current/last read stream position on the socket. The read function tcp_recvmsg() in tcp.c needs to update the recvmsg_tsval value as it iterates through the socket buffers, reading the timestamp from each packet. In iscsi_target.c we need to record this timestamp in iscsi_target_rx_opcode() since this is the stream position at the start of a new op.
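
As a rough illustration of where the tsval comes from, here is a minimal user-space sketch that pulls TSval out of the options area of a TCP header. It is not the proposed kernel change (the kernel already parses this option itself); the function name tcp_opts_get_tsval() and the OPT_* constants are made up for the example. The option layout (kind 8, length 10, TSval followed by TSecr) is as defined in RFC 7323.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* TCP option kinds and the Timestamps option length (RFC 7323). */
#define OPT_EOL            0
#define OPT_NOP            1
#define OPT_TIMESTAMP      8
#define OPT_TIMESTAMP_LEN  10

/*
 * Scan the options area of a TCP header and extract TSval.
 * Returns 1 and stores the value in *tsval if a Timestamps option is
 * found, 0 otherwise. Hypothetical helper, for illustration only.
 */
int tcp_opts_get_tsval(const uint8_t *opts, size_t len, uint32_t *tsval)
{
	size_t i = 0;

	assert(opts != NULL && tsval != NULL);

	while (i < len) {
		uint8_t kind = opts[i];
		uint8_t olen;

		if (kind == OPT_EOL)
			break;
		if (kind == OPT_NOP) {
			i++;
			continue;
		}
		/* Every other option has a length byte covering kind + len + data. */
		if (i + 1 >= len)
			break;
		olen = opts[i + 1];
		if (olen < 2 || i + olen > len)
			break;
		if (kind == OPT_TIMESTAMP && olen == OPT_TIMESTAMP_LEN) {
			/* TSval is the first 4 data bytes, in network byte order. */
			*tsval = ((uint32_t)opts[i + 2] << 24) |
				 ((uint32_t)opts[i + 3] << 16) |
				 ((uint32_t)opts[i + 4] << 8) |
				 (uint32_t)opts[i + 5];
			return 1;
		}
		i += olen;
	}
	return 0;
}
```

In the proposal itself this value would be copied into recvmsg_tsval as tcp_recvmsg() walks the socket buffers, rather than re-parsed from raw bytes.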

The source time at the client can be deduced from the tsval as follows:
source_time = tsval * freq + offset;

if (now < source_time) {
	/* min latency encountered, adjust/decrease offset */
	source_time = now;
	offset = now - tsval * freq;
}
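
The same logic as a self-contained sketch, assuming nanosecond units and a per-connection state struct. The names src_clock and src_clock_estimate() are hypothetical, the first sample simply sets the offset so that the source time equals the gateway time, and TSval wrap-around is ignored to keep it short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state; names are illustrative only. */
struct src_clock {
	int64_t tick_ns;    /* time per client timestamp tick ("freq" above) */
	int64_t offset_ns;  /* gateway time corresponding to tsval == 0 */
	int     primed;     /* 0 until the first sample has set the offset */
};

/*
 * Return the estimated client send time, on the gateway clock, for data
 * stamped with tsval and read at gateway time now_ns.
 */
int64_t src_clock_estimate(struct src_clock *c, uint32_t tsval, int64_t now_ns)
{
	int64_t ticks_ns, source_ns;

	assert(c != NULL && c->tick_ns > 0);

	ticks_ns = (int64_t)tsval * c->tick_ns;

	if (!c->primed) {
		/* First sample: assume zero latency, source time == gateway time. */
		c->offset_ns = now_ns - ticks_ns;
		c->primed = 1;
		return now_ns;
	}

	source_ns = ticks_ns + c->offset_ns;
	if (now_ns < source_ns) {
		/* New minimum latency encountered: lower the offset. */
		source_ns = now_ns;
		c->offset_ns = now_ns - ticks_ns;
	}
	return source_ns;
}
```

For example, with a 1 ms tick, a first sample (tsval 1000 read at 5.000 s) primes the offset. A later write stamped 1100 but read at 5.150 s is then reported with a source time of 5.100 s, i.e. it spent roughly 50 ms more in flight than the baseline.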

The freq/tick rate of the timestamping clock at the client can quickly be detected after a few seconds of io. Some observations using different client initiators:
Linux    1.0   jiffy
Windows  0.25  jiffies
ESX      2.5   jiffies

Once the freq is detected, we use a calculated offset to compute the client source operation time. Initially this will equal the gateway time, but we keep lowering the offset each time we hit a min latency condition. The error in the deduced source operation time relative to the OSD server is: timestamp tick period + min latency encountered between client and gateway + ntp clock skew between gateway and OSD server. This may not yield high precision timing, but it is more than adequate for timeout calculations between client and OSD server.
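
One possible way to detect the tick is to compare two (tsval, gateway time) samples taken a few seconds apart. This is only a sketch under that assumption: estimate_tick_ns() is a made-up name, units are nanoseconds, and latency jitter between the two samples shows up directly as error, so a longer window gives a better estimate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Estimate the time per client timestamp tick, in nanoseconds, from two
 * (tsval, gateway time) samples. Returns 0 if the samples cannot be used.
 * Hypothetical helper, for illustration only.
 */
int64_t estimate_tick_ns(uint32_t tsval_a, int64_t now_a_ns,
			 uint32_t tsval_b, int64_t now_b_ns)
{
	uint32_t dticks;

	assert(now_b_ns >= now_a_ns);

	/* Unsigned subtraction gives the right delta across a 32-bit wrap. */
	dticks = tsval_b - tsval_a;
	if (dticks == 0)
		return 0;

	return (now_b_ns - now_a_ns) / (int64_t)dticks;
}
```

Two samples 5 s apart whose tsval advanced by 5000 give a 1 ms tick.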


2) Guarded OSD Write Operations
We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME.. ops, or support hints with the existing versions, to pass a source operation time. A configurable timeout on the OSD server (osd_stale_op_timeout ?) will be used to reject stale write operations, with a suggested default value of 10 sec. Operations that are rejected even though they are not actually stale will fail, but will be retried by the client. Any operation whose source time lies in the future by more than the ntp max skew should also be rejected. These new operations are generic enough that they may be used outside of iSCSI.
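
The check on the OSD side could look roughly like the sketch below. This is not existing Ceph code: guard_write(), the verdict names and the millisecond units are assumptions for illustration, and the timeout and max skew would come from configuration.

```c
#include <assert.h>
#include <stdint.h>

enum guard_verdict {
	GUARD_ACCEPT = 0,
	GUARD_REJECT_STALE,   /* older than the stale op timeout */
	GUARD_REJECT_FUTURE   /* ahead of the OSD clock by more than max skew */
};

/*
 * Decide whether a write carrying source_ms should be applied.
 * All times are in milliseconds on the ntp-synchronised clock.
 * Hypothetical helper, for illustration only.
 */
enum guard_verdict guard_write(int64_t source_ms, int64_t osd_now_ms,
			       int64_t stale_timeout_ms, int64_t max_skew_ms)
{
	assert(stale_timeout_ms > 0 && max_skew_ms >= 0);

	if (source_ms > osd_now_ms + max_skew_ms)
		return GUARD_REJECT_FUTURE;
	if (osd_now_ms - source_ms > stale_timeout_ms)
		return GUARD_REJECT_STALE;
	return GUARD_ACCEPT;
}
```

With a 10 sec timeout, a write that left the client 3 sec ago is accepted, while one that sat somewhere in the target stack for 11 sec is rejected.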

Any comments/suggestions are highly welcome.

Thanks /Maged