Re: iSCSI active/active stale io guard

On 2018-04-02 20:00, Gregory Farnum wrote:

On Fri, Mar 23, 2018 at 8:02 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

On 2018-03-23 15:22, David Disseldorp wrote:

Hi Maged,

On Mon, 19 Mar 2018 01:43:38 +0200, Maged Mokhtar wrote:

2) Guarded OSD Write Operations
We need to add new versions of the CEPH_OSD_OP_WRITE/CEPH_OSD_OP_WRITESAME..
ops, or support hints on the existing versions, to pass a source operation
time. A configurable timeout on the OSD server (osd_stale_op_timeout ?)
will be used to reject stale write operations, with a suggested default
value of 10 sec. Operations rejected by the guard even though they are not
actually stale will fail but will be retried by the client. Any operation
time received that lies in the future (beyond the ntp max skew) should be
rejected. These new operations are generic enough and may be used outside
of iSCSI.
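
As a rough illustration of the guard condition (the option names and the
helper itself are hypothetical, not existing Ceph code), the OSD-side check
could look like this:

#include <chrono>

// Hypothetical staleness guard sketching the check described above.
// op_time is the source operation time the client attached to the write;
// osd_stale_op_timeout and max_ntp_skew would be OSD config options.
enum class GuardResult { Accept, RejectStale, RejectFutureTime };

GuardResult check_guarded_write(std::chrono::system_clock::time_point op_time,
                                std::chrono::seconds osd_stale_op_timeout, // e.g. 10 sec default
                                std::chrono::seconds max_ntp_skew)         // allowed clock skew
{
  const auto now = std::chrono::system_clock::now();

  // Operation times from the future (beyond the allowed skew) are rejected.
  if (op_time > now + max_ntp_skew)
    return GuardResult::RejectFutureTime;

  // Writes older than the stale-op timeout are rejected; the client retries.
  if (now - op_time > osd_stale_op_timeout)
    return GuardResult::RejectStale;

  return GuardResult::Accept;
}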

I think a RADOS class function that offers request expiration on the OSD
would be helpful. However, aside from concerns around the client time
synchronisation dependence, I'm a little unsure how this should be
handled on the RBD client / iSCSI gw side. Prepending the expiry
operation to the OSD request prior to a write op would only catch stale
requests while they are queued at the OSD; the subsequent write operation
could still be handled well after expiry. Ideally the expiration check
would be performed after the write, with rollback occurring on expiry.
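
As a very rough sketch only (the class and method names are made up, and
this is not existing Ceph code), such a prepended expiration check as a
RADOS class method might look like the following; as noted, checking before
the write still leaves the window where the write itself is applied after
expiry:

// Hypothetical RADOS object class providing a request-expiration check that
// a client could prepend to a write op. All names here are illustrative.
#include "objclass/objclass.h"
#include <cerrno>
#include <ctime>

CLS_VER(1,0)
CLS_NAME(expire_guard)

cls_handle_t h_class;
cls_method_handle_t h_check_expiry;

static int check_expiry(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
{
  uint64_t expiry_sec = 0;                 // absolute expiry time sent by the client
  try {
    bufferlist::iterator iter = in->begin();
    ::decode(expiry_sec, iter);
  } catch (const buffer::error &) {
    return -EINVAL;
  }

  // A real implementation would use the OSD's clock source; std::time()
  // keeps the sketch simple.
  if (static_cast<uint64_t>(std::time(nullptr)) > expiry_sec)
    return -EKEYEXPIRED;                   // failing here fails the whole compound op
  return 0;
}

CLS_INIT(expire_guard)
{
  cls_register("expire_guard", &h_class);
  cls_register_cxx_method(h_class, "check_expiry", CLS_METHOD_RD,
                          check_expiry, &h_check_expiry);
}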

Cheers, David

Hi David,

The iSCSI gateway will detect the time the initiator built the iSCSI CDB
header packet at the start of the write operation, then propagate this time
down to krbd/tcmu-librbd, which in turn will send it with all the OSD
requests making up this write request. The method outlined uses TCP
timestamps (RFC 7323) plus a simple scheme to establish a time sync between
the client initiator and the OSD server that is not affected by gateway
delays and does not require ntp running on the client.

If the write request arrives at the OSD within the allowed time, for example
within 10 sec, it will be allowed to proceed by this new guard condition.
This is OK even if there is high queue delay/commit latency at the OSD,
since our primary concern is the edge condition where a stale write arrives
at the OSD after newer writes were subsequently issued by the initiator, so
that stale data could overwrite newer data. By making sure the initiator is
configured to take longer than 10 sec to abort the task and retry it on a
different path, we are sure we will not have the case of old data
overwriting new data.
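
To make the flow concrete, here is a rough sketch of the gateway side (the
types and the source_op_time field are hypothetical; no such field exists in
krbd/librbd today): the gateway derives the time the initiator built the
CDB and stamps it onto every OSD request generated for that one SCSI write,
and the OSD then applies the guard check described earlier.

#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical per-request descriptor; a real implementation would extend
// krbd / tcmu-librbd and the OSD client messages instead.
struct OsdWriteRequest {
  uint64_t object_offset = 0;
  std::vector<char> data;
  std::chrono::system_clock::time_point source_op_time;  // proposed new hint
};

// The gateway derives the time the initiator built the SCSI CDB (e.g. from
// TCP timestamps, RFC 7323) and stamps every OSD request that the RBD layer
// generates for this one SCSI WRITE with that same source time.
void stamp_requests(std::vector<OsdWriteRequest> &osd_requests,
                    std::chrono::system_clock::time_point cdb_build_time)
{
  for (auto &req : osd_requests)
    req.source_op_time = cdb_build_time;
}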
I don't think timeouts are a good path forward for this. Among other
things, I'm concerned about how they'd interact with OSD recovery and
peering. Not to mention, what happens when an OSD and a client
disagree about whether an op was discarded due to timeout?

In the iSCSI case, if a 10-second wait period is acceptable to begin
with, then it seems much simpler for the "failover" initiator to blacklist
the failed one and force all the OSDs to assimilate that blacklist before
processing ops?
-Greg



Hi Greg,

You are right, the 10 sec is probably not a practical value to account for OSD failover; I think 25-30 sec would be more reasonable. This OSD failover time can be tuned via osd_heartbeat_grace + osd_heartbeat_interval.
The real reason for the proposed solution is to handle cases where a path
failover occurs yet the original target node is not dead and could have
inflight io stuck anywhere in its stack due to congestion or flaky network
connections, as well as the (more common) OSD down case. In some extreme
conditions such cases may lead to stale io overwriting newer io after a
path failover; these cases were very well described by Mike in:
https://www.spinics.net/lists/ceph-users/msg43402.html

The reason we do not perform initiator-side blacklisting is to support
non-Linux clients such as VMware ESX and Windows. It would be possible to
write custom client code on such platforms, but it is much simpler and more
generic to do it via timeouts, although it may not be the most elegant
solution. iSCSI MPIO does not provide any means for a target to detect
whether a received command is a retry of a command from another failed
path; iSCSI MCS (Multiple Connections per Session) supports this, but it is
supported neither by the Linux iSCSI target nor by the VMware initiator.
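
For comparison, on a Linux client that has librados available, the
blacklisting Greg suggests is essentially one call (or the equivalent
"ceph osd blacklist add" command); the sketch below uses the librados C API
and is exactly the piece we cannot assume on ESX or Windows initiators:

#include <rados/librados.h>
#include <cstdio>

// Sketch of the fencing step: blacklist the failed gateway's client address
// so the OSDs discard any of its in-flight ops. Assumes an already-connected
// rados_t cluster handle; newer Ceph releases call this "blocklist".
int fence_failed_gateway(rados_t cluster, const char *failed_client_addr)
{
  // failed_client_addr is the entity address of the old gateway's RBD client,
  // e.g. "192.168.1.10:0/123456" (address:port/nonce).
  uint32_t expire_seconds = 3600;  // e.g. keep the blacklist entry for one hour
  int r = rados_blacklist_add(cluster, (char *)failed_client_addr, expire_seconds);
  if (r < 0)
    fprintf(stderr, "blacklist add failed: %d\n", r);
  return r;
}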

If we take osd_stale_op_timeout to be the maximum allowed inflight time
between the client and the OSD, the timeouts must satisfy (worked example
below):

osd_heartbeat_grace + osd_heartbeat_interval < osd_stale_op_timeout
osd_stale_op_timeout < replacement_timeout   (Linux open-iscsi initiator)
osd_stale_op_timeout < RecoveryTimeout       (VMware/ESX initiator)
osd_stale_op_timeout < LinkDownTime          (Windows initiator)
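
As a worked example, with Ceph's default osd_heartbeat_grace = 20 and
osd_heartbeat_interval = 6, and purely illustrative initiator timeouts, the
relationships can be sanity-checked like this:

#include <cassert>

int main()
{
  // Ceph defaults (seconds); adjust to your cluster.
  const int osd_heartbeat_grace    = 20;
  const int osd_heartbeat_interval = 6;

  // Proposed new OSD option (illustrative value from this discussion).
  const int osd_stale_op_timeout   = 30;

  // Initiator-side path failover timeouts (illustrative values, configured
  // per client type).
  const int replacement_timeout    = 40;  // Linux open-iscsi
  const int recovery_timeout       = 40;  // VMware/ESX
  const int link_down_time         = 40;  // Windows

  // OSD failure detection must complete before writes start being rejected...
  assert(osd_heartbeat_grace + osd_heartbeat_interval < osd_stale_op_timeout);

  // ...and every initiator must wait longer than the stale-op timeout before
  // retrying a command on another path, so a stale write can never land
  // after its retry has already been issued.
  assert(osd_stale_op_timeout < replacement_timeout);
  assert(osd_stale_op_timeout < recovery_timeout);
  assert(osd_stale_op_timeout < link_down_time);
  return 0;
}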

If an OSD rejects a valid io that, due to high latency, was received after
osd_stale_op_timeout, this would be a false positive rejection: the command
will fail back to the initiator, which will retry it (depending on the
client, this happens at the upper SCSI or multipath layer). A genuinely
stale io will not be retried.

Maged


