Re: iSCSI active/active stale io guard

On Tue, Apr 3, 2018 at 10:27 AM, Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>> Ah sorry, I got my language wrong. I meant blacklisting the iSCSI
>> target; the initiator doesn't exist as far as Ceph is concerned. :)
>>
>>> If osd_stale_op_timeout = max allowed inflight time between client and OSD:
>>> osd_heartbeat_grace + osd_heartbeat_interval < osd_stale_op_timeout
>>> osd_stale_op_timeout < replacement_timeout   (Linux open-iscsi initiator)
>>> osd_stale_op_timeout < RecoveryTimeout       (VMware/ESX initiator)
>>> osd_stale_op_timeout < LinkDownTime          (Windows initiator)
>>>
>>> If an OSD rejects a valid io that, due to high latency, was received
>>> after osd_stale_op_timeout, this would be a false positive rejection
>>> and the command will fail back to the initiator, which will retry it
>>> (depending on the client this could happen at the upper SCSI or
>>> multipath layers). A stale io will not be retried.
>>
>>
>> Anyway, I don't mean the timeouts are a problem because peering takes
>> time. I mean defining and understanding how to handle them during
>> transitions is hard, verging on impossible.
>>
>> First of all, once an op is submitted to the OSD, you can't really
>> undo it. There is no "max allowed inflight time"; people go to a
>> great deal of trouble trying to simulate that property, or write
>> code that pretends it exists and then just ends the world if the
>> network somehow exceeds that time (real-world networks exceed any
>> given time you want to propose. They suck. It's impossible to
>> believe how long packets can spend in transit.). This is a
>> fundamental property of switched-network systems: maybe our prior
>> average latencies were 100 microseconds, but something just happened
>> and this one packet took 10 seconds, or a central router died and
>> we're now suddenly trying to route 10 GB/s of traffic through 3 GB/s
>> of capacity around a ring. So the client needs to be able to deal
>> with the OSD completing an op that the client thinks it shouldn't
>> have — that means a simple timeout is just not sufficient to assume
>> the op is no longer going to happen. I think this already scuttles
>> your plan.
>> (You can sort of work around this one, if you're very ambitious. The
>> server could make its best guess about the "real" timeout and reject
>> the op on a best-effort basis, but the client wouldn't be able to
>> start its "definitely-canceled" timer until it got a *response* from
>> the OSD indicating the message came in. Given that we don't control
>> the real client's timeouts directly, I think that basically means
>> this is already impossible.)

Okay, I realize now that you want to rely on having time sync across
the cluster. That isn't impossible but is something we've shied away
from in the past and will probably be your biggest barrier to getting
something like this merged.

I guess it does mean you aren't so worried about outlier transit
times here, as long as your timeouts are sufficiently larger than the
clock error. So maybe there's a plausible solution to the problem.
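
Just to make that concrete, here is a rough sketch of the kind of check
I understand you're proposing, with the clock error folded in. All the
names and numbers are purely illustrative, not actual Ceph options or
code:

    import time

    # Purely illustrative values -- not real Ceph options or defaults.
    osd_heartbeat_interval = 6.0
    osd_heartbeat_grace    = 20.0
    osd_stale_op_timeout   = 30.0   # max allowed client->OSD inflight time
    max_clock_skew         = 1.0    # assumed bound on client/OSD clock error
    replacement_timeout    = 40.0   # e.g. open-iscsi initiator failover timeout

    # The ordering from your mail, with the clock error charged to the
    # initiator side: the initiator must not retry on another path until
    # every OSD, even with a skewed clock, already treats the op as stale.
    assert osd_heartbeat_grace + osd_heartbeat_interval < osd_stale_op_timeout
    assert osd_stale_op_timeout + max_clock_skew < replacement_timeout

    def osd_should_reject(client_timestamp, osd_now=None):
        """Reject an op whose client-stamped submit time is older than
        osd_stale_op_timeout by the OSD's own clock.  A false positive
        only costs the initiator a retry; a false negative is what would
        let stale data overwrite new data."""
        osd_now = time.time() if osd_now is None else osd_now
        return (osd_now - client_timestamp) > osd_stale_op_timeout

The skew bound effectively becomes another number the initiator-side
timeouts have to leave room for.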

>>
>> Next, imagine a 10-second time-limited op came in with an acting set
>> of OSD 1, OSD 2, and it was replicated from OSD 1 to OSD 2. But then
>> OSD 1 (the primary) died. OSD 3 gets added to the PG after 15 seconds
>> (either it was down and came up, or it just got picked as the next
>> replica and is empty; doesn't matter). OSD 2 has *no idea* whether the
>> operation was ack'ed to the client. What is the correct behavior?
>> -Greg
>
>
> Hi Greg,
>
> I think I need to clarify what this solution solves. Its purpose is to
> deal with a very specific corner case which can occur if you set up
> iSCSI gateways with Ceph in a scale-out active/active configuration
> where all gateways serve the same iSCSI disk/LUN. This edge case was
> discussed earlier on the mailing list, as per the link in my previous
> post. I will summarize it as follows:
> 1) Client sends a write op to iSCSI gateway A but does not get a reply.
> Gateway A is not responding but is not down and has stale io in flight.
> 2) Client times out and resends the op to gateway B, which returns
> quickly.
> 3) Client sends a new write op to gateway C; the new op overlaps the
> earlier write's sector/extent, and the new data is written quickly.
> 4) Gateway A becomes active again and its old data overwrites the new
> data.
>
> The proposed solution is specific to this case: it stops old data from
> overwriting new data. It is not meant to be a generic mechanism for
> aborting in-flight OSD or iSCSI operations.

Yeah, I understand that, but you still need to define the complete
Ceph behavior. In the case I posited, it sounds like once something
goes to the replicas it's okay for that to be committed data. That's
good; it's a defined behavior that's easy to implement and understand.

Going forward, I'd recommend the guarded write have an expiry time
rather than a source time — clients are a lot closer to understanding
the relevant timeouts than the OSDs are and a misconfiguration would
be very bad, so just take it out of the picture.
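
As a rough sketch of what I mean (again, the names here are made up
for illustration -- this isn't real Ceph or gateway code): the gateway
stamps each guarded write with an absolute deadline derived from the
initiator failover timeout it already knows about, and the OSD only
has to compare that deadline against its own clock, with no stale-op
timeout option of its own to get wrong:

    import time

    def stamp_guarded_write(op, inflight_budget):
        """Gateway side: attach an absolute expiry instead of a submit
        time.  inflight_budget would be derived from the initiator
        failover timeout (replacement_timeout / RecoveryTimeout /
        LinkDownTime) minus some slack for clock skew, so the OSD needs
        no timeout configuration at all."""
        op["expires_at"] = time.time() + inflight_budget
        return op

    def osd_should_reject(op, osd_now=None):
        """OSD side: nothing to (mis)configure -- just drop the guarded
        write if its client-supplied deadline has already passed."""
        osd_now = time.time() if osd_now is None else osd_now
        return osd_now > op["expires_at"]

That way a bad value only weakens the guard for writes going through
the gateway that set it, rather than for the whole cluster.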
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


