Re: Stretch cluster questions

>The only situation I could imagine this not being guaranteed (both DCs holding 2 copies at all times in a healthy state) is that writes happen while one DC is down, the down DC comes back up, and the other DC goes down before recovery finishes. However, in that case stretch mode will not help either.

That's exactly why I didn't see the benefit of stretch mode. According to the docs, Ceph writes as many copies as possible (to as many target buckets as are available) before the client receives the acknowledgement.
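
(For illustration, the kind of CRUSH rule the stretch-mode docs use to pin two copies to each datacenter looks roughly like this; the bucket names DC1/DC2 are placeholders:)

    rule stretch_rule {
            id 1
            type replicated
            step take DC1
            step chooseleaf firstn 2 type host
            step emit
            step take DC2
            step chooseleaf firstn 2 type host
            step emit
    }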

EC doesn't seem to be possible with stretch mode at the moment. Worse, once stretch mode is activated, it appears to be impossible to create erasure-coded pools at all.
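
(For context, activation roughly follows the sequence from the docs; the monitor names a-e and the rule name are placeholders, with e being the tie-breaker in a third location:)

    ceph mon set election_strategy connectivity
    ceph mon set_location a datacenter=DC1
    ceph mon set_location b datacenter=DC1
    ceph mon set_location c datacenter=DC2
    ceph mon set_location d datacenter=DC2
    ceph mon set_location e datacenter=DC3
    ceph mon enable_stretch_mode e stretch_rule datacenter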

If you're going for 6(2), you might as well spin up two independent Ceph clusters with 3(2) or EC each and sync between them, depending on the intended use.
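
(If the use case is RBD, the sync between the two clusters could be rbd-mirror; a rough sketch, with pool/image/site names as placeholders and an rbd-mirror daemon running for each cluster:)

    # on both clusters: enable per-image mirroring on the pool
    rbd mirror pool enable rbd image

    # on site-a: create a bootstrap token and copy it to site-b
    rbd mirror pool peer bootstrap create --site-name site-a rbd > token

    # on site-b: import the token to establish the peering
    rbd mirror pool peer bootstrap import --site-name site-b rbd token

    # per image: journal-based mirroring needs the journaling feature
    rbd feature enable rbd/myimage journaling
    rbd mirror image enable rbd/myimage journal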


Best regards

On May 10, 2022 10:28:22 AM GMT+02:00, Frank Schilder <frans@xxxxxx> wrote:
>> What you are missing from stretch mode is that your CRUSH rule wouldn't
>> guarantee at least one copy in the surviving room (min_size=2 can be
>> achieved with 2 copies in the lost room).
>
>I'm afraid this deserves a bit more explanation. How would it be possible that, when both sites are up with a 4(2) replicated rule, a committed write does not guarantee all 4 copies to be present? As far as I understand the description of Ceph's IO path, if all members of a PG are up, a write is only acknowledged to the client after all shards/copies have been committed to disk.
>
>In other words, with a 4(2) rule with 2 copies per DC, if one DC goes down you *always* have 2 live copies and still have read access in the other DC. Setting min_size to 1 would allow write access too, albeit with a risk of data loss (a 4(2) rule is really not secure for a 2-DC HA set-up, as in the degraded state you end up with 2(1) in one DC; it's much better to use a wide EC profile with m>k to achieve redundant single-site writes).
>
>The only situation I could imagine this not being guaranteed (both DCs holding 2 copies at all times in a healthy state) is that writes happen while one DC is down, the down DC comes back up, and the other DC goes down before recovery finishes. However, in that case stretch mode will not help either.
>
>My understanding of the useful part is that stretch mode elects one monitor to be special and act as a tie-breaker in case a DC goes down or a split-brain situation occurs between the 2 DCs. The change of min_size in the stretch rule looks a bit useless and even dangerous to me. A stretched cluster should be designed to have a secure redundancy scheme per site and, for replicated rules, that would mean size=6, min_size=2 (degraded 3(2)). Much better seems to be something like an EC profile with k=4, m=6 and 5 shards per DC, which has only 150% overhead compared with the 500% overhead of a 6(2) replicated rule.
>
>Best regards,
>=================
>Frank Schilder
>AIT Risø Campus
>Bygning 109, rum S14
>
>________________________________________
>From: Eneko Lacunza <elacunza@xxxxxxxxx>
>Sent: 09 May 2022 11:45
>To: Maximilian Hill
>Cc: ceph-users
>Subject:  Re: Stretch cluster questions
>
>Hi,
>
>On 9/5/22 at 11:05, Maximilian Hill wrote:
>> Sure, let's say min_size is 2 (CRUSH rule and pool).
>>
>> This would also allow me to read and write when a site is down, and the cluster would be able to recover when the site comes back. Is there any more to stretch mode, or is this essentially it?
>>
>> I'm just wondering whether the magic happens in stretch mode itself, or in the other actions taken according to the documentation.
>
>What you are missing from stretch mode is that your CRUSH rule wouldn't
>guarantee at least one copy in the surviving room (min_size=2 can be
>achieved with 2 copies in the lost room). You may lose data until the
>room is recovered (no copies present), or you may have blocked I/O until
>a second copy is peered in the surviving room. You would need
>size=4/min=3 to have at least one guaranteed copy in each room; but I/O
>would block until min_size is adjusted when a room is lost.
>
>Also, the surviving room will have blocked I/O if an OSD crashes/reboots
>etc., until you adjust min_size.
>
>Cheers
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



