Re: Stretch cluster questions

Hi Gregory,

thanks for the clarification.

> I'm not quite clear where the confusion is coming from here ...

It's because sometimes an important statement needs to be repeated in a way that emphasizes what it is really about:

>> Is this because the following: "the OSDs will only take PGs active when
>> they peer across data centers (or whatever other CRUSH bucket type you
>> specified), assuming both are alive"?

is not as obvious to understand as

> Right, so you just skipped over the part that it helps with: stretch
> mode *guarantees* that a PG has OSDs from both DCs in its acting set
> before the PG can finish peering.

I didn't skip it; I just didn't understand it in the way it was intended, and I seem not to be the only one. I would actually recommend adding this phrase, with stress, as in ...

Stretch mode *guarantees* that a PG has OSDs from *both* DCs in its acting set *before* the PG can finish peering and become writeable. This ensures that even if an active PG with active_size==min_size acknowledges a write, each DC holds at least one fully current copy, so recovery can succeed within a single DC should the other DC go down.

... to the docs, or replacing some of the existing text with it. With this, the automatic size and min_size magic also makes a lot more sense, although I would still enforce min_size>=2 in every case through an automatic mechanism. min_size=1 should only ever be a manual choice. The 2 DC stretch mode sounds like you really, really want to avoid a split brain, and that you need some kind of geographically distant third mini-DC for the tie-breaker monitor (thinking about the meteor case).

To get back to the question of why one would want more than one stretch rule even though they are all replicated: it's because stretch rules can use different device classes. If this is not supported yet, it seems important.

Thanks again for reformulating the point of stretch mode. For 2DC set-ups it does indeed seem important, and it also sounds like stretched EC rules should be supported too, because a lot of people use these instead of something crazy like REP 6(4).

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Gregory Farnum <gfarnum@xxxxxxxxxx>
Sent: 17 May 2022 00:56:42
To: Frank Schilder
Cc: Eneko Lacunza; ceph-users
Subject: Re:  Re: Stretch cluster questions

I'm not quite clear where the confusion is coming from here, but there
are some misunderstandings. Let me go over it a bit:

On Tue, May 10, 2022 at 1:29 AM Frank Schilder <frans@xxxxxx> wrote:
>
> > What you are missing from stretch mode is that your CRUSH rule wouldn't
> > guarantee at least one copy in surviving room (min_size=2 can be
> > achieved with 2 copies in lost room).
>
> I'm afraid this deserves a bit more explanation. How would it be possible that, when both sites are up and with a 4(2) replicated rule, that a committed write does not guarantee all 4 copies to be present? As far as I understood the description of ceph's IO path, if all members of a PG are up, a write is only acknowledged to a client after all shards/copies have been committed to disk.

So in a perfectly normal PG of size 4, min_size 2, the OSDs are happy
to end peering and go active with only 2 up OSDs. That's what min_size
means. A PG won't serve IO until it's active, and it requires min_size
participants to do so — but once it's active, it acknowledges writes
once the live participants have written them down.

> In other words, with a 4(2) rule with 2 copies per DC, if one DC goes down you *always* have 2 live copies and still read access in the other DC. Setting min-size to 1 would allow write access too, albeit with a risk of data loss (a 4(2) rule is really not secure for a 2DC HA set-up, as in degraded state you end up with 2(1) in 1 DC; it's much better to use a wide EC profile with m>k to achieve redundant single-site writes).

Nope, there is no read access to a PG which doesn't have min_size
active copies. And if you have 4 *live* copies and lose a DC, yes, you
still have two copies. But consider an alternative scenario:
1) 2 copies in each of 2 DCs.
2) Two OSDs in DC 1 restart, which happen to share PG x.
3) PG x goes active with the remaining two OSDs in DC 2.

Does (3) make sense there?

So now add in step 4:
4) DC 2 gets hit by a meteor.

Now, you have no current copies of PG x because the only current
copies got hit by a meteor.

>
> The only situation I could imagine this not being guaranteed (both DCs holding 2 copies at all times in healthy condition) is that writes happen while one DC is down, the down DC comes up and the other DC goes down before recovery finishes. However, then stretch mode will not help either.

Right, so you just skipped over the part that it helps with: stretch
mode *guarantees* that a PG has OSDs from both DCs in its acting set
before the PG can finish peering. Redoing the scenario from before:
1) 2 copies in each of 2 DCs,
2) Two OSDs in DC 1 restart, which happen to share PG x
3) PG x cannot go active because it lacks a replica in DC 1.
4) DC 2 gets hit by a meteor
5) All OSDs in DC 1 come back up
6) All PGs go active

So stretch mode adds another dimension to "the PG can finish peering
and go active" which includes the CRUSH buckets as a requirement, in
addition to a simple count of the replicas.
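
The two scenarios above can be condensed into a toy model. This is a minimal Python sketch, not Ceph's peering code: the `OSD` class and the `can_go_active` function are hypothetical names invented for illustration, and the stretch-mode check is reduced to "the up OSDs must span every required DC". It ignores degraded stretch mode, in which that requirement is relaxed once a DC has been declared dead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OSD:
    """Hypothetical stand-in for an OSD in a PG's acting set."""
    name: str
    dc: str
    up: bool = True


def can_go_active(acting, min_size, required_dcs=None):
    """Toy peering check (an illustration, not Ceph's implementation).

    Without required_dcs, only the count of up OSDs matters (plain min_size).
    With required_dcs, the up OSDs must also cover every listed DC, which is
    the extra dimension stretch mode adds.
    """
    up_osds = [osd for osd in acting if osd.up]
    if len(up_osds) < min_size:
        return False
    if required_dcs is None:
        return True
    return set(required_dcs) <= {osd.dc for osd in up_osds}


# Step 2 of both scenarios: the two DC 1 OSDs holding PG x are restarting.
pg_x = [
    OSD("osd.0", "dc1", up=False),
    OSD("osd.1", "dc1", up=False),
    OSD("osd.2", "dc2"),
    OSD("osd.3", "dc2"),
]

# Plain size 4, min_size 2: the PG goes active on DC 2 alone.
print(can_go_active(pg_x, min_size=2))  # True

# Stretch mode: no replica from DC 1, so the PG cannot go active.
print(can_go_active(pg_x, min_size=2, required_dcs={"dc1", "dc2"}))  # False
```

In the first case, writes acknowledged while only DC 2 is active exist nowhere else, which is exactly what the meteor in step 4 destroys. In the second case no such writes can happen, so the copies in DC 1 are still current when they come back.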

> My understanding of the useful part is, that stretch mode elects one monitor to be special and act as a tie-breaker in case a DC goes down or a split brain situation occurs between 2DCs. The change of min-size in the stretch-rule looks a bit useless and even dangerous to me. A stretched cluster should be designed to have a secure redundancy scheme per site and, for replicated rules, that would mean size=6, min_size=2 (degraded 3(2)). Much better seems to be something like an EC profile k=4, m=6 with 5 shards per DC, which has only 150% overhead compared with 500% overhead of a 6(2) replicated rule.

Yeah, the min_size change is because you don't want to become
unavailable when rebooting any of your surviving nodes. When you lose
a DC, you effectively go from running an ultra-secure 4-copy system to
a rather-less-secure 2-copy system. And generally people with 2-copy
systems want their data to still be available when doing maintenance.
;)
(Plus, well, hopefully you either get the other data center back
quickly, or you can expand the cluster to get to a nice safe 3-copy
system.)
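
As a side note on the overhead figures in the quoted paragraph above (150% for an EC profile with k=4, m=6 versus 500% for a 6(2) replicated rule), the arithmetic can be checked with a short sketch. The function names are made up for illustration, and "overhead" is assumed here to mean extra raw capacity relative to the user data stored.

```python
def replicated_overhead(size):
    """Extra raw capacity per unit of user data for a replicated pool."""
    return float(size - 1)


def ec_overhead(k, m):
    """Extra raw capacity per unit of user data for an EC profile k+m."""
    return m / k


print(f"{replicated_overhead(6):.0%}")  # 500%
print(f"{ec_overhead(4, 6):.0%}")       # 150%

# With k=4, m=6 split evenly, each DC holds 5 shards, which is at least k,
# so a single surviving DC can still reconstruct the data.
k, m = 4, 6
shards_per_dc = (k + m) // 2
print(shards_per_dc >= k)  # True
```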

But, yes. As Maximilian suggests, the use case for stretch mode is
pretty specific. If you're using RGW, you should be better-served by
its multisite feature, and if your application can stomach
asynchronous replication that will be much less expensive. RBD has
both sync and async replication options across clusters.
But sometimes you really just want exactly the same data in exactly
the same place at the same time. That's what stretch mode is for.
-Greg

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



