Re: Stretch cluster questions

Hi Eneko.

> You're thinking about a cluster in health_ok state before the DC goes down. That may not be the case.
> If you have a pair of hosts in the surviving DC down when a DC goes down and peering has not
> completed, you have IO but only 2 copies (min_size=2) in the lost DC.
> 
> At least this is what I understand about the matter. :)

That's an interesting corner case. Again, my understanding of Ceph's IO path is that peering PGs do *not* acknowledge writes. Hence, there should be *no* write acknowledged with copies only in the down DC, because the *unpeered* PGs with two OSDs up and two unknown will not receive writes. During peering, IO to the affected PGs stalls; at least that's what I see on my cluster. Once the DC is down and all PGs have completed peering, everything should proceed unless you try to target a PG that also has a copy on the down hosts. But that is a much more degraded situation (1 DC + 2 hosts in the other DC down) than a simple stretch rule can handle. So the expected behaviour is that IO stops unless you have a sufficiently wide replication profile that can handle a DC + 2 hosts being down.

Again, I don't see how stretch mode on the rule helps to change any of this for the better.

However, if my understanding of IO to unpeered PGs is incorrect, the situation might be different.
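
One way to verify that on a test cluster would be to watch the PG states while simulating the outage; roughly something like the following (pool name and PG id are just placeholders):

    # PGs that are stuck inactive/peering should not accept client IO
    ceph pg dump_stuck inactive
    # drill into one PG to see its state and current peering stage
    ceph pg 2.1a query | jq '.state, .recovery_state[0].name'
    # and confirm the pool's replication settings
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

Any PG still listed as peering or unknown there should be stalling writes rather than committing them with fewer copies.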

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eneko Lacunza <elacunza@xxxxxxxxx>
Sent: 10 May 2022 10:44
To: Frank Schilder
Cc: ceph-users
Subject: Re:  Re: Stretch cluster questions

Hi,

On 10/5/22 at 10:28, Frank Schilder wrote:

>> What you are missing from stretch mode is that your CRUSH rule wouldn't
>> guarantee at least one copy in the surviving room (min_size=2 can be
>> achieved with 2 copies in the lost room).



> I'm afraid this deserves a bit more explanation. How would it be possible that, when both sites are up with a 4(2) replicated rule, a committed write does not guarantee all 4 copies to be present? As far as I understood the description of Ceph's IO path, if all members of a PG are up, a write is only acknowledged to a client after all shards/copies have been committed to disk.

> In other words, with a 4(2) rule with 2 copies per DC, if one DC goes down you *always* have 2 live copies and still have read access in the other DC. Setting min_size to 1 would allow write access too, albeit with a risk of data loss (a 4(2) rule is really not secure for a 2-DC HA set-up, as in the degraded state you end up with 2(1) in 1 DC; it's much better to use a wide EC profile with m>k to achieve redundant single-site writes).
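
For concreteness, a rule of the kind discussed here (size=4 with 2 copies per DC) might look roughly like the sketch below; the rule name, id and the "datacenter" bucket type are illustrative and depend on the actual CRUSH tree:

    rule replicated_2dc {
        id 5
        type replicated
        # pick 2 datacenters, then 2 hosts (one OSD each) in every datacenter
        step take default
        step choose firstn 2 type datacenter
        step chooseleaf firstn 2 type host
        step emit
    }

With size=4 this places 2 copies per DC while both sites are healthy; min_size then decides how many of those must be available to accept IO.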

You're thinking about a cluster in health_ok state before the DC goes down. That may not be the case. If you have a pair of hosts in the surviving DC down when a DC goes down and peering has not completed, you have IO but only 2 copies (min_size=2) in the lost DC.

At least this is what I understand about the matter. :)




> The only situation I could imagine in which this is not guaranteed (both DCs holding 2 copies at all times in healthy condition) is that writes happen while one DC is down, the down DC comes back up, and the other DC goes down before recovery finishes. However, then stretch mode will not help either.

> My understanding of the useful part is that stretch mode elects one monitor to be special and act as a tie-breaker in case a DC goes down or a split-brain situation occurs between the 2 DCs. The change of min_size in the stretch rule looks a bit useless and even dangerous to me. A stretched cluster should be designed to have a secure redundancy scheme per site and, for replicated rules, that would mean size=6, min_size=2 (degraded 3(2)). Much better seems to be something like an EC profile with k=4, m=6 and 5 shards per DC, which has only 150% overhead compared with the 500% overhead of a 6(2) replicated rule.
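
For concreteness, such a profile could be created roughly like this (profile and pool names are made up; placing exactly 5 of the 10 shards per DC would additionally need a CRUSH rule that picks 2 datacenters and 5 hosts in each):

    # 4 data + 6 coding chunks: raw usage is (4+6)/4 = 2.5x, i.e. 150% overhead,
    # versus 6x raw usage (500% overhead) for a size=6 replicated pool
    ceph osd erasure-code-profile set ec-4-6 k=4 m=6 crush-failure-domain=host
    ceph osd pool create ecpool 128 128 erasure ec-4-6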

You define the special monitor in the stretch cluster setup.
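
As far as I understand the docs, the tie-breaker is the mon you give its own location and then name when enabling stretch mode; roughly like this, assuming a CRUSH rule called stretch_rule already exists (mon and datacenter names below are placeholders):

    # two mons per data centre plus one tie-breaker in a third location
    ceph mon set_location a datacenter=dc1
    ceph mon set_location b datacenter=dc1
    ceph mon set_location c datacenter=dc2
    ceph mon set_location d datacenter=dc2
    ceph mon set_location e datacenter=dc3
    # enable stretch mode, naming the tie-breaker mon and the stretch CRUSH rule
    ceph mon enable_stretch_mode e stretch_rule datacenter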

The changes I suggested to Gregory would allow a size=6 min_size=3 (degraded 6/2) scenario. https://tracker.ceph.com/issues/55573

Although I'm considering that maybe size=6, min_size=4 (degraded 6/2) would be better.
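
For reference, on an ordinary (non-stretch) pool those values are just the usual pool settings, e.g. with a made-up pool name:

    ceph osd pool set mypool size 6
    ceph osd pool set mypool min_size 4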

Cheers,


Eneko Lacunza
Zuzendari teknikoa | Director técnico
Binovo IT Human Project

Tel. +34 943 569 206 | https://www.binovo.es
Astigarragako Bidea, 2 - 2º izda. Oficina 10-11, 20180 Oiartzun

https://www.youtube.com/user/CANALBINOVO
https://www.linkedin.com/company/37269706/

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



