Hi,
while I basically agree with Frank's response (e.g. min_size = 2), I
disagree that it won't work without stretch mode. We have a customer
with a similar setup: two datacenters and a third mon in a different
location. This setup has proven Ceph's resiliency multiple times; due
to hardware issues in the power supplies they experienced two or
three power outages in one DC without data loss. They use an
erasure-coded pool stretched across these two DCs, and the third mon
is, of course, reachable from both DCs. This works quite well, and
they are very happy with Ceph's resiliency. The cluster is still
running on Nautilus.
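I don't know if it helps, but just for illustration: a 2+2 EC pool
split across two DCs typically uses a crush rule along these lines
(the rule name, id and exact numbers here are made up, not the
customer's actual configuration):

rule ec_stretch_2dc {
    id 1
    type erasure
    min_size 3
    max_size 4
    step set_chooseleaf_tries 5
    step set_choose_tries 100
    step take default
    step choose indep 2 type datacenter
    step chooseleaf indep 2 type host
    step emit
}

The idea is simply that crush picks two DCs and then two hosts in
each, so each DC holds half of the shards.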
Regards,
Eugen
Quoting Frank Schilder <frans@xxxxxx>:
Hi Dale,
"we thought we had set it up to prevent... and with size = 4 and
min_size = 1"
I'm afraid this is exactly what you didn't. Firstly, min_size=1 is
always a bad idea. Secondly, if you have 2 data centres, the only
way to get this to work is to use stretch mode. Even if you had
min_size=2 (which, by the way, you should have in any case), without
stretch mode you would not be guaranteed that all PGs are
active+clean after one DC goes down (or a cable gets cut). There is
a quite long and very detailed explanation of why this is the case,
and with min_size=1 you are very likely to hit one of these cases or
even lose data.
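Raising it is a one-line change per pool; just as a sketch, with
<pool> as a placeholder for your pool name:

  ceph osd pool set <pool> min_size 2
  ceph osd pool get <pool> min_size   # verify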
What you could check in your situation are these two:
mon_osd_min_up_ratio
mon_osd_min_in_ratio
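Just to make that concrete, you can query the current values on the
running mons, for example:

  ceph config get mon mon_osd_min_up_ratio
  ceph config get mon mon_osd_min_in_ratio

(On older releases, "ceph daemon mon.<id> config get <option>" on the
mon host shows the same information.)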
My guess is that these two settings prevented the mons from marking
sufficiently many OSDs as out, and therefore the PGs got stuck
peering (maybe nothing was even marked down?). The other thing is
that you almost certainly had exactly the split-brain situation that
stretch mode is there to prevent. You probably ended up with 2
sub-clusters of 2 mons each (the local mon plus the tie-breaker
reachable over VPN), and then what? If the third mon could still see
the other 2, I don't think you get a meaningful quorum. Stretch mode
will actually change the crush rule: based on a decision by the
tie-breaking monitor, it re-configures the pool to use only OSDs in
one of the 2 DCs, so that no cross-site peering happens.
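For reference, enabling it (only available on Pacific and later)
looks roughly like this; the mon names and the rule name below are
placeholders, and a suitable stretch crush rule has to exist first:

  ceph mon set_location a datacenter=EDM1
  ceph mon set_location b datacenter=EDM3
  ceph mon set_location c datacenter=EDM2
  ceph mon enable_stretch_mode c stretch_rule datacenter

Here c is the tie-breaker mon in the third site and "datacenter" is
the crush bucket type the cluster is divided across.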
Maybe if you explicitly shut down one of the DC mons you can get
things working in one of the DCs?
Without stretch mode you need 3 DCs and a geo-replicated 3(2) pool
(size=3, min_size=2, one copy per DC).
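Something along these lines would do it (rule name and id are made
up; note that the min_size/max_size fields in the rule are not the
pool's min_size), combined with size=3 and min_size=2 on the pool:

rule replicated_3dc {
    id 1
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type datacenter
    step emit
}

With 3 datacenters this places one copy in each DC, so any single DC
can fail while the pool stays at or above min_size.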
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Wolfpaw - Dale Corse <dale@xxxxxxxxxxx>
Sent: 29 November 2022 07:20:20
To: 'ceph-users'
Subject: PGs stuck down
Hi All,
We had a fiber cut tonight between 2 data centers, and a Ceph cluster
didn't do very well :( We ended up with 98% of PGs down.
This setup has 2 data centers defined, with 4 copies across both and a
min_size of 1. We have 1 mon/mgr in each DC, plus one in a 3rd data
center connected to each of the other 2 by VPN.
When I did a pg query on the PGs that were stuck, it said they were
blocked from coming up because they couldn't contact 2 of the OSDs
(located in the other data center, which they were unable to reach),
but the other 2 were fine.
I'm at a loss because this is exactly the thing we thought we had set
it up to prevent... and with size = 4 and min_size = 1 I understood
that it would continue without a problem? :(
Crush map is below... if anyone has any ideas, I would sincerely
appreciate it :)
Thanks!
Dale
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable straw_calc_version 1
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
device 2 osd.2 class ssd
device 3 osd.3 class ssd
device 4 osd.4 class ssd
device 5 osd.5 class ssd
device 6 osd.6 class ssd
device 7 osd.7 class ssd
device 8 osd.8 class ssd
device 9 osd.9 class ssd
device 10 osd.10 class ssd
device 11 osd.11 class ssd
device 12 osd.12 class ssd
device 13 osd.13 class ssd
device 14 osd.14 class ssd
device 15 osd.15 class ssd
device 16 osd.16 class ssd
device 17 osd.17 class ssd
device 18 osd.18 class ssd
device 19 osd.19 class ssd
device 20 osd.20 class ssd
device 21 osd.21 class ssd
device 22 osd.22 class ssd
device 23 osd.23 class ssd
device 24 osd.24 class ssd
device 25 osd.25 class ssd
device 26 osd.26 class ssd
device 27 osd.27 class ssd
device 28 osd.28 class ssd
device 29 osd.29 class ssd
device 30 osd.30 class ssd
device 31 osd.31 class ssd
device 32 osd.32 class ssd
device 33 osd.33 class ssd
device 34 osd.34 class ssd
device 35 osd.35 class ssd
device 36 osd.36 class ssd
device 37 osd.37 class ssd
device 38 osd.38 class ssd
device 39 osd.39 class ssd
device 40 osd.40 class ssd
device 41 osd.41 class ssd
device 42 osd.42 class ssd
device 43 osd.43 class ssd
device 44 osd.44 class ssd
device 45 osd.45 class ssd
device 46 osd.46 class ssd
device 47 osd.47 class ssd
device 49 osd.49 class ssd
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host Pnode01 {
id -8 # do not change unnecessarily
id -9 class ssd # do not change unnecessarily
# weight 0.000
alg straw2
hash 0 # rjenkins1
}
host node01 {
id -2 # do not change unnecessarily
id -15 class ssd # do not change unnecessarily
# weight 14.537
alg straw2
hash 0 # rjenkins1
item osd.4 weight 1.817
item osd.1 weight 1.817
item osd.3 weight 1.817
item osd.2 weight 1.817
item osd.6 weight 1.817
item osd.9 weight 1.817
item osd.5 weight 1.817
item osd.0 weight 1.818
}
host node02 {
id -3 # do not change unnecessarily
id -16 class ssd # do not change unnecessarily
# weight 14.536
alg straw2
hash 0 # rjenkins1
item osd.10 weight 1.817
item osd.11 weight 1.817
item osd.12 weight 1.817
item osd.13 weight 1.817
item osd.14 weight 1.817
item osd.15 weight 1.817
item osd.16 weight 1.817
item osd.19 weight 1.817
}
host node03 {
id -4 # do not change unnecessarily
id -17 class ssd # do not change unnecessarily
# weight 14.536
alg straw2
hash 0 # rjenkins1
item osd.20 weight 1.817
item osd.21 weight 1.817
item osd.22 weight 1.817
item osd.23 weight 1.817
item osd.25 weight 1.817
item osd.26 weight 1.817
item osd.29 weight 1.817
item osd.24 weight 1.817
}
datacenter EDM1 {
id -11 # do not change unnecessarily
id -14 class ssd # do not change unnecessarily
# weight 43.609
alg straw
hash 0 # rjenkins1
item node01 weight 14.537
item node02 weight 14.536
item node03 weight 14.536
}
host node04 {
id -5 # do not change unnecessarily
id -18 class ssd # do not change unnecessarily
# weight 14.536
alg straw2
hash 0 # rjenkins1
item osd.30 weight 1.817
item osd.31 weight 1.817
item osd.32 weight 1.817
item osd.33 weight 1.817
item osd.34 weight 1.817
item osd.35 weight 1.817
item osd.36 weight 1.817
item osd.39 weight 1.817
}
host node05 {
id -6 # do not change unnecessarily
id -19 class ssd # do not change unnecessarily
# weight 14.536
alg straw2
hash 0 # rjenkins1
item osd.40 weight 1.817
item osd.41 weight 1.817
item osd.42 weight 1.817
item osd.43 weight 1.817
item osd.44 weight 1.817
item osd.45 weight 1.817
item osd.46 weight 1.817
item osd.49 weight 1.817
}
host node06 {
id -7 # do not change unnecessarily
id -20 class ssd # do not change unnecessarily
# weight 16.353
alg straw2
hash 0 # rjenkins1
item osd.47 weight 1.817
item osd.37 weight 1.817
item osd.27 weight 1.817
item osd.38 weight 1.817
item osd.7 weight 1.817
item osd.28 weight 1.817
item osd.8 weight 1.817
item osd.17 weight 1.817
item osd.18 weight 1.817
}
datacenter EDM3 {
id -12 # do not change unnecessarily
id -13 class ssd # do not change unnecessarily
# weight 45.425
alg straw
hash 0 # rjenkins1
item node04 weight 14.536
item node05 weight 14.536
item node06 weight 16.353
}
datacenter EDM2 {
id -10 # do not change unnecessarily
id -22 class ssd # do not change unnecessarily
# weight 0.000
alg straw
hash 0 # rjenkins1
}
root default {
id -1 # do not change unnecessarily
id -21 class ssd # do not change unnecessarily
# weight 89.034
alg straw2
hash 0 # rjenkins1
item Pnode01 weight 0.000
item EDM1 weight 43.609
item EDM3 weight 45.425
item EDM2 weight 0.000
}
# rules
rule replicated_ruleset {
id 0
type replicated
min_size 1
max_size 10
step take default
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx