Hi,

On 01/09/2013 01:53 AM, Chen, Xiaoxi wrote:
> Hi,
> Setting rep size to 3 only makes the data triple-replicated; that means that when you "fail" all OSDs in 2 out of 3 DCs, the data is still accessible.
> But the monitors are another story: a monitor cluster with 2N+1 nodes requires at least N+1 nodes alive, and indeed this is why your Ceph failed.
> It looks to me like this requirement makes it hard to design a deployment that is robust against a DC outage, but I'm hoping for input from the community on how to make the monitor cluster reliable.
>

From what I understand he didn't kill the second mon, still leaving 2 out of 3 mons running.

Could you check if your PGs are actually mapped to OSDs spread out over the 3 DCs? "ceph pg dump" should tell you to which OSDs the PGs are mapped.

I've never tried this before, but you don't have equal weights for the datacenters, so I don't know how that affects the situation.

Wido

> Xiaoxi
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Moore, Shawn M
> Sent: January 9, 2013 4:21
> To: ceph-devel@xxxxxxxxxxxxxxx
> Subject: Crushmap Design Question
>
> I have been testing ceph for a little over a month now. Our design goal is to have 3 datacenters in different buildings, all tied together over 10GbE. Currently there are 10 servers, each serving 1 osd, in 2 of the datacenters. In the third is one large server with 16 SAS disks serving 8 osds. Eventually we will add one more identical large server to the third datacenter. I have told ceph to keep 3 copies and tried to design the crushmap in such a way that, as long as a majority of mons can stay up, we could run off of one datacenter's worth of osds. In my testing it doesn't work out quite this way...
>
> Everything is currently ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
>
> I will put the hopefully relevant files at the end of this email.
>
> When all 28 osds are up, I get:
> 2013-01-08 13:56:07.435914 mon.0 [INF] pgmap v2712076: 7104 pgs: 7104 active+clean; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> When I fail a datacenter (including 1 of 3 mons) I eventually get:
> 2013-01-08 13:58:54.020477 mon.0 [INF] pgmap v2712139: 7104 pgs: 7104 active+degraded; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail; 16362/49086 degraded (33.333%)
>
> At this point everything is still ok. But when I fail the 2nd datacenter (still leaving 2 out of 3 mons running) I get:
> 2013-01-08 14:01:25.600056 mon.0 [INF] pgmap v2712189: 7104 pgs: 7104 incomplete; 60264 MB data, 137 GB used, 13570 GB / 14146 GB avail
>
> Most VMs quit working; "rbd ls" still works, but "rados -p rbd ls" does not return a single line and the command hangs. After a while (you can see from the timestamps) I end up here, and it stays this way:
> 2013-01-08 14:40:54.030370 mon.0 [INF] pgmap v2713794: 7104 pgs: 213 active, 117 active+remapped, 3660 incomplete, 3108 active+degraded+remapped, 6 remapped+incomplete; 60264 MB data, 65701 MB used, 4604 GB / 4768 GB avail; 7696/49086 degraded (15.679%)
>
> I'm hoping I've done something wrong, so please advise. Below are my configs. If you need something more to help, just ask.
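A quick note on the monitor math discussed above: with 3 monitors, quorum needs floor(3/2)+1 = 2, so losing one datacenter's monitor is tolerable while losing two is not. A minimal way to confirm which monitors are actually in quorum at each step, using standard ceph CLI commands (output details vary by release):

   ceph mon stat        # one-line summary of the monmap and current quorum
   ceph quorum_status   # JSON dump of the quorum members and election epoch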
> Normal output with all datacenters up:
>
> # ceph osd tree
> # id    weight  type name       up/down reweight
> -1      80      root default
> -3      36              datacenter hok
> -2      1                       host blade151
> 0       1                               osd.0   up      1
> -4      1                       host blade152
> 1       1                               osd.1   up      1
> -15     1                       host blade153
> 2       1                               osd.2   up      1
> -17     1                       host blade154
> 3       1                               osd.3   up      1
> -18     1                       host blade155
> 4       1                               osd.4   up      1
> -19     1                       host blade159
> 5       1                               osd.5   up      1
> -20     1                       host blade160
> 6       1                               osd.6   up      1
> -21     1                       host blade161
> 7       1                               osd.7   up      1
> -22     1                       host blade162
> 8       1                               osd.8   up      1
> -23     1                       host blade163
> 9       1                               osd.9   up      1
> -24     36              datacenter csc
> -5      1                       host admbc0-01
> 10      1                               osd.10  up      1
> -6      1                       host admbc0-02
> 11      1                               osd.11  up      1
> -7      1                       host admbc0-03
> 12      1                               osd.12  up      1
> -8      1                       host admbc0-04
> 13      1                               osd.13  up      1
> -9      1                       host admbc0-05
> 14      1                               osd.14  up      1
> -10     1                       host admbc0-06
> 15      1                               osd.15  up      1
> -11     1                       host admbc0-09
> 16      1                               osd.16  up      1
> -12     1                       host admbc0-10
> 17      1                               osd.17  up      1
> -13     1                       host admbc0-11
> 18      1                               osd.18  up      1
> -14     1                       host admbc0-12
> 19      1                               osd.19  up      1
> -25     8               datacenter adm
> -16     8                       host admdisk0
> 20      1                               osd.20  up      1
> 21      1                               osd.21  up      1
> 22      1                               osd.22  up      1
> 23      1                               osd.23  up      1
> 24      1                               osd.24  up      1
> 25      1                               osd.25  up      1
> 26      1                               osd.26  up      1
> 27      1                               osd.27  up      1
>
>
> Showing copies set to 3:
>
> # ceph osd dump | grep " size "
> pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 63 owner 0 crash_replay_interval 45
> pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 65 owner 0
> pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 2368 pgp_num 2368 last_change 6061 owner 0
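An aside on Wido's question above about whether the PGs really span all three datacenters: given the tree above (osd.0-9 in hok, osd.10-19 in csc, osd.20-27 in adm), every PG's up/acting set should contain one OSD from each of those ranges. A rough way to check on the live cluster; the pg id 2.1a below is only an example, pick any pg id from "ceph pg dump" (whose column layout varies between releases):

   # mapping for a single PG, e.g. one in pool 2 (rbd):
   ceph pg map 2.1a
   # or dump all PGs and scan the bracketed up/acting sets:
   ceph pg dump | less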
>
>
> Crushmap
>
> # begin crush map
>
> # devices
> device 0 osd.0
> device 1 osd.1
> device 2 osd.2
> device 3 osd.3
> device 4 osd.4
> device 5 osd.5
> device 6 osd.6
> device 7 osd.7
> device 8 osd.8
> device 9 osd.9
> device 10 osd.10
> device 11 osd.11
> device 12 osd.12
> device 13 osd.13
> device 14 osd.14
> device 15 osd.15
> device 16 osd.16
> device 17 osd.17
> device 18 osd.18
> device 19 osd.19
> device 20 osd.20
> device 21 osd.21
> device 22 osd.22
> device 23 osd.23
> device 24 osd.24
> device 25 osd.25
> device 26 osd.26
> device 27 osd.27
>
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
>
> # buckets
> host blade151 {
>         id -2           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.0 weight 1.000
> }
> host blade152 {
>         id -4           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.1 weight 1.000
> }
> host blade153 {
>         id -15          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.2 weight 1.000
> }
> host blade154 {
>         id -17          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.3 weight 1.000
> }
> host blade155 {
>         id -18          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.4 weight 1.000
> }
> host blade159 {
>         id -19          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.5 weight 1.000
> }
> host blade160 {
>         id -20          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.6 weight 1.000
> }
> host blade161 {
>         id -21          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.7 weight 1.000
> }
> host blade162 {
>         id -22          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.8 weight 1.000
> }
> host blade163 {
>         id -23          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.9 weight 1.000
> }
> datacenter hok {
>         id -3           # do not change unnecessarily
>         # weight 10.000
>         alg straw
>         hash 0  # rjenkins1
>         item blade151 weight 1.000
>         item blade152 weight 1.000
>         item blade153 weight 1.000
>         item blade154 weight 1.000
>         item blade155 weight 1.000
>         item blade159 weight 1.000
>         item blade160 weight 1.000
>         item blade161 weight 1.000
>         item blade162 weight 1.000
>         item blade163 weight 1.000
> }
> host admbc0-01 {
>         id -5           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.10 weight 1.000
> }
> host admbc0-02 {
>         id -6           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.11 weight 1.000
> }
> host admbc0-03 {
>         id -7           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.12 weight 1.000
> }
> host admbc0-04 {
>         id -8           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.13 weight 1.000
> }
> host admbc0-05 {
>         id -9           # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.14 weight 1.000
> }
> host admbc0-06 {
>         id -10          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.15 weight 1.000
> }
> host admbc0-09 {
>         id -11          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.16 weight 1.000
> }
> host admbc0-10 {
>         id -12          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.17 weight 1.000
> }
> host admbc0-11 {
>         id -13          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.18 weight 1.000
> }
> host admbc0-12 {
>         id -14          # do not change unnecessarily
>         # weight 1.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.19 weight 1.000
> }
> datacenter csc {
>         id -24          # do not change unnecessarily
>         # weight 10.000
>         alg straw
>         hash 0  # rjenkins1
>         item admbc0-01 weight 1.000
>         item admbc0-02 weight 1.000
>         item admbc0-03 weight 1.000
>         item admbc0-04 weight 1.000
>         item admbc0-05 weight 1.000
>         item admbc0-06 weight 1.000
>         item admbc0-09 weight 1.000
>         item admbc0-10 weight 1.000
>         item admbc0-11 weight 1.000
>         item admbc0-12 weight 1.000
> }
> host admdisk0 {
>         id -16          # do not change unnecessarily
>         # weight 8.000
>         alg straw
>         hash 0  # rjenkins1
>         item osd.20 weight 1.000
>         item osd.21 weight 1.000
>         item osd.22 weight 1.000
>         item osd.23 weight 1.000
>         item osd.24 weight 1.000
>         item osd.25 weight 1.000
>         item osd.26 weight 1.000
>         item osd.27 weight 1.000
> }
> datacenter adm {
>         id -25          # do not change unnecessarily
>         # weight 8.000
>         alg straw
>         hash 0  # rjenkins1
>         item admdisk0 weight 8.000
> }
> root default {
>         id -1           # do not change unnecessarily
>         # weight 80.000
>         alg straw
>         hash 0  # rjenkins1
>         item hok weight 36.000
>         item csc weight 36.000
>         item adm weight 8.000
> }
>
> # rules
> rule data {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
> rule metadata {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type datacenter
>         step emit
> }
>
> # end crush map
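For what it's worth, with rep size 3 and "step chooseleaf firstn 0 type datacenter", each PG should end up with exactly one OSD in each of hok, csc and adm, so failing two datacenters leaves only one reachable copy per PG. The placement can also be checked offline by replaying the rule with crushtool; a sketch, assuming the map is exported as shown (--show-mappings exists on newer crushtool builds and the exact flags may differ on 0.56.1):

   # export the crushmap the cluster is actually using and decompile it:
   ceph osd getcrushmap -o crush.bin
   crushtool -d crush.bin -o crush.txt

   # replay ruleset 2 (rbd) for 3 replicas and print the chosen OSD sets;
   # each result should contain one OSD from each datacenter:
   crushtool -i crush.bin --test --rule 2 --num-rep 3 --show-mappings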
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html