Might be good if you can attach the full decompiled crushmap so we can see exactly how things are listed/set up.

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Karol Babioch
Sent: 19 March 2017 20:42
To: ceph-users@xxxxxxxxxxxxxx
Subject: Understanding Ceph in case of a failure

Hi,

I have a few questions regarding Ceph's behaviour in case of a failure. My setup consists of three monitors and two storage hosts, each of which holds a number of OSDs. Basically it looks like this:

> root@max:~# ceph osd tree
> ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 19.87860 root default
> -2  9.94470     host max
>  0  0.90399         osd.0        up  1.00000          1.00000
>  2  0.90399         osd.2        up  1.00000          1.00000
>  3  0.90399         osd.3        up  1.00000          1.00000
>  5  0.90399         osd.5        up  1.00000          1.00000
>  6  0.90399         osd.6        up  1.00000          1.00000
>  7  0.90399         osd.7        up  1.00000          1.00000
>  8  0.90399         osd.8        up  1.00000          1.00000
>  9  0.90399         osd.9        up  1.00000          1.00000
> 10  0.90399         osd.10       up  1.00000          1.00000
>  4  0.90439         osd.4        up  1.00000          1.00000
>  1  0.90439         osd.1        up  1.00000          1.00000
> -3  9.93390     host moritz
> 12  0.90399         osd.12       up  1.00000          1.00000
> 13  0.90399         osd.13       up  1.00000          1.00000
> 14  0.90399         osd.14       up  1.00000          1.00000
> 15  0.90399         osd.15       up  1.00000          1.00000
> 16  0.90399         osd.16       up  1.00000          1.00000
> 17  0.90399         osd.17       up  1.00000          1.00000
> 18  0.90399         osd.18       up  1.00000          1.00000
> 20  0.90399         osd.20       up  1.00000          1.00000
> 21  0.90399         osd.21       up  1.00000          1.00000
> 11  0.90439         osd.11       up  1.00000          1.00000
> 22  0.89359         osd.22       up  1.00000          1.00000

Then there are a bunch of pools, all of which have size = 2 and min_size = 1 set, along with the default ruleset, so there should be one copy of each object per host:

> root@max:~# ceph osd lspools
> 0 rbd,1 virtstorage,3 virtlock,4 virtmetalock,11 virtlog,

All of this works great and I have a reasonable understanding of how everything fits together as long as the status is HEALTH_OK. For instance, using "ceph pg dump" I can get the current PG <-> OSD mapping. In the HEALTH_OK case it looks something like this:

> 4.308 0 0 0 0 0 0 0 0 active+clean 2017-03-17 16:39:25.897857 0'0 7087:374 [20,7] 20 [20,7] 20 0'0 2017-03-17 16:39:25.897737 0'0 2017-03-17 16:39:25.897737

To my understanding this means that PG 308 in pool 4 (virtmetalock) is mapped to OSDs 20 and 7 (20 being the primary), both of which are up and running.

Now, I don't quite understand how Ceph behaves when one of the two storage hosts dies. For example, if I shut down moritz, the status looks like this:

>     cluster ac1872be-6bd5-4ab2-8ca3-a34faf6dd422
>      health HEALTH_WARN
>             2788 pgs degraded
>             2776 pgs stuck unclean
>             2788 pgs undersized
>             recovery 90278/180556 objects degraded (50.000%)
>             11/22 in osds are down
>             1 mons down, quorum 0,2 max,thales
>      monmap e3: 3 mons at {max=1.2.3.4:6789/0,moritz=2.3.4.5:6789/0,thales=3.4.5.6:6789/0}
>             election epoch 76, quorum 0,2 max,thales
>      osdmap e7089: 22 osds: 11 up, 22 in; 2788 remapped pgs
>             flags sortbitwise
>       pgmap v4991853: 2788 pgs, 5 pools, 286 GB data, 90278 objects
>             1007 GB used, 19355 GB / 20362 GB avail
>             90278/180556 objects degraded (50.000%)
>                 2788 active+undersized+degraded

Ceph correctly determines that half of the OSDs are offline and hence half of the objects are degraded. The dump for the PG mentioned above now looks like this:

> 4.308 0 0 0 0 0 0 0 0 active+undersized+degraded 2017-03-17 22:39:22.146019 0'0 7089:374 [7] 7 [7] 7 0'0 2017-03-17 16:39:25.897737 0'0 2017-03-17 16:39:25.897737

Since OSD 20 is down, only OSD 7 remains in the up and acting sets. All of this is expected.
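In case it is useful, this is roughly how I would pull the rule definition, the per-pool replication settings and a single PG mapping (standard commands only, shown as a sketch; the pool name, PG ID and file names are just examples from my setup):

  # Dump the CRUSH rules, or decompile the full CRUSH map to a text file
  ceph osd crush rule dump
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt

  # Replication settings of one of the pools
  ceph osd pool get virtmetalock size
  ceph osd pool get virtmetalock min_size

  # Up and acting set of the example PG
  ceph pg map 4.308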
But now the weird part begins. After about five minutes or so, the cluster starts massive recovery I/O:

>     cluster ac1872be-6bd5-4ab2-8ca3-a34faf6dd422
>      health HEALTH_WARN
>             289 pgs backfill_wait
>             8 pgs backfilling
>             1829 pgs degraded
>             2788 pgs stuck unclean
>             1829 pgs undersized
>             recovery 83556/180556 objects degraded (46.277%)
>             recovery 83435/180556 objects misplaced (46.210%)
>             1 mons down, quorum 0,2 max,thales
>      monmap e3: 3 mons at {max=1.2.3.4:6789/0,moritz=2.3.4.5:6789/0,thales=3.4.5.6:6789/0}
>             election epoch 76, quorum 0,2 max,thales
>      osdmap e7163: 22 osds: 11 up, 11 in; 2788 remapped pgs
>             flags sortbitwise
>       pgmap v4992040: 2788 pgs, 5 pools, 286 GB data, 90278 objects
>             523 GB used, 9663 GB / 10186 GB avail
>             83556/180556 objects degraded (46.277%)
>             83435/180556 objects misplaced (46.210%)
>                 1532 active+undersized+degraded
>                  911 active
>                  289 active+undersized+degraded+remapped+wait_backfill
>                   48 active+remapped
>                    8 active+undersized+degraded+remapped+backfilling
>   recovery io 407 MB/s, 101 objects/s

I don't quite understand why it starts to recover at this point and what it is trying to achieve. Presumably it is aiming for two copies of each object on the remaining host. The PG dump now shows that OSDs 10 and 7 are responsible for 4.308:

> 4.308 0 0 0 0 0 0 0 0 active 2017-03-17 22:44:27.817912 0'0 7093:5 [10] 10 [10,7] 10 0'0 2017-03-17 16:39:25.897737 0'0 2017-03-17 16:39:25.897737

This seems odd to me, since the ruleset is supposed to place the two copies on distinct hosts. But what totally confuses me is that the whole recovery process gets stuck after a while:

>     cluster ac1872be-6bd5-4ab2-8ca3-a34faf6dd422
>      health HEALTH_WARN
>             1532 pgs degraded
>             2788 pgs stuck unclean
>             1532 pgs undersized
>             recovery 44972/180556 objects degraded (24.908%)
>             recovery 45306/180556 objects misplaced (25.092%)
>             1 mons down, quorum 0,2 max,thales
>      monmap e3: 3 mons at {max=1.2.3.4:6789/0,moritz=2.3.4.5:6789/0,thales=3.4.5.6:6789/0}
>             election epoch 76, quorum 0,2 max,thales
>      osdmap e7599: 22 osds: 11 up, 11 in; 2788 remapped pgs
>             flags sortbitwise
>       pgmap v4993261: 2788 pgs, 5 pools, 286 GB data, 90278 objects
>             671 GB used, 9515 GB / 10186 GB avail
>             44972/180556 objects degraded (24.908%)
>             45306/180556 objects misplaced (25.092%)
>                 1532 active+undersized+degraded
>                  911 active
>                  345 active+remapped

From this point on, no recovery takes place anymore. I've waited a couple of hours, but to no avail. I don't know what the expected state should be, but this seems wrong to me, since the cluster neither recovers fully nor simply stays degraded.

When I bring the second storage host back online, all OSDs come up again, the cluster synchronizes itself, and everything is fine and dandy. However, these questions remain unanswered:

- With the default ruleset (and the CRUSH map from above) and only two storage hosts (with many OSDs per host), what is the expected behaviour when one host dies?

- Why does the recovery get stuck in my case at about 25 percent? To me it looks like there is more than enough capacity, so I don't get it.

Hopefully the amount of output doesn't scare you away from answering, but I thought it would be better to give you all of the details rather than trying to describe it in my own words :-).

Thank you!

Best regards,
Karol Babioch
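P.S. I assume the roughly five-minute delay before the recovery I/O starts is the monitors marking the down OSDs "out" (the mon_osd_down_out_interval option, which as far as I know defaults to 300 seconds). A sketch of how I would check or work around that; "mon.max" is one of my monitors, and the admin socket command has to be run on the host where that monitor lives:

  # Show the down -> out timeout (in seconds) of the running monitor
  ceph daemon mon.max config show | grep mon_osd_down_out_interval

  # For planned maintenance, stop OSDs from being marked out at all ...
  ceph osd set noout
  # ... and restore the normal behaviour afterwards
  ceph osd unset noout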