Re: Understanding Ceph in case of a failure

Hello,


you do realize that you very much have a corner case setup there, right?

Ceph works best and as expected when you have a replication of 3 and at
least 3 OSD servers, each with enough capacity (space) to handle the loss
of one node.
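
For reference, roughly what that standard setup looks like (just a sketch,
with a hypothetical pool name "rbd"; substitute your own pools):

    # Hypothetical pool name; repeat for each of your pools.
    ceph osd pool set rbd size 3        # 3 copies of every object
    ceph osd pool set rbd min_size 2    # keep serving I/O with one copy missing
    ceph osd crush rule dump            # check that replicas are spread over hosts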

That being said, if you search the archives you'll find that I raised a
similar question a long time ago.

more below.

On Sun, 19 Mar 2017 13:41:38 +0100 Karol Babioch wrote:

[snip]
> 
> Since OSD 20 is down, only OSD 7 remains in the up and acting set. All
> of this is expected. But now the weird part begins. After about five
> minutes or so, the cluster starts massive recovery I/O:
> 
> >     cluster ac1872be-6bd5-4ab2-8ca3-a34faf6dd422
> >      health HEALTH_WARN
> >             289 pgs backfill_wait
> >             8 pgs backfilling
> >             1829 pgs degraded
> >             2788 pgs stuck unclean
> >             1829 pgs undersized
> >             recovery 83556/180556 objects degraded (46.277%)
> >             recovery 83435/180556 objects misplaced (46.210%)
> >             1 mons down, quorum 0,2 max,thales
> >      monmap e3: 3 mons at {max=1.2.3.4:6789/0,moritz=2.3.4.5:6789/0,thales=3.4.5.6:6789/0}
> >             election epoch 76, quorum 0,2 max,thales
> >      osdmap e7163: 22 osds: 11 up, 11 in; 2788 remapped pgs
> >             flags sortbitwise
> >       pgmap v4992040: 2788 pgs, 5 pools, 286 GB data, 90278 objects
> >             523 GB used, 9663 GB / 10186 GB avail
> >             83556/180556 objects degraded (46.277%)
> >             83435/180556 objects misplaced (46.210%)
> >                 1532 active+undersized+degraded
> >                  911 active
> >                  289 active+undersized+degraded+remapped+wait_backfill
> >                   48 active+remapped
> >                    8 active+undersized+degraded+remapped+backfilling
> > recovery io 407 MB/s, 101 objects/s  
> 
> I don't quite understand why it starts to recover at this point and
> what it is trying to achieve. Probably it's aiming for two copies per
> object on the remaining host. The PG dump now shows that OSDs 10 and 7
> are responsible for 4.308:
> 
> > 4.308   0       0       0       0       0       0       0       0       active  2017-03-17 22:44:27.817912      0'0     7093:5  [10]    10      [10,7]  10      0'0     2017-03-17 16:39:25.897737      0'0     2017-03-17 16:39:25.897737  
> 
> This seems odd to me, since the ruleset states two objects per distinct
> host. But what totally confuses me is that the whole recovery process
> gets stuck after a while:
> 
> >     cluster ac1872be-6bd5-4ab2-8ca3-a34faf6dd422
> >      health HEALTH_WARN
> >             1532 pgs degraded
> >             2788 pgs stuck unclean
> >             1532 pgs undersized
> >             recovery 44972/180556 objects degraded (24.908%)
> >             recovery 45306/180556 objects misplaced (25.092%)
> >             1 mons down, quorum 0,2 max,thales
> >      monmap e3: 3 mons at {max=1.2.3.4:6789/0,moritz=2.3.4.5:6789/0,thales=3.4.5.6:6789/0}
> >             election epoch 76, quorum 0,2 max,thales
> >      osdmap e7599: 22 osds: 11 up, 11 in; 2788 remapped pgs
> >             flags sortbitwise
> >       pgmap v4993261: 2788 pgs, 5 pools, 286 GB data, 90278 objects
> >             671 GB used, 9515 GB / 10186 GB avail
> >             44972/180556 objects degraded (24.908%)
> >             45306/180556 objects misplaced (25.092%)
> >                 1532 active+undersized+degraded
> >                  911 active
> >                  345 active+remapped  
> 
> From this point on no recovery is going on anymore. I've waited a couple
> of hours, but to no avail. I don't know what the expected state
> should be, but this seems wrong to me, since it is neither recovered
> nor staying degraded.
>
> 
What you are seeing is not recovery per se (as in Ceph trying to put 2
replicas on the same node), but the result of that host and its OSDs
being removed from the CRUSH map (marked down and out).

The new CRUSH map of course results in different placements for the PGs,
so they get copied (backfilled) to their new primary OSDs.
That is the I/O you're seeing, and it's why it eventually stops.
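
If you want to watch that in detail, something like this (using the PG
from your dump as an example) shows the newly computed "up" set versus
the "acting" set that still holds readable copies:

    ceph pg map 4.308      # current up set and acting set for that PG
    ceph pg 4.308 query    # full detail, including backfill/recovery state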

Due to the state of your cluster and the requested size of 2, the old
primary PG entries do not get removed (in a scenario with 3 nodes they
would).
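
To illustrate why it stops where it does: with a default replicated
ruleset along these lines (a sketch, not necessarily your exact rule),
the chooseleaf step picks at most one OSD per host, so with a single
host left CRUSH can never satisfy size=2 and the PGs stay
active+undersized+degraded:

    rule replicated_ruleset {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            # one replica per distinct host; with only one host up,
            # this can never yield the requested 2 OSDs
            step chooseleaf firstn 0 type host
            step emit
    }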

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


