Hi all,
I want to understand what Ceph does when several OSDs are down. First of all,
some words about our setup:
We have 5 monitors and 12 OSD servers, each with 60x2TB disks. These servers
are spread across 4 racks in our datacenter; every rack holds 3 OSD servers.
We have a replication factor of 4 and a CRUSH rule applied that says "step
chooseleaf firstn 0 type rack". So, in my opinion, every rack should hold a
copy of all the data in our Ceph cluster. Is that more or less correct?
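For reference, a rack-level replication rule with that step looks roughly like
this (the rule name and numeric ids here are placeholders, not necessarily what
we have in our crushmap):

```
rule replicated_rack {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type rack
        step emit
}
```

With size = 4 and only 4 racks, "firstn 0" picks one OSD in each rack, so every
rack should end up with one replica of each object.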
So, our cluster is in HEALTH_OK state and I am rebooting one of our OSD
servers, which means 60 of our 720 OSDs go down. Since this hardware takes
quite some time to boot up, we use "mon osd down out subtree limit =
host" to avoid rebalancing when a whole server goes down. Ceph shows this
output for "ceph -s" while the OSDs are down:
health HEALTH_WARN 22227 pgs degraded; 1 pgs peering; 22227 pgs stuck degraded;
       1 pgs stuck inactive; 22228 pgs stuck unclean; 22227 pgs stuck undersized;
       22227 pgs undersized; recovery 623/7420 objects degraded (8.396%);
       60/720 in osds are down
monmap e5: 5 mons at
       {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
       election epoch 228, quorum 0,1,2,3,4
       mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
osdmap e60390: 720 osds: 660 up, 720 in
pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
       3948 GB used, 1304 TB / 1308 TB avail
       623/7420 objects degraded (8.396%)
            45356 active+clean
                1 peering
            22227 active+undersized+degraded
The degraded and undersized pgs are not a problem, since this behaviour is
expected. I am worried about the peering pg (it stays in this state until all
OSDs are up again), since that would cause I/O to that pg to hang if I am not
mistaken.
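For completeness, the reboot-without-rebalance setting mentioned above sits in
our ceph.conf like this (we have it under [global]; [mon] should work as well):

```
[global]
mon osd down out subtree limit = host
```

With this, the monitors will not automatically mark OSDs out when an entire
host goes down, so no backfill starts during the reboot.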
After the host is back up and all OSDs are up and running again, I see this:
health HEALTH_WARN 2 pgs stuck unclean
monmap e5: 5 mons at
{mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
election epoch 228, quorum 0,1,2,3,4
mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
osdmap e60461: 720 osds: 720 up, 720 in
pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
3972 GB used, 1304 TB / 1308 TB avail
2 inactive
67582 active+clean
Without any admin interaction, it stays in this state. I guess these two
inactive pgs will also cause I/O to hang? Some more information:
ceph health detail
HEALTH_WARN 2 pgs stuck unclean
pg 9.f765 is stuck unclean for 858.298811, current state inactive, last acting [91,362,484,553]
pg 9.ea0f is stuck unclean for 963.441117, current state inactive, last acting [91,233,485,524]
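As a side note, when more pgs get stuck it is handy to pull the pg ids and
acting sets out of "ceph health detail" programmatically, e.g. to feed them
into "ceph pg <id> query". A rough Python sketch (the sample text is the
output above; in practice you would pipe in the live command output):

```python
import re

# Sample "ceph health detail" output; replace with live output in practice.
health_detail = """\
HEALTH_WARN 2 pgs stuck unclean
pg 9.f765 is stuck unclean for 858.298811, current state inactive, last acting [91,362,484,553]
pg 9.ea0f is stuck unclean for 963.441117, current state inactive, last acting [91,233,485,524]
"""

# Match lines like: pg 9.f765 is stuck unclean ... last acting [91,362,484,553]
pat = re.compile(r"^pg (\S+) is stuck (\w+).*acting \[([\d,]+)\]", re.M)

# Map each stuck pg id to its acting set of OSD ids.
stuck = {pgid: [int(o) for o in osds.split(",")]
         for pgid, state, osds in pat.findall(health_detail)}
print(stuck)

# OSDs common to all stuck pgs are the first suspects:
common = set.intersection(*(set(v) for v in stuck.values()))
print(common)  # osd.91 appears in both acting sets
```

In our case this immediately points at osd.91 as the OSD both stuck pgs have
in common.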
I tried to give osd.91 a kick with "ceph osd down 91". After the OSD was back
in the cluster:
health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck
unclean
So it got even worse. I then decided to take the OSD out, and the cluster went
back to HEALTH_OK. Bringing the OSD back in triggered some rebalancing, ending
with the cluster in an OK state again.
This actually happens every time some OSDs go down. I don't understand why the
cluster is not able to get back to a healthy state without admin interaction.
In a setup with several hundred OSDs it is normal business that some of them
go down from time to time. Are there any ideas why this is happening? Right
now we do not have much data in our cluster, so I can run some tests. Any
suggestions would be appreciated.