Version?
-Sam

On Tue, Jan 20, 2015 at 9:45 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Tue, Jan 20, 2015 at 2:40 AM, Christian Eichelmann
> <christian.eichelmann@xxxxxxxx> wrote:
>> Hi all,
>>
>> I want to understand what Ceph does if several OSDs are down. First of all,
>> a few words about our setup:
>>
>> We have 5 monitors and 12 OSD servers, each with 60x 2TB disks. These
>> servers are spread across 4 racks in our datacenter; every rack holds 3
>> OSD servers. We have a replication factor of 4 and a CRUSH rule applied
>> that says "step chooseleaf firstn 0 type rack". So, in my opinion, every
>> rack should hold a copy of all the data in our Ceph cluster. Is that more
>> or less correct?
>>
>> Our cluster is in state HEALTH_OK and I am rebooting one of our OSD
>> servers. That means 60 of 720 OSDs are going down. Since this hardware
>> takes quite some time to boot up, we are using "mon osd down out subtree
>> limit = host" to avoid rebalancing when a whole server goes down. Ceph
>> shows this output of "ceph -s" while the OSDs are down:
>>
>>     health HEALTH_WARN 22227 pgs degraded; 1 pgs peering; 22227 pgs stuck
>>       degraded; 1 pgs stuck inactive; 22228 pgs stuck unclean; 22227 pgs
>>       stuck undersized; 22227 pgs undersized; recovery 623/7420 objects
>>       degraded (8.396%); 60/720 in osds are down
>>     monmap e5: 5 mons at
>>       {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
>>       election epoch 228, quorum 0,1,2,3,4
>>       mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
>>     osdmap e60390: 720 osds: 660 up, 720 in
>>     pgmap v15427437: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
>>           3948 GB used, 1304 TB / 1308 TB avail
>>           623/7420 objects degraded (8.396%)
>>              45356 active+clean
>>                  1 peering
>>              22227 active+undersized+degraded
>>
>> The pgs that are degraded and undersized are not a problem, since this
>> behaviour is expected. I am worried about the peering pg (it stays in this
>> state until all OSDs are up again), since this would cause I/O to hang if
>> I am not mistaken.
>>
>> After the host is back up and all OSDs are up and running again, I see this:
>>
>>     health HEALTH_WARN 2 pgs stuck unclean
>>     monmap e5: 5 mons at
>>       {mon-bs01=10.76.28.160:6789/0,mon-bs02=10.76.28.161:6789/0,mon-bs03=10.76.28.162:6789/0,mon-bs04=10.76.28.8:6789/0,mon-bs05=10.76.28.9:6789/0},
>>       election epoch 228, quorum 0,1,2,3,4
>>       mon-bs04,mon-bs05,mon-bs01,mon-bs02,mon-bs03
>>     osdmap e60461: 720 osds: 720 up, 720 in
>>     pgmap v15427555: 67584 pgs, 2 pools, 7253 MB data, 1855 objects
>>           3972 GB used, 1304 TB / 1308 TB avail
>>                  2 inactive
>>              67582 active+clean
>>
>> Without any interaction, it will stay in this state. I guess these two
>> inactive pgs will also cause I/O to hang? Some more information:
>>
>>     ceph health detail
>>     HEALTH_WARN 2 pgs stuck unclean
>>     pg 9.f765 is stuck unclean for 858.298811, current state inactive,
>>       last acting [91,362,484,553]
>>     pg 9.ea0f is stuck unclean for 963.441117, current state inactive,
>>       last acting [91,233,485,524]
>>
>> I tried to give osd.91 a kick with "ceph osd down 91".
>>
>> After the OSD is back in the cluster:
>>
>>     health HEALTH_WARN 3 pgs peering; 54 pgs stuck inactive; 57 pgs stuck
>>       unclean
>>
>> So even worse. I decided to take the OSD out. The cluster goes back to
>> HEALTH_OK. Bringing the OSD back in, the cluster does some rebalancing,
>> ending with the cluster in an OK state again.
>>
>> That actually happens every time some OSDs go down. I don't understand
>> why the cluster is not able to get back to a healthy state without admin
>> interaction. In a setup with several hundred OSDs it is normal business
>> that some of them go down from time to time. Are there any ideas why this
>> is happening? Right now, we do not have much data in our cluster, so I
>> can do some tests. Any suggestions would be appreciated.
>
> Have you done any digging into the state of the PGs reported as
> peering or inactive or whatever when this pops up? Running pg_query,
> looking at their calculated and acting sets, etc.
>
> I suspect it's more likely you're exposing a reporting bug with stale
> data, rather than actually stuck PGs, but it would take more
> information to check that out.
> -Greg
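A rough sketch of the digging Greg suggests, using the PG ids from Christian's
"ceph health detail" output above. The jq filters are only illustrative: they
assume jq is installed and that the pg query JSON exposes the usual
state/up/acting/recovery_state fields; adjust to whatever the running release
actually emits.

    # Full peering/recovery state for one of the stuck PGs
    ceph pg 9.f765 query > /tmp/pg_9.f765.json

    # The interesting parts: current state, up set, acting set, and the
    # recovery_state section, which records what peering is waiting on
    jq '.state, .up, .acting, .recovery_state' /tmp/pg_9.f765.json

    # Everything the monitors consider stuck, with their acting sets
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # Sam's question: the locally installed version, and what osd.91
    # itself reports
    ceph -v
    ceph tell osd.91 version

Comparing the up and acting sets from pg query against what "ceph health
detail" claims is the quickest way to tell a genuinely stuck PG from the
stale-reporting case Greg suspects.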
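To sanity-check the rack-level replication Christian describes (one copy per
rack via "step chooseleaf firstn 0 type rack"), the CRUSH map can also be
decompiled and the rule test-mapped offline. The rule id 1 below is only a
placeholder; the real id comes from "ceph osd crush rule dump".

    # Dump and decompile the CRUSH map for inspection
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

    # Simulate placements for 4 replicas; each mapping should pick
    # OSDs from four different racks
    crushtool --test -i /tmp/crushmap --rule 1 --num-rep 4 --show-mappings | head

    # Cross-check a specific PG's up/acting set against the OSD tree
    ceph pg map 9.ea0f
    ceph osd tree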