Re: Placement Groups fail on fresh Ceph cluster installation with all OSDs up and in

Hello Vickie,

After changing the size and min_size on all the existing pools, the cluster seems to be working, and I can store objects in it .. but it still reports an unhealthy state:

cluster 17bea68b-1634-4cd1-8b2a-00a60ef4761d
     health HEALTH_WARN 256 pgs degraded; 256 pgs stuck unclean; recovery 1/2 objects degraded (50.000%); pool data pg_num 128 > pgp_num 64
     monmap e1: 1 mons at {ceph-node1=172.31.0.84:6789/0}, election epoch 2, quorum 0 ceph-node1
     osdmap e31: 6 osds: 6 up, 6 in
      pgmap v99: 256 pgs, 3 pools, 10240 kB data, 1 objects
            210 MB used, 18155 MB / 18365 MB avail
            1/2 objects degraded (50.000%)
                 256 active+degraded

I can see some changes like:
1- recovery 1/2 objects degraded (50.000%)
2- 1/2 objects degraded (50.000%)
3- 256 active+degraded

My questions are:
 1- What do those changes mean?
 2- How can changing the replication size cause the cluster to be unhealthy? (a quick inspection sketch follows below)
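A minimal way to dig into these states further, assuming nothing beyond the standard ceph CLI (the <pgid> below is just a placeholder), would be something like:

ceph health detail            # lists each degraded/stuck PG and the reason
ceph pg dump_stuck unclean    # shows the stuck PGs with their acting OSD sets
ceph pg <pgid> query          # detailed state of a single PG, e.g. ceph pg 0.1 query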

Thanks Vickie!
Beanos


On Feb 10, 2015, at 1:28 PM, B L <super.iterator@xxxxxxxxx> wrote:

I changed the size and min_size as you suggested while watching ceph -w in a different window, and I got this:


ceph@ceph-node1:~$ ceph -w
    cluster 17bea68b-1634-4cd1-8b2a-00a60ef4761d
     health HEALTH_WARN 256 pgs incomplete; 256 pgs stuck inactive; 256 pgs stuck unclean; pool data pg_num 128 > pgp_num 64
     monmap e1: 1 mons at {ceph-node1=172.31.0.84:6789/0}, election epoch 2, quorum 0 ceph-node1
     osdmap e25: 6 osds: 6 up, 6 in
      pgmap v82: 256 pgs, 3 pools, 0 bytes data, 0 objects
            198 MB used, 18167 MB / 18365 MB avail
                 192 incomplete
                  64 creating+incomplete

2015-02-10 11:22:24.421000 mon.0 [INF] osdmap e26: 6 osds: 6 up, 6 in
2015-02-10 11:22:24.425906 mon.0 [INF] pgmap v83: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:25.432950 mon.0 [INF] osdmap e27: 6 osds: 6 up, 6 in
2015-02-10 11:22:25.437626 mon.0 [INF] pgmap v84: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:26.449640 mon.0 [INF] osdmap e28: 6 osds: 6 up, 6 in
2015-02-10 11:22:26.454749 mon.0 [INF] pgmap v85: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:27.474113 mon.0 [INF] pgmap v86: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:31.770385 mon.0 [INF] pgmap v87: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:41.695656 mon.0 [INF] osdmap e29: 6 osds: 6 up, 6 in
2015-02-10 11:22:41.700296 mon.0 [INF] pgmap v88: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:42.712288 mon.0 [INF] osdmap e30: 6 osds: 6 up, 6 in
2015-02-10 11:22:42.716877 mon.0 [INF] pgmap v89: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:43.723701 mon.0 [INF] osdmap e31: 6 osds: 6 up, 6 in
2015-02-10 11:22:43.732035 mon.0 [INF] pgmap v90: 256 pgs: 192 incomplete, 64 creating+incomplete; 0 bytes data, 198 MB used, 18167 MB / 18365 MB avail
2015-02-10 11:22:46.774217 mon.0 [INF] pgmap v91: 256 pgs: 256 active+degraded; 0 bytes data, 199 MB used, 18165 MB / 18365 MB avail
2015-02-10 11:23:08.232686 mon.0 [INF] pgmap v92: 256 pgs: 256 active+degraded; 0 bytes data, 200 MB used, 18165 MB / 18365 MB avail
2015-02-10 11:23:27.767358 mon.0 [INF] pgmap v93: 256 pgs: 256 active+degraded; 0 bytes data, 200 MB used, 18165 MB / 18365 MB avail
2015-02-10 11:23:40.769794 mon.0 [INF] pgmap v94: 256 pgs: 256 active+degraded; 0 bytes data, 200 MB used, 18165 MB / 18365 MB avail
2015-02-10 11:23:45.530713 mon.0 [INF] pgmap v95: 256 pgs: 256 active+degraded; 0 bytes data, 200 MB used, 18165 MB / 18365 MB avail



On Feb 10, 2015, at 1:24 PM, B L <super.iterator@xxxxxxxxx> wrote:

I will try to change the replication size now as you suggested .. but how is that related to the cluster being unhealthy?


On Feb 10, 2015, at 1:22 PM, B L <super.iterator@xxxxxxxxx> wrote:

Hi Vickie,

My OSD tree looks like this:

ceph@ceph-node3:/home/ubuntu$ ceph osd tree
# id    weight  type name               up/down reweight
-1      0       root default
-2      0               host ceph-node1
0       0                       osd.0   up      1
1       0                       osd.1   up      1
-3      0               host ceph-node3
2       0                       osd.2   up      1
3       0                       osd.3   up      1
-4      0               host ceph-node2
4       0                       osd.4   up      1
5       0                       osd.5   up      1


On Feb 10, 2015, at 1:18 PM, Vickie ch <mika.leaf666@xxxxxxxxx> wrote:

Hi Beanos:
BTW, if your cluster is just for testing, you may try to reduce the replica size and min_size:
"ceph osd pool set rbd size 2; ceph osd pool set data size 2; ceph osd pool set metadata size 2"
"ceph osd pool set rbd min_size 1; ceph osd pool set data min_size 1; ceph osd pool set metadata min_size 1"
Open another terminal and use the command "ceph -w" to watch the PG status.
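To confirm the new values took effect, you can check the pool settings again; a minimal check (using only the standard ceph CLI) would be something like:

ceph osd pool get data size                # should now report: size: 2
ceph osd pool get data min_size            # should now report: min_size: 1
ceph osd dump | grep 'replicated size'     # shows size/min_size for every pool at once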

Best wishes,
Vickie

2015-02-10 19:16 GMT+08:00 Vickie ch <mika.leaf666@xxxxxxxxx>:
Hi Beanos:
So you have 3 OSD servers, and each of them has 2 disks.
I have a question: what is the result of "ceph osd tree"? It looks like the OSD status may be "down".


Best wishes,
Vickie

2015-02-10 19:00 GMT+08:00 B L <super.iterator@xxxxxxxxx>:
Here is the updated dump, copied and pasted directly:

ceph@ceph-node1:~$ ceph osd dump
epoch 25
fsid 17bea68b-1634-4cd1-8b2a-00a60ef4761d
created 2015-02-08 16:59:07.050875
modified 2015-02-09 22:35:33.191218
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 64 last_change 24 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 6
osd.0 up   in  weight 1 up_from 4 up_thru 17 down_at 0 last_clean_interval [0,0) 172.31.0.84:6800/11739 172.31.0.84:6801/11739 172.31.0.84:6802/11739 172.31.0.84:6803/11739 exists,up 765f5066-d13e-4a9e-a446-8630ee06e596
osd.1 up   in  weight 1 up_from 7 up_thru 0 down_at 0 last_clean_interval [0,0) 172.31.0.84:6805/12279 172.31.0.84:6806/12279 172.31.0.84:6807/12279 172.31.0.84:6808/12279 exists,up e1d073e5-9397-4b63-8b7c-a4064e430f7a
osd.2 up   in  weight 1 up_from 10 up_thru 0 down_at 0 last_clean_interval [0,0) 172.31.3.57:6800/5517 172.31.3.57:6801/5517 172.31.3.57:6802/5517 172.31.3.57:6803/5517 exists,up 5af5deed-7a6d-4251-aa3c-819393901d1f
osd.3 up   in  weight 1 up_from 13 up_thru 0 down_at 0 last_clean_interval [0,0) 172.31.3.57:6805/6043 172.31.3.57:6806/6043 172.31.3.57:6807/6043 172.31.3.57:6808/6043 exists,up 958f37ab-b434-40bd-87ab-3acbd3118f92
osd.4 up   in  weight 1 up_from 16 up_thru 0 down_at 0 last_clean_interval [0,0) 172.31.3.56:6800/5106 172.31.3.56:6801/5106 172.31.3.56:6802/5106 172.31.3.56:6803/5106 exists,up ce5c0b86-96be-408a-8022-6397c78032be
osd.5 up   in  weight 1 up_from 22 up_thru 0 down_at 0 last_clean_interval [0,0) 172.31.3.56:6805/7019 172.31.3.56:6806/7019 172.31.3.56:6807/7019 172.31.3.56:6808/7019 exists,up da67b604-b32a-44a0-9920-df0774ad2ef3


On Feb 10, 2015, at 12:55 PM, B L <super.iterator@xxxxxxxxx> wrote:


On Feb 10, 2015, at 12:37 PM, B L <super.iterator@xxxxxxxxx> wrote:

Hi Vickie,

Thanks for your reply!

You can find the dump in this link:


Thanks!
B.


On Feb 10, 2015, at 12:23 PM, Vickie ch <mika.leaf666@xxxxxxxxx> wrote:

Hi Beanos:
   Would you post the result of "ceph osd dump"?

Best wishes,
Vickie

2015-02-10 16:36 GMT+08:00 B L <super.iterator@xxxxxxxxx>:
I'm having a problem with my fresh, unhealthy cluster; the cluster status summary shows this:

ceph@ceph-node1:~$ ceph -s

    cluster 17bea68b-1634-4cd1-8b2a-00a60ef4761d
     health HEALTH_WARN 256 pgs incomplete; 256 pgs stuck inactive; 256 pgs stuck unclean; pool data pg_num 128 > pgp_num 64
     monmap e1: 1 mons at {ceph-node1=172.31.0.84:6789/0}, election epoch 2, quorum 0 ceph-node1
     osdmap e25: 6 osds: 6 up, 6 in
      pgmap v82: 256 pgs, 3 pools, 0 bytes data, 0 objects
            198 MB used, 18167 MB / 18365 MB avail
                 192 incomplete
                  64 creating+incomplete


Where shall I start troubleshooting this?
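(A minimal first pass at gathering information, assuming only the standard ceph CLI, would be something like the commands below; the ceph -s output is already shown above.)

ceph health detail    # expands the HEALTH_WARN into per-PG detail
ceph osd tree         # shows the CRUSH layout and which OSDs are up/down and their weights
ceph osd dump         # shows each pool's size, min_size, pg_num and pgp_num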

P.S. I’m new to Ceph.

Thanks!
Beanos

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
