Re: New cluster in unhealthy state

I am seeing the following in the OSD log files:

 

2015-06-22 10:47:53.966056 7f7837cdc700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.2:6802/3018 pipe(0x55ac800 sd=72 :6800 s=0 pgs=0 cs=0 l=0 c=0x4c444c0).accept connect_seq 2 vs existing 1 state standby
2015-06-22 10:47:53.966219 7f7837bdb700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.4:6800/2099 pipe(0x55a8000 sd=74 :6800 s=0 pgs=0 cs=0 l=0 c=0x4c448e0).accept connect_seq 2 vs existing 1 state standby
2015-06-22 10:47:53.966480 7f7837ddd700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.3:6802/4159 pipe(0x55b1000 sd=54 :6800 s=0 pgs=0 cs=0 l=0 c=0x4c44ba0).accept connect_seq 2 vs existing 1 state standby
2015-06-22 11:02:54.066582 7f7837cdc700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.2:6802/3018 pipe(0x55ac800 sd=72 :6800 s=2 pgs=17 cs=3 l=0 c=0x4c44e60).fault with nothing to send, going to standby
2015-06-22 11:02:54.066735 7f7837bdb700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.4:6800/2099 pipe(0x55a8000 sd=74 :6800 s=2 pgs=10 cs=3 l=0 c=0x4c44d00).fault with nothing to send, going to standby
2015-06-22 11:02:54.066783 7f7837ddd700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.3:6802/4159 pipe(0x55b1000 sd=54 :6800 s=2 pgs=11 cs=3 l=0 c=0x4c45280).fault with nothing to send, going to standby
2015-06-22 11:02:55.272711 7f78383e3700  0 -- 10.0.0.2:6800/2787 >> 10.0.0.2:6804/3313 pipe(0x4f82000 sd=47 :57912 s=2 pgs=15 cs=3 l=0 c=0x4c44360).fault with nothing to send, going to standby

 

Dave Durkee

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee
Sent: Monday, June 22, 2015 10:27 AM
To: Nick Fisk; ceph-users@xxxxxxxxxxxxxx
Subject: Re: New cluster in unhealthy state

 

Nick, I removed the failed OSDs, yet the cluster is still in the same state:

 

ceph> status
    cluster b4419183-5320-4701-aae2-eb61e186b443
     health HEALTH_WARN
            32 pgs degraded
            64 pgs stale
            32 pgs stuck degraded
            246 pgs stuck inactive
            64 pgs stuck stale
            310 pgs stuck unclean
            32 pgs stuck undersized
            32 pgs undersized
            pool rbd pg_num 310 > pgp_num 64
     monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
            election epoch 1, quorum 0 mon
     osdmap e82: 9 osds: 9 up, 9 in
      pgmap v196: 310 pgs, 1 pools, 0 bytes data, 0 objects
            303 MB used, 4189 GB / 4189 GB avail
                 246 creating
                  32 stale+active+undersized+degraded
                  32 stale+active+remapped

 

ceph> osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.04997 root default
-2 1.34999     host osd1
 2 0.45000         osd.2       up  1.00000          1.00000
 3 0.45000         osd.3       up  1.00000          1.00000
10 0.45000         osd.10      up  1.00000          1.00000
-3 1.34999     host osd2
 4 0.45000         osd.4       up  1.00000          1.00000
 5 0.45000         osd.5       up  1.00000          1.00000
 6 0.45000         osd.6       up  1.00000          1.00000
-4 1.34999     host osd3
 7 0.45000         osd.7       up  1.00000          1.00000
 8 0.45000         osd.8       up  1.00000          1.00000
 9 0.45000         osd.9       up  1.00000          1.00000

 

ceph> osd pool set rbd pgp_num 310
Error: 16 EBUSY
Status:
currently creating pgs, wait
ceph>
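
In case more detail helps, I can also dump the stuck PGs; if I have the syntax right, something like this should list them by state:

ceph pg dump_stuck inactive
ceph pg dump_stuck stale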

 

Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: Saturday, June 20, 2015 9:17 AM
To: Dave Durkee; ceph-users@xxxxxxxxxxxxxx
Subject: RE: New cluster in unhealthy state

 

Hi Dave,

 

It can’t increase pgp_num because the PGs are still being created. I can see you currently have 2 OSDs down; I'm not 100% certain this is the cause, but you might want to try to get them back online, or remove them if they no longer exist.
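
For reference, the usual sequence to remove an OSD that no longer exists is something like the following (I'm assuming osd.0 and osd.1 from your earlier tree; double-check the IDs before running):

ceph osd out 0               # mark it out (yours already show reweight 0)
ceph osd crush remove osd.0  # remove it from the CRUSH map
ceph auth del osd.0          # delete its cephx key
ceph osd rm 0                # remove it from the OSD map; then repeat for osd.1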

 

Nick

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee
Sent: 19 June 2015 23:39
To: Nick Fisk; ceph-users@xxxxxxxxxxxxxx
Subject: Re: New cluster in unhealthy state

 

ceph> osd pool set rbd pgp_num 310
Error: 16 EBUSY
Status:
currently creating pgs, wait

 

What does the above mean?

 

Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]
Sent: Friday, June 19, 2015 4:02 PM
To: Dave Durkee; ceph-users@xxxxxxxxxxxxxx
Subject: RE: New cluster in unhealthy state

 

Try

ceph osd pool set rbd pgp_num 310
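
That should clear the "pool rbd pg_num 310 > pgp_num 64" warning: pgp_num is the PG count actually used for placement, and it normally has to match pg_num. You can verify afterwards with something like:

ceph osd pool get rbd pg_num
ceph osd pool get rbd pgp_num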

 

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee
Sent: 19 June 2015 22:31
To: ceph-users@xxxxxxxxxxxxxx
Subject: New cluster in unhealthy state

 

I just built a small lab cluster: 1 mon node; 3 OSD nodes, each with 3 Ceph disks and 1 OS/journal disk; an admin VM; and 3 client VMs.

 

I followed the preflight and install instructions, and when I finished adding the OSDs I ran ceph status and got the following:

 

ceph> status
    cluster b4419183-5320-4701-aae2-eb61e186b443
     health HEALTH_WARN
            32 pgs degraded
            64 pgs stale
            32 pgs stuck degraded
            246 pgs stuck inactive
            64 pgs stuck stale
            310 pgs stuck unclean
            32 pgs stuck undersized
            32 pgs undersized
            pool rbd pg_num 310 > pgp_num 64
     monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
            election epoch 2, quorum 0 mon
     osdmap e49: 11 osds: 9 up, 9 in
      pgmap v122: 310 pgs, 1 pools, 0 bytes data, 0 objects
            298 MB used, 4189 GB / 4189 GB avail
                 246 creating
                  32 stale+active+undersized+degraded
                  32 stale+active+remapped

 

ceph> health
HEALTH_WARN 32 pgs degraded; 64 pgs stale; 32 pgs stuck degraded; 246 pgs stuck inactive; 64 pgs stuck stale; 310 pgs stuck unclean; 32 pgs stuck undersized; 32 pgs undersized; pool rbd pg_num 310 > pgp_num 64

 

ceph> quorum_status
{"election_epoch":2,"quorum":[0],"quorum_names":["mon"],"quorum_leader_name":"mon","monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

 

ceph> mon_status
{"name":"mon","rank":0,"state":"leader","election_epoch":2,"quorum":[0],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

 

ceph> osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.94997 root default
-2 2.24998     host osd1
 0 0.45000         osd.0     down        0          1.00000
 1 0.45000         osd.1     down        0          1.00000
 2 0.45000         osd.2       up  1.00000          1.00000
 3 0.45000         osd.3       up  1.00000          1.00000
10 0.45000         osd.10      up  1.00000          1.00000
-3 1.34999     host osd2
 4 0.45000         osd.4       up  1.00000          1.00000
 5 0.45000         osd.5       up  1.00000          1.00000
 6 0.45000         osd.6       up  1.00000          1.00000
-4 1.34999     host osd3
 7 0.45000         osd.7       up  1.00000          1.00000
 8 0.45000         osd.8       up  1.00000          1.00000
 9 0.45000         osd.9       up  1.00000          1.00000

 

 

Admin-node:

[root@admin test-cluster]# cat ceph.conf
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.17.1.16
mon_initial_members = mon
fsid = b4419183-5320-4701-aae2-eb61e186b443
osd pool default size = 2
public network = 172.17.1.0/24
cluster network = 10.0.0.0/24

 

 

How do I diagnose and solve the cluster health issue? Do you need any additional information to help with the diagnosis?
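
If it helps, I can post more output; if I have the commands right, something like the following should give a fuller picture:

ceph health detail
ceph osd dump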

 

Thanks!!

 

Dave




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
