Ok, some things to check/confirm:

- Make sure all your networking is ok; we have seen lots of problems related to jumbo frames not being correctly configured across nodes/switches. Test by pinging with large packets between hosts, on both the public and the cluster network. (A rough command sketch for these checks is at the bottom of this thread.)
- Run ceph health detail. Does it show anything interesting?
- Is your pool definitely a 2-way replication pool?
- Run a ceph pg dump. Can you see a pattern amongst the PGs that have problems?

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee

Nick, I removed the failed OSDs, yet I am still in the same state?

ceph> status
    cluster b4419183-5320-4701-aae2-eb61e186b443
     health HEALTH_WARN
            32 pgs degraded
            64 pgs stale
            32 pgs stuck degraded
            246 pgs stuck inactive
            64 pgs stuck stale
            310 pgs stuck unclean
            32 pgs stuck undersized
            32 pgs undersized
            pool rbd pg_num 310 > pgp_num 64
     monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
            election epoch 1, quorum 0 mon
     osdmap e82: 9 osds: 9 up, 9 in
      pgmap v196: 310 pgs, 1 pools, 0 bytes data, 0 objects
            303 MB used, 4189 GB / 4189 GB avail
                 246 creating
                  32 stale+active+undersized+degraded
                  32 stale+active+remapped

ceph> osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.04997 root default
-2 1.34999     host osd1
 2 0.45000         osd.2        up  1.00000          1.00000
 3 0.45000         osd.3        up  1.00000          1.00000
10 0.45000         osd.10       up  1.00000          1.00000
-3 1.34999     host osd2
 4 0.45000         osd.4        up  1.00000          1.00000
 5 0.45000         osd.5        up  1.00000          1.00000
 6 0.45000         osd.6        up  1.00000          1.00000
-4 1.34999     host osd3
 7 0.45000         osd.7        up  1.00000          1.00000
 8 0.45000         osd.8        up  1.00000          1.00000
 9 0.45000         osd.9        up  1.00000          1.00000

ceph> osd pool set rbd pgp_num 310
Error: 16 EBUSY
Status: currently creating pgs, wait
ceph>

Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]

Hi Dave,

It can't increase pgp_num because the PGs are still being created. I can see you currently have 2 OSDs down. I'm not 100% certain this is the cause, but you might want to try to get them back online, or remove them if they no longer exist.

Nick

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee

ceph> osd pool set rbd pgp_num 310
Error: 16 EBUSY
Status: currently creating pgs, wait

What does the above mean?

Dave Durkee

From: Nick Fisk [mailto:nick@xxxxxxxxxx]

Try ceph osd pool set rbd pgp_num 310
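For context on the pgp_num exchange above, here is a minimal command sketch. It assumes the standard ceph CLI and the pool name rbd from this thread; it is not a transcript from Dave's cluster.

    # Current values for the pool
    ceph osd pool get rbd pg_num
    ceph osd pool get rbd pgp_num

    # pgp_num is raised to match pg_num so the newly split PGs actually get
    # remapped across the OSDs; while PGs are still in "creating" this is
    # refused with the EBUSY "currently creating pgs, wait" error seen above
    ceph osd pool set rbd pgp_num 310

    # Watch creation/peering progress before retrying
    ceph status
    ceph -w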
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Dave Durkee

I just built a small lab cluster: 1 mon node, 3 osd nodes with 3 ceph disks and 1 os/journal disk, an admin vm and 3 client VMs. I followed the preflight and install instructions, and when I finished adding the OSDs I ran ceph status and got the following:

ceph> status
    cluster b4419183-5320-4701-aae2-eb61e186b443
     health HEALTH_WARN
            32 pgs degraded
            64 pgs stale
            32 pgs stuck degraded
            246 pgs stuck inactive
            64 pgs stuck stale
            310 pgs stuck unclean
            32 pgs stuck undersized
            32 pgs undersized
            pool rbd pg_num 310 > pgp_num 64
     monmap e1: 1 mons at {mon=172.17.1.16:6789/0}
            election epoch 2, quorum 0 mon
     osdmap e49: 11 osds: 9 up, 9 in
      pgmap v122: 310 pgs, 1 pools, 0 bytes data, 0 objects
            298 MB used, 4189 GB / 4189 GB avail
                 246 creating
                  32 stale+active+undersized+degraded
                  32 stale+active+remapped

ceph> health
HEALTH_WARN 32 pgs degraded; 64 pgs stale; 32 pgs stuck degraded; 246 pgs stuck inactive; 64 pgs stuck stale; 310 pgs stuck unclean; 32 pgs stuck undersized; 32 pgs undersized; pool rbd pg_num 310 > pgp_num 64

ceph> quorum_status
{"election_epoch":2,"quorum":[0],"quorum_names":["mon"],"quorum_leader_name":"mon","monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

ceph> mon_status
{"name":"mon","rank":0,"state":"leader","election_epoch":2,"quorum":[0],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}}

ceph> osd tree
ID WEIGHT  TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 4.94997 root default
-2 2.24998     host osd1
 0 0.45000         osd.0      down        0          1.00000
 1 0.45000         osd.1      down        0          1.00000
 2 0.45000         osd.2        up  1.00000          1.00000
 3 0.45000         osd.3        up  1.00000          1.00000
10 0.45000         osd.10       up  1.00000          1.00000
-3 1.34999     host osd2
 4 0.45000         osd.4        up  1.00000          1.00000
 5 0.45000         osd.5        up  1.00000          1.00000
 6 0.45000         osd.6        up  1.00000          1.00000
-4 1.34999     host osd3
 7 0.45000         osd.7        up  1.00000          1.00000
 8 0.45000         osd.8        up  1.00000          1.00000
 9 0.45000         osd.9        up  1.00000          1.00000

Admin-node:

[root@admin test-cluster]# cat ceph.conf
[global]
auth_service_required = cephx
filestore_xattr_use_omap = true
auth_client_required = cephx
auth_cluster_required = cephx
mon_host = 172.17.1.16
mon_initial_members = mon
fsid = b4419183-5320-4701-aae2-eb61e186b443
osd pool default size = 2
public network = 172.17.1.0/24
cluster network = 10.0.0.0/24

How do I diagnose and solve the cluster health issue? Do you need any additional information to help with the diagnostic process?

Thanks!!

Dave
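For anyone hitting the same situation, here is a rough sketch of the checks suggested at the top of this thread, plus the usual cleanup for OSDs that no longer exist. It assumes a standard Linux ping and the ceph CLI of this era; the target addresses and the 9000-byte MTU are placeholder values, not taken from Dave's cluster.

    # 1) Jumbo-frame sanity check: ping with a large payload and the
    #    don't-fragment bit set, across both the public (172.17.1.0/24) and
    #    the cluster (10.0.0.0/24) network. 8972 bytes = 9000-byte MTU minus
    #    20 bytes IP header and 8 bytes ICMP header. Addresses are examples.
    ping -M do -s 8972 -c 3 172.17.1.21
    ping -M do -s 8972 -c 3 10.0.0.21

    # 2) Detailed health, plus the stuck PGs grouped by state, to look for a
    #    pattern (e.g. all problem PGs mapping to the same OSDs or host).
    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean
    ceph pg dump pgs_brief | grep -v active+clean

    # 3) Confirm the pool really is 2-way replicated; ceph.conf above sets
    #    "osd pool default size = 2", but the pool's own value is what counts.
    ceph osd pool get rbd size

    # 4) The cleanup Dave describes ("I removed the failed OSDs") usually
    #    amounts to something like this, per OSD id, for OSDs that no
    #    longer exist:
    ceph osd out 0
    ceph osd crush remove osd.0
    ceph auth del osd.0
    ceph osd rm 0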