Ceph on RHEL 7 with multiple OSDs

Loic Dachary <loic at ...> writes:

> 
> Hi,
> 
> It looks like your osd.0 is down and you only have one OSD left (osd.1)
> which would explain why the cluster cannot get to a healthy state. The "size
> 2" in  "pool 0 'data' replicated size 2 ..." means the pool needs at
> least two OSDs up to function properly. Do you know why osd.0 is not up?
> 
> Cheers
> 

I've been trying unsuccessfully to get this up and running since then. I've
added another OSD, but I still can't get to the "active+clean" state. I'm not
even sure whether the problems I'm having are related to the OS version, but
I'm running out of ideas, and unless somebody here can spot something obvious
in the logs below I'm going to try rolling back to CentOS 6.
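
For completeness, these are the extra checks I can run and post output for
if it would help (command names are from memory, so apologies if any are
slightly off):

$ ceph health detail              # list the individual stuck/peering PGs
$ ceph pg dump_stuck unclean      # which PGs are stuck, and on which OSDs
$ ceph pg <pgid> query            # peering detail for one stuck PG
                                  # (substitute a PG id from dump_stuck)
$ ceph osd tree                   # confirm CRUSH sees all three hosts
$ ceph osd pool get data size     # double-check the replication settings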

$ echo "HEALTH" && ceph health && echo "STATUS" && ceph status && echo
"OSD_DUMP" && ceph osd dump
HEALTH
HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
STATUS
    cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
     health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
     monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2, quorum
     0 hp09
     osdmap e43: 3 osds: 3 up, 3 in
      pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
            15469 MB used, 368 GB / 383 GB avail
                 129 peering
                  63 active+clean
OSD_DUMP
epoch 43
fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
created 2014-09-09 10:42:35.490711
modified 2014-09-09 10:47:25.077178
flags 
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45
stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 up   in  weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval
[0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988 10.119.16.14:6802/24988
10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
osd.1 up   in  weight 1 up_from 38 up_thru 42 down_at 36 last_clean_interval
[7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999
10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up
8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
osd.2 up   in  weight 1 up_from 42 up_thru 42 down_at 40 last_clean_interval
[11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605
10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up
5d398bba-59f5-41f8-9bd6-aed6a0204656

Sample of warnings from monitor log:
2014-09-09 10:51:10.636325 7f75037d0700  1 mon.hp09 at 0(leader).osd e72
prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2
10.119.16.16:6800/25605 is reporting failure:1
2014-09-09 10:51:10.636343 7f75037d0700  0 log [DBG] : osd.1
10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605
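
Since osd.2 keeps reporting osd.1 as failed, my next step is to verify
basic TCP connectivity between the OSD hosts on the ports from the osd
dump above, along these lines:

$ telnet 10.119.16.15 6800        # from the osd.2 host: can we reach
                                  # osd.1's public port at all?
$ ss -tlnp | grep ceph-osd        # on the osd.1 host: is the daemon
                                  # actually listening on its 680x ports?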

Sample of warnings from osd.2 log:
2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no reply
from osd.1 ever on either front or back, first ping sent 2014-09-09
10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
2014-09-09 10:44:13.724883 7fb81f2f9700  0 log [WRN] : map e19 wrongly marked
me down
2014-09-09 10:44:13.726104 7fb81f2f9700  0 osd.2 19 crush map has features
1107558400, adjusting msgr requires for mons
2014-09-09 10:44:13.726741 7fb811edb700  0 -- 10.119.16.16:0/25605 >>
10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1
c=0x3ad8580).fault
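
One thing I haven't ruled out yet is firewalld, which RHEL 7 enables by
default and which would explain OSD heartbeats never getting a reply.
Something like the following on each OSD host should show whether the
Ceph OSD port range (6800-7300/tcp) is open, and open it if not,
assuming the firewall really is the culprit here:

$ sudo firewall-cmd --state                               # is firewalld running?
$ sudo firewall-cmd --list-all                            # what is open in the default zone?
$ sudo firewall-cmd --add-port=6800-7300/tcp --permanent  # open the OSD/heartbeat range
$ sudo firewall-cmd --reload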




