Ceph on RHEL 7 with multiple OSD's

Network issue, maybe? Have you checked your firewall settings? iptables changed a bit in EL7 and might have broken the rules you normally use. Try flushing the rules (iptables -F) and see if that fixes things; if it does, you'll need to fix your firewall rules.

I ran into a similar issue on EL7 where the OSDs appeared up and in but were stuck in peering, which turned out to be caused by a few blocked ports.
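On EL7 the default firewall is firewalld rather than raw iptables, so rather than flushing rules you can open the ports Ceph uses permanently. A rough sketch (assuming the standard Ceph ports: 6789/tcp for the monitor, 6800-7300/tcp for OSD and heartbeat traffic; run as root on each node):

```shell
# Open the monitor port on the mon host
firewall-cmd --permanent --add-port=6789/tcp
# Open the OSD/heartbeat port range on the OSD hosts
firewall-cmd --permanent --add-port=6800-7300/tcp
# Apply the permanent rules to the running firewall
firewall-cmd --reload
```

If peering unsticks after `iptables -F` but comes back blocked after a reboot, this is almost certainly the culprit, since firewalld reloads its own rules at boot.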

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of BG
Sent: September-09-14 6:05 AM
To: ceph-users at lists.ceph.com
Subject: Re: Ceph on RHEL 7 with multiple OSD's

Loic Dachary <loic at ...> writes:

> 
> Hi,
> 
> It looks like your osd.0 is down and you only have one OSD left 
> (osd.1), which would explain why the cluster cannot get to a healthy 
> state. The "size 2" in "pool 0 'data' replicated size 2 ..." means 
> the pool needs at least two OSDs up to function properly. Do you know why osd.0 is not up?
> 
> Cheers
> 

I've been trying unsuccessfully to get this up and running since. I've added another OSD but still can't reach the "active+clean" state. I'm not even sure the problems I'm having are related to the OS version, but I'm running out of ideas; unless somebody here can spot something obvious in the logs below, I'm going to try rolling back to CentOS 6.

$ echo "HEALTH" && ceph health && echo "STATUS" && ceph status && echo "OSD_DUMP" && ceph osd dump
HEALTH
HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
STATUS
    cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
     health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
     monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2, quorum
     0 hp09
     osdmap e43: 3 osds: 3 up, 3 in
      pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
            15469 MB used, 368 GB / 383 GB avail
                 129 peering
                  63 active+clean
OSD_DUMP
epoch 43
fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
created 2014-09-09 10:42:35.490711
modified 2014-09-09 10:47:25.077178
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
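As an aside: the dump shows all three pools at size 3 / min_size 2, so with exactly three OSDs every PG needs all three up and mutually reachable before it can go active+clean. If you want to rule replication pressure out while debugging, you could temporarily drop the replica counts with the standard `ceph osd pool set` command (pool names taken from the dump above; this is a debugging sketch, not a recommended production setting):

```shell
# Temporarily reduce replication on the default pools while debugging
for pool in data metadata rbd; do
    ceph osd pool set $pool size 2
    ceph osd pool set $pool min_size 1
done
```

That said, with 129 PGs stuck in peering across 3 up/in OSDs, the symptoms point more at inter-OSD connectivity than at replication settings.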
osd.0 up   in  weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval
[0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988 10.119.16.14:6802/24988
10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
osd.1 up   in  weight 1 up_from 38 up_thru 42 down_at 36 last_clean_interval
[7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999
10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up
8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
osd.2 up   in  weight 1 up_from 42 up_thru 42 down_at 40 last_clean_interval
[11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605
10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up
5d398bba-59f5-41f8-9bd6-aed6a0204656

Sample of warnings from monitor log:
2014-09-09 10:51:10.636325 7f75037d0700  1 mon.hp09 at 0(leader).osd e72 prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2
10.119.16.16:6800/25605 is reporting failure:1
2014-09-09 10:51:10.636343 7f75037d0700  0 log [DBG] : osd.1
10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605

Sample of warnings from osd.2 log:
2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no reply from osd.1 ever on either front or back, first ping sent 2014-09-09
10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
2014-09-09 10:44:13.724883 7fb81f2f9700  0 log [WRN] : map e19 wrongly marked me down
2014-09-09 10:44:13.726104 7fb81f2f9700  0 osd.2 19 crush map has features 1107558400, adjusting msgr requires for mons
2014-09-09 10:44:13.726741 7fb811edb700  0 -- 10.119.16.16:0/25605 >>
10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1 c=0x3ad8580).fault
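Those "no reply from osd.1 ever on either front or back" heartbeat failures and the pipe fault are classic signs of blocked OSD ports. A quick way to check reachability from one OSD host to another, using only bash's built-in /dev/tcp (addresses and ports taken from the osd dump above; adjust to your hosts):

```shell
# From the osd.2 host, probe the ports osd.1 is listening on
for port in 6800 6801 6802 6803; do
    if timeout 2 bash -c "echo > /dev/tcp/10.119.16.15/$port" 2>/dev/null; then
        echo "port $port: open"
    else
        echo "port $port: blocked or closed"
    fi
done
```

If these come back blocked while `iptables -F` on the target host makes them open, the firewall is confirmed as the cause of the peering hang.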



_______________________________________________
ceph-users mailing list
ceph-users at lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

