Re: OSD not marked as down or out


 



Hello,

On 20/02/2015 12:26, Sudarshan Pathak wrote:
Hello everyone,

I have a cluster running with OpenStack. It has 6 OSDs (3 at each of 2 different locations). Each pool has a replication size of 3, with 2 copies at the primary location and 1 copy at the secondary location.

Everything is running as expected, but the OSDs are not marked as down when I power off an OSD server. It has been like this for around an hour.
I tried changing the heartbeat settings too.
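
(One way to confirm whether those heartbeat changes actually took effect is to query a running daemon over its admin socket, assuming shell access on the OSD host:

    ceph daemon osd.0 config show | grep heartbeat

As a cross-check, the "cutoff" in the heartbeat_check log lines below trails the log timestamp by exactly the grace period currently in effect.)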

I also had this issue with a new cluster running Giant. Did you check the CRUSH tunables? In my case, forcing the CRUSH profile to "firefly" solved the problem. I found this fix at http://ceph.com/docs/master/rados/operations/crush-map/#tunables
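
For reference, the profile can be switched from any node that has an admin keyring:

    ceph osd crush tunables firefly

Note that changing the tunables rewrites the CRUSH map, so some data movement is to be expected afterwards.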

Regards,
--
Xavier

Can someone point me in the right direction?

OSD 0 log
=========
2015-02-20 16:20:14.009723 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:54.009720)
2015-02-20 16:20:15.009908 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:55.009907)
2015-02-20 16:20:16.010123 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:56.010119)
2015-02-20 16:20:16.648167 7f3fc9a76700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:56.648165)


Ceph monitor log
================
2015-02-20 16:49:16.831548 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.4 192.168.100.35:6800/1305 is reporting failure:1
2015-02-20 16:49:16.831593 7f416e4aa700 0 log_channel(cluster) log [DBG] : osd.2 192.168.100.33:6800/24431 reported failed by osd.4 192.168.100.35:6800/1305
2015-02-20 16:49:17.080314 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.3 192.168.100.34:6800/1358 is reporting failure:1
2015-02-20 16:49:17.080527 7f416e4aa700 0 log_channel(cluster) log [DBG] : osd.2 192.168.100.33:6800/24431 reported failed by osd.3 192.168.100.34:6800/1358
2015-02-20 16:49:17.420859 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.5 192.168.100.36:6800/1359 is reporting failure:1
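
(Since the leader mon.storage1 is clearly receiving failure reports for osd.2, it may be worth checking which failure-detection values the monitor is actually running with, e.g. over its admin socket on the storage1 host:

    ceph daemon mon.storage1 config show | grep mon_osd

This shows the live values rather than whatever is currently in ceph.conf.)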


#ceph osd stat
     osdmap e455: 6 osds: 6 up, 6 in


#ceph -s
    cluster c8a5975f-4c86-4cfe-a91b-fac9f3126afc
     health HEALTH_WARN 528 pgs peering; 528 pgs stuck inactive; 528 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 mons down, quorum 1,2,3,4 storage1,storage2,compute3,compute4
     monmap e1: 5 mons at {admin=192.168.100.39:6789/0,compute3=192.168.100.133:6789/0,compute4=192.168.100.134:6789/0,storage1=192.168.100.120:6789/0,storage2=192.168.100.121:6789/0}, election epoch 132, quorum 1,2,3,4 storage1,storage2,compute3,compute4
     osdmap e455: 6 osds: 6 up, 6 in
      pgmap v48474: 3650 pgs, 19 pools, 27324 MB data, 4420 objects
            82443 MB used, 2682 GB / 2763 GB avail
                3122 active+clean
                 528 remapped+peering
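
To see exactly which PGs are stuck and why, the following standard commands can help:

    ceph health detail
    ceph pg dump_stuck inactive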



Ceph.conf file
==============

[global]
fsid = c8a5975f-4c86-4cfe-a91b-fac9f3126afc
mon_initial_members = admin, storage1, storage2, compute3, compute4
mon_host = 192.168.100.39,192.168.100.120,192.168.100.121,192.168.100.133,192.168.100.134
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

osd pool default size = 3
osd pool default min size = 3

osd pool default pg num = 300
osd pool default pgp num = 300

public network = 192.168.100.0/24

rgw print continue = false
rgw enable ops log = false

mon osd report timeout = 60
mon osd down out interval = 30
mon osd min down reports = 2

osd heartbeat grace = 10
osd mon heartbeat interval = 20
osd mon report interval max = 60
osd mon ack timeout = 15

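A sketch of how such settings could be applied at runtime, assuming an admin keyring is available:

    ceph tell osd.* injectargs '--osd-heartbeat-grace 10'
    ceph tell mon.* injectargs '--mon-osd-min-down-reports 2'

Values injected this way do not persist across restarts, so they still belong in ceph.conf as well.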


Regards,
Sudarshan Pathak


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





