Re: OSD not marked as down or out


 



Hello,

On 20/02/2015 12:26, Sudarshan Pathak wrote:
Hello everyone,

I have a cluster running with OpenStack. It has 6 OSDs (3 at each of 2 different locations). Each pool has a replication size of 3, with 2 copies at the primary location and 1 copy at the secondary location.

Everything is running as expected, but the OSDs are not marked as down when I power off an OSD server. It has been like this for around an hour.
I tried changing the heartbeat settings too.
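
(One way to confirm whether those heartbeat changes actually took effect is to query a running daemon over its admin socket, assuming shell access on the OSD host:

    ceph daemon osd.0 config show | grep heartbeat

As a cross-check, the "cutoff" in the heartbeat_check log lines below trails the log timestamp by exactly the grace period currently in effect.)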

I also had this issue with a new cluster running Giant. Did you check the CRUSH tunables? In my case, forcing the CRUSH profile to "firefly" solved the problem. I found this fix at http://ceph.com/docs/master/rados/operations/crush-map/#tunables
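
For reference, the profile can be switched from any node that has an admin keyring:

    ceph osd crush tunables firefly

Note that changing the tunables rewrites the CRUSH map, so some data movement is to be expected afterwards.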

Regards,
--
Xavier

Can someone point me in the right direction?

OSD 0 log
=========
2015-02-20 16:20:14.009723 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:54.009720)
2015-02-20 16:20:15.009908 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:55.009907)
2015-02-20 16:20:16.010123 7f3fe37d7700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:56.010119)
2015-02-20 16:20:16.648167 7f3fc9a76700 -1 osd.0 451 heartbeat_check: no reply from osd.2 since back 2015-02-20 16:15:54.607854 front 2015-02-20 16:15:54.607854 (cutoff 2015-02-20 16:19:56.648165)


Ceph monitor log
================
2015-02-20 16:49:16.831548 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.4 192.168.100.35:6800/1305 is reporting failure:1
2015-02-20 16:49:16.831593 7f416e4aa700 0 log_channel(cluster) log [DBG] : osd.2 192.168.100.33:6800/24431 reported failed by osd.4 192.168.100.35:6800/1305
2015-02-20 16:49:17.080314 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.3 192.168.100.34:6800/1358 is reporting failure:1
2015-02-20 16:49:17.080527 7f416e4aa700 0 log_channel(cluster) log [DBG] : osd.2 192.168.100.33:6800/24431 reported failed by osd.3 192.168.100.34:6800/1358
2015-02-20 16:49:17.420859 7f416e4aa700 1 mon.storage1@1(leader).osd e455 prepare_failure osd.2 192.168.100.33:6800/24431 from osd.5 192.168.100.36:6800/1359 is reporting failure:1
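
(Since the leader mon.storage1 is clearly receiving failure reports for osd.2, it may be worth checking which failure-detection values the monitor is actually running with, e.g. over its admin socket on the storage1 host:

    ceph daemon mon.storage1 config show | grep mon_osd

This shows the live values rather than whatever is currently in ceph.conf.)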


#ceph osd stat
     osdmap e455: 6 osds: 6 up, 6 in


#ceph -s
    cluster c8a5975f-4c86-4cfe-a91b-fac9f3126afc
     health HEALTH_WARN 528 pgs peering; 528 pgs stuck inactive; 528 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 mons down, quorum 1,2,3,4 storage1,storage2,compute3,compute4
     monmap e1: 5 mons at {admin=192.168.100.39:6789/0,compute3=192.168.100.133:6789/0,compute4=192.168.100.134:6789/0,storage1=192.168.100.120:6789/0,storage2=192.168.100.121:6789/0}, election epoch 132, quorum 1,2,3,4 storage1,storage2,compute3,compute4
     osdmap e455: 6 osds: 6 up, 6 in
      pgmap v48474: 3650 pgs, 19 pools, 27324 MB data, 4420 objects
            82443 MB used, 2682 GB / 2763 GB avail
                3122 active+clean
                 528 remapped+peering
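
To see exactly which PGs are stuck and why, the following standard commands can help:

    ceph health detail
    ceph pg dump_stuck inactive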



Ceph.conf file
==============

[global]
fsid = c8a5975f-4c86-4cfe-a91b-fac9f3126afc
mon_initial_members = admin, storage1, storage2, compute3, compute4
mon_host = 192.168.100.39,192.168.100.120,192.168.100.121,192.168.100.133,192.168.100.134
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

osd pool default size = 3
osd pool default min size = 3

osd pool default pg num = 300
osd pool default pgp num = 300

public network = 192.168.100.0/24

rgw print continue = false
rgw enable ops log = false

mon osd report timeout = 60
mon osd down out interval = 30
mon osd min down reports = 2

osd heartbeat grace = 10
osd mon heartbeat interval = 20
osd mon report interval max = 60
osd mon ack timeout = 15

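A sketch of how such settings could be applied at runtime, assuming an admin keyring is available:

    ceph tell osd.* injectargs '--osd-heartbeat-grace 10'
    ceph tell mon.* injectargs '--mon-osd-min-down-reports 2'

Values injected this way do not persist across restarts, so they still belong in ceph.conf as well.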


Regards,
Sudarshan Pathak


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





