Dear List,

We have a cluster running Jewel 10.2.2 under Ubuntu 16.04. The cluster is composed of 12 nodes; each node has 10 OSDs with journals on disk. We have one rbd partition and a radosgw with two data pools, one replicated and one EC (8+2). Below are a few details on our cluster.

Currently, our cluster is not usable at all due to too much OSD instability. OSD daemons die randomly with "hit suicide timeout". Yesterday, every one of the 120 OSDs died at least 12 times (max 74), around 40 times on average.

Here are logs from the ceph mon and from one OSD:

http://icwww.epfl.ch/~ymoulin/ceph/cephprod.log.bz2 (6MB)
http://icwww.epfl.ch/~ymoulin/ceph/cephprod-osd.10.log.bz2 (6MB)

We have stopped all client I/O to see if the cluster would become stable, without success. To avoid endless rebalancing while the OSDs flap, we had to "set noout" on the cluster.

For now we have no idea what's going on. Can anyone help us understand what's happening?

Thanks for your help,

--
Yoann Moulin
EPFL IC-IT
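For the record, here is a sketch of the mitigation step mentioned above, plus the kind of diagnostics one might run on an affected OSD. The timeout option names are the Jewel-era settings we believe are involved in "hit suicide timeout"; the 300-second value is only an illustrative bump, not a recommendation:

```shell
# Freeze CRUSH so flapping OSDs are not marked "out" (stops rebalancing churn)
ceph osd set noout

# Inspect the current thread suicide timeouts on one OSD via its admin socket
# (osd_op_thread_suicide_timeout / filestore_op_thread_suicide_timeout are the
# Jewel option names we believe apply; defaults may differ per release)
ceph daemon osd.10 config get osd_op_thread_suicide_timeout
ceph daemon osd.10 config get filestore_op_thread_suicide_timeout

# Temporarily raise the op-thread suicide timeout cluster-wide (illustrative
# value; this only buys time while the root cause is investigated)
ceph tell osd.* injectargs '--osd-op-thread-suicide-timeout 300'

# Once the cluster is stable again, re-enable normal out-marking
ceph osd unset noout
```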
$ ceph --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

$ uname -a
Linux icadmin004 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

$ ceph osd pool ls detail
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4927 flags hashpspool stripe_width 0
        removed_snaps [1~3]
pool 3 '.rgw.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 258 flags hashpspool stripe_width 0
pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 259 flags hashpspool stripe_width 0
pool 5 'default.rgw.data.root' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 260 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 6 'default.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 261 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 7 'default.rgw.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 262 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 8 'erasure.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 271 flags hashpspool stripe_width 0
pool 9 'erasure.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 272 flags hashpspool stripe_width 0
pool 11 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 276 flags hashpspool stripe_width 0
pool 12 'default.rgw.buckets.extra' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 277 flags hashpspool stripe_width 0
pool 14 'default.rgw.users.uid' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 311 flags hashpspool stripe_width 0
pool 15 'default.rgw.users.keys' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 313 flags hashpspool stripe_width 0
pool 16 'default.rgw.meta' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 315 flags hashpspool stripe_width 0
pool 17 'default.rgw.users.swift' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 320 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 18 'default.rgw.users.email' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 322 owner 18446744073709551615 flags hashpspool stripe_width 0
pool 19 'default.rgw.usage' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 353 flags hashpspool stripe_width 0
pool 20 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 4096 pgp_num 4096 last_change 4918 flags hashpspool stripe_width 0
pool 26 '.rgw.control' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3549 flags hashpspool stripe_width 0
pool 27 '.rgw' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3551 flags hashpspool stripe_width 0
pool 28 '.rgw.gc' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3552 flags hashpspool stripe_width 0
pool 29 '.log' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 3553 flags hashpspool stripe_width 0
pool 30 'test' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 4910 flags hashpspool stripe_width 0
pool 31 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 4912 flags hashpspool stripe_width 0
pool 34 'cephfs_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 26021 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 35 'cephfs_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 128 pgp_num 128 last_change 26019 flags hashpspool stripe_width 0
pool 37 'erasure.rgw.buckets' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 31463 flags hashpspool stripe_width 0
pool 38 'default.rgw.buckets' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 8 pgp_num 8 last_change 31466 flags hashpspool stripe_width 0
pool 39 'erasure.rgw.buckets.data' erasure size 10 min_size 8 crush_ruleset 3 object_hash rjenkins pg_num 128 pgp_num 128 last_change 31469 flags hashpspool stripe_width 4096

$ ceph -s
    cluster f9dfd27f-c704-4d53-9aa0-4a23d655c7c4
     health HEALTH_ERR
            95 pgs are stuck inactive for more than 300 seconds
            1577 pgs degraded
            15 pgs down
            15 pgs peering
            95 pgs stuck inactive
            1497 pgs stuck unclean
            1577 pgs undersized
            1 requests are blocked > 32 sec
            recovery 14191345/255016286 objects degraded (5.565%)
            recovery 1595762/255016286 objects misplaced (0.626%)
            7/120 in osds are down
            noout,sortbitwise flag(s) set
     monmap e1: 3 mons at {node002.cluster.localdomain=10.90.37.3:6789/0,node010.cluster.localdomain=10.90.37.11:6789/0,node018.cluster.localdomain=10.90.37.19:6789/0}
            election epoch 64, quorum 0,1,2 node002.cluster.localdomain,node010.cluster.localdomain,node018.cluster.localdomain
      fsmap e131: 1/1/1 up {0=node022.cluster.localdomain=up:active}, 2 up:standby
     osdmap e72842: 144 osds: 137 up, 120 in; 16 remapped pgs
            flags noout,sortbitwise
      pgmap v4819062: 9408 pgs, 28 pools, 153 TB data, 75849 kobjects
            449 TB used, 203 TB / 653 TB avail
            14191345/255016286 objects degraded (5.565%)
            1595762/255016286 objects misplaced (0.626%)
                7810 active+clean
                1497 active+undersized+degraded
                  80 undersized+degraded+peered
                  15 down+remapped+peering
                   4 active+clean+scrubbing
                   2 active+clean+scrubbing+deep
  client io 0 B/s wr, 0 op/s rd, 23 op/s wr

$ ceph df
GLOBAL:
    SIZE     AVAIL     RAW USED     %RAW USED
    653T      203T         449T         68.83
POOLS:
    NAME                          ID     USED       %USED     MAX AVAIL     OBJECTS
    rbd                           0      50122G     21.18        38608G     12912190
    .rgw.root                     3        2856         0        38608G           15
    default.rgw.control           4           0         0        38608G           12
    default.rgw.data.root         5       19800         0        38608G           64
    default.rgw.gc                6           0         0        38608G           34
    default.rgw.log               7           0         0        38608G          285
    erasure.rgw.buckets.index     8           0         0        38608G            6
    erasure.rgw.buckets.extra     9           0         0        38608G          119
    default.rgw.buckets.index     11          0         0        38608G           49
    default.rgw.buckets.extra     12          0         0        38608G          115
    default.rgw.users.uid         14       3817         0        38608G           12
    default.rgw.users.keys        15        206         0        38608G           17
    default.rgw.meta              16      40330         0        38608G          127
    default.rgw.users.swift       17         21         0        38608G            2
    default.rgw.users.email       18         79         0        38608G            6
    default.rgw.usage             19          0         0        38608G            6
    default.rgw.buckets.data      20     99929G     42.28        38608G     61525581
    .rgw.control                  26          0         0        38608G            8
    .rgw                          27          0         0        38608G            0
    .rgw.gc                       28          0         0        38608G            0
    .log                          29          0         0        38608G            0
    test                          30          0         0        38608G            0
    data                          31      5478M         0        38608G        87663
    cephfs_data                   34          0         0        38608G            0
    cephfs_metadata               35       2068         0        38608G           20
    erasure.rgw.buckets           37          0         0        38608G            0
    default.rgw.buckets           38          0         0        38608G            0
    erasure.rgw.buckets.data      39      7604G      1.35        92661G      3143729
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com