can't start OSD

Rémi BUISSON <remi-buisson@xxxxxxxxx> · Fri, 10 Nov 2017 17:31:34 +0100

Hello,

I am having trouble restarting OSDs that are down.

My cluster runs Ceph Luminous (12.2.0) on Debian stretch (with the 
backported 4.13.0 kernel).

An admin changed the fsid and restarted the OSDs on one machine. I 
don't know whether that is the cause of all this, but my cluster is now in 
HEALTH_ERR and some PGs are down or inactive. The correct config has since 
been restored, but some OSDs in the cluster (on other machines too) still 
can't start.

Here is the health detail:

HEALTH_ERR 2282635/254779209 objects misplaced (0.896%); Reduced data 
availability: 3 pgs inactive, 1 pg down; Degraded data redundancy: 
2837613/254779209 objects degraded (1.114%), 93 pgs unclean, 70 pgs 
degraded, 64 pgs undersized; 4017 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 2282635/254779209 objects misplaced (0.896%)
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 1 pg down
    pg 14.12a is down, acting [28,13,19]
    pg 14.15d is stuck inactive for 5344.345563, current state unknown, 
last acting []
    pg 14.1d7 is stuck inactive for 4306.284248, current state 
undersized+degraded+remapped+backfilling+peered, last acting [13]
PG_DEGRADED Degraded data redundancy: 2837613/254779209 objects degraded 
(1.114%), 93 pgs unclean, 70 pgs degraded, 64 pgs undersized
    pg 10.3 is stuck unclean for 5483.175862, current state 
active+remapped+backfill_wait, last acting [35,44,30]
    pg 10.1f is active+recovery_wait+degraded, acting [56,8,52]
    pg 14.0 is stuck undersized for 6003.911469, current state 
active+undersized+degraded+remapped+backfilling, last acting [13,42]
    pg 14.21 is stuck undersized for 437.855288, current state 
active+undersized+degraded+remapped+backfilling, last acting [40,59]
    pg 14.2b is stuck unclean for 123.787607, current state 
active+remapped+backfill_wait, last acting [62,30,24]
    pg 14.4a is stuck undersized for 723.893114, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [43,22]
    pg 14.56 is stuck unclean for 123.821351, current state 
active+remapped+backfill_wait, last acting [56,43,63]
    pg 14.1fe is stuck undersized for 123.800787, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [63,8]
    pg 14.20a is stuck unclean for 24341.489625, current state 
active+remapped+backfill_wait, last acting [20,28,37]
    pg 14.20b is stuck unclean for 24351.403819, current state 
active+remapped+backfill_wait, last acting [60,6,57]
    pg 14.21d is stuck unclean for 24345.292525, current state 
active+remapped+backfill_wait, last acting [59,62,10]
    pg 14.226 is stuck undersized for 363.681151, current state 
active+undersized+degraded+remapped+backfilling, last acting [44,19]
    pg 14.22c is stuck unclean for 123.793121, current state 
active+remapped+backfill_wait, last acting [16,40,9]
    pg 14.236 is stuck undersized for 163.374339, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [61,6]
    pg 14.240 is stuck undersized for 437.857887, current state 
active+undersized+degraded+remapped+backfilling, last acting [57,27]
    pg 14.24d is stuck undersized for 115.191726, current state 
active+undersized+degraded+remapped+backfilling, last acting [19,27]
    pg 14.268 is stuck undersized for 7932.097742, current state 
active+undersized+degraded+remapped+backfilling, last acting [12,58]
    pg 14.27d is stuck unclean for 7935.169818, current state 
active+remapped+backfilling, last acting [12,47,8]
    pg 14.290 is stuck undersized for 437.855071, current state 
active+undersized+degraded+remapped+backfilling, last acting [29,3]
    pg 14.2aa is stuck undersized for 114.181416, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [3,46]
    pg 14.2ac is stuck undersized for 123.821179, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [47,18]
    pg 14.2b9 is stuck undersized for 3704.234924, current state 
active+undersized+degraded+remapped+backfilling, last acting [13,38]
    pg 14.2c4 is stuck undersized for 123.824405, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [15,36]
    pg 14.2c5 is stuck undersized for 161.266102, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [63,44]
    pg 14.2e0 is stuck undersized for 438.862093, current state 
active+undersized+degraded+remapped+backfilling, last acting [9,21]
    pg 14.2eb is stuck undersized for 437.860653, current state 
active+undersized+degraded+remapped+backfilling, last acting [8,34]
    pg 14.2f8 is stuck undersized for 163.373209, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [61,28]
    pg 14.305 is stuck undersized for 723.892233, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [9,40]
    pg 14.320 is stuck unclean for 123.788128, current state 
active+remapped+backfill_wait, last acting [62,6,5]
    pg 14.322 is stuck undersized for 437.856055, current state 
active+undersized+degraded+remapped+backfilling, last acting [59,20]
    pg 14.32c is stuck undersized for 3703.227571, current state 
active+undersized+degraded+remapped+backfilling, last acting [8,35]
    pg 14.34c is stuck undersized for 161.271281, current state 
active+undersized+degraded+remapped+backfilling, last acting [15,63]
    pg 14.350 is stuck undersized for 437.860280, current state 
active+undersized+degraded+remapped+backfilling, last acting [14,5]
    pg 14.397 is stuck undersized for 7932.112171, current state 
active+undersized+degraded+remapped+backfilling, last acting [12,36]
    pg 14.398 is stuck undersized for 3703.121001, current state 
active+undersized+degraded+remapped+backfilling, last acting [9,60]
    pg 14.399 is stuck undersized for 593.828981, current state 
active+undersized+degraded+remapped+backfilling, last acting [8,56]
    pg 14.39e is stuck unclean for 138.073532, current state 
active+remapped+backfill_wait, last acting [44,3,60]
    pg 14.3a5 is stuck undersized for 161.266621, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [63,28]
    pg 14.3a8 is stuck undersized for 161.269743, current state 
active+undersized+degraded+remapped+backfilling, last acting [46,59]
    pg 14.3b2 is stuck undersized for 7932.093694, current state 
active+undersized+degraded+remapped+backfilling, last acting [12,1]
    pg 14.3ca is stuck undersized for 724.899933, current state 
active+undersized+degraded+remapped+backfilling, last acting [9,31]
    pg 14.3cc is stuck undersized for 115.185775, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [42,9]
    pg 14.3ea is stuck unclean for 8143.713642, current state 
active+remapped+backfilling, last acting [13,57,62]
    pg 14.3ed is stuck undersized for 361.684445, current state 
active+undersized+degraded+remapped+backfilling, last acting [13,5]
    pg 14.3f2 is stuck undersized for 437.859470, current state 
active+undersized+degraded+remapped+backfilling, last acting [11,31]
    pg 14.3f3 is stuck undersized for 363.686095, current state 
active+undersized+degraded+remapped+backfilling, last acting [12,44]
    pg 14.3fd is stuck undersized for 437.859446, current state 
active+undersized+degraded+remapped+backfill_wait, last acting [19,57]
    pg 35.1e is active+recovery_wait+degraded, acting [2,52,41]
    pg 39.11 is active+recovery_wait+degraded, acting [15,19,53]
    pg 40.1 is active+recovery_wait+degraded, acting [13,41,52]
    pg 41.b is active+recovery_wait+degraded, acting [56,52,29]
REQUEST_STUCK 4017 stuck requests are blocked > 4096 sec
    207 ops are blocked > 33554.4 sec
    3769 ops are blocked > 16777.2 sec
    41 ops are blocked > 8388.61 sec
    osd.21 has stuck requests > 33554.4 sec
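
For a quicker overview of the PG list above, the states can be tallied with a small script. This is only a minimal sketch in Python: the function name `tally_pg_states` and the regular expression are illustrative assumptions, not part of Ceph, and the pattern is based solely on the line shapes visible in this output (including the mail client's line wrapping).

```python
import re
from collections import Counter

# Matches one PG entry as it appears in the `ceph health detail` output above.
# \s+ is used instead of single spaces so that entries wrapped across two
# lines by the mail client still match.
PG_ENTRY = re.compile(
    r"pg\s+(\d+\.[0-9a-f]+)\s+is\s+"
    r"(?:stuck\s+\w+\s+for\s+[\d.]+,\s+current\s+state\s+)?"
    r"([a-z_+]+),\s+(?:last\s+)?acting\s+\[([\d,]*)\]"
)


def tally_pg_states(text):
    """Return (Counter: state -> number of PGs, dict: pgid -> acting OSD ids)."""
    states = Counter()
    acting = {}
    for pgid, state, osds in PG_ENTRY.findall(text):
        if pgid in acting:
            # Count each PG once even if it is listed under several checks.
            continue
        states[state] += 1
        acting[pgid] = [int(osd) for osd in osds.split(",") if osd]
    return states, acting


# Small sample copied from the output above; the second entry is wrapped.
sample = """\
    pg 14.12a is down, acting [28,13,19]
    pg 14.15d is stuck inactive for 5344.345563, current state unknown, 
last acting []
    pg 10.1f is active+recovery_wait+degraded, acting [56,8,52]
"""

states, acting = tally_pg_states(sample)
for state, count in states.most_common():
    print(count, state)
```

Passing the full saved output to `tally_pg_states` instead of `sample` gives one count per PG state, and the returned `acting` mapping shows which OSDs each affected PG currently relies on.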

The down OSDs fail to start, and we see the following errors in the logs:

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) 
luminous (rc)
 1: (()+0xa07bb4) [0x561641bdebb4]
 2: (()+0x110c0) [0x7f108f4c30c0]
 3: (gsignal()+0xcf) [0x7f108e48afcf]
 4: (abort()+0x16a) [0x7f108e48c3fa]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x28e) [0x561641c2652e]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
const&)+0x434) [0x5616418a5964]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, 
std::allocator<int> > const&, std::vector<int, std::allocator<int> > 
const&, int, int, std::vector<int, std::allocator<int> > const&, 
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned 
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, 
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x395) 
[0x561641882ae5]
 8: (OSD::build_past_intervals_parallel()+0xc59) [0x56164163b9e9]
 9: (OSD::load_pgs()+0x147b) [0x56164163e27b]
 10: (OSD::init()+0x2227) [0x5616416565b7]
 11: (main()+0x2eb8) [0x561641568d38]
 12: (__libc_start_main()+0xf1) [0x7f108e4782b1]
 13: (_start()+0x2a) [0x5616415f2a0a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- begin dump of recent events ---
     0> 2017-11-10 15:14:26.032876 7f1091ee2e40 -1 *** Caught signal 
(Aborted) **
 in thread 7f1091ee2e40 thread_name:ceph-osd

 ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) 
luminous (rc)
 1: (()+0xa07bb4) [0x561641bdebb4]
 2: (()+0x110c0) [0x7f108f4c30c0]
 3: (gsignal()+0xcf) [0x7f108e48afcf]
 4: (abort()+0x16a) [0x7f108e48c3fa]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x28e) [0x561641c2652e]
 6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t 
const&)+0x434) [0x5616418a5964]
 7: (PastIntervals::check_new_interval(int, int, std::vector<int, 
std::allocator<int> > const&, std::vector<int, std::allocator<int> > 
const&, int, int, std::vector<int, std::allocator<int> > const&, 
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned 
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t, 
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x395) 
[0x561641882ae5]
 8: (OSD::build_past_intervals_parallel()+0xc59) [0x56164163b9e9]
 9: (OSD::load_pgs()+0x147b) [0x56164163e27b]
 10: (OSD::init()+0x2227) [0x5616416565b7]
 11: (main()+0x2eb8) [0x561641568d38]
 12: (__libc_start_main()+0xf1) [0x7f108e4782b1]
 13: (_start()+0x2a) [0x5616415f2a0a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

--- logging levels ---
   0/ 0 none
   0/ 0 lockdep
   0/ 0 context
   0/ 0 crush
   0/ 0 mds
   0/ 0 mds_balancer
   0/ 0 mds_locker
   0/ 0 mds_log
   0/ 0 mds_log_expire
   0/ 0 mds_migrator
   0/ 0 buffer
   0/ 0 timer
   0/ 0 filer
   0/ 0 striper
   0/ 0 objecter
   0/ 0 rados
   0/ 0 rbd
   0/ 5 rbd_mirror
   0/ 0 rbd_replay
   0/ 0 journaler
   0/ 0 objectcacher
   0/ 0 client
   0/ 0 osd
   0/ 0 optracker
   0/ 0 objclass
   0/ 0 filestore
   0/ 0 journal
   0/ 0 ms
   0/ 0 mon
   0/ 0 monc
   0/ 0 paxos
   0/ 0 tp
   0/ 0 auth
   0/ 0 crypto
   0/ 0 finisher
   0/ 0 heartbeatmap
   0/ 0 perfcounter
   0/ 0 rgw
   0/ 0 civetweb
   0/ 0 javaclient
   0/ 0 asok
   0/ 0 throttle
   0/ 0 refs
   0/ 0 xio
   0/ 0 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   4/ 5 memdb
   1/ 5 kinetic
   1/ 5 fuse
   1/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-osd.26.log
--- end dump of recent events ---

This seems to be exactly the same bug as 
http://tracker.ceph.com/issues/21142.

Can somebody help me, please?

Thanks in advance :-)

Rémi
