Hello,
I am having trouble restarting OSDs that are down.
My cluster runs on Debian stretch (with backported kernel 4.13.0)
with Ceph Luminous (12.2.0).
An admin changed the fsid and restarted the OSDs on one machine. I
don't know whether that is the cause of all this, but my cluster is now
in HEALTH_ERR and some PGs are down or inactive. The correct config has
since been restored, but some OSDs (on other machines too) still fail
to start.
Here is the health detail:
HEALTH_ERR 2282635/254779209 objects misplaced (0.896%); Reduced data
availability: 3 pgs inactive, 1 pg down; Degraded data redundancy:
2837613/254779209 objects degraded (1.114%), 93 pgs unclean, 70 pgs
degraded, 64 pgs undersized; 4017 stuck requests are blocked > 4096 sec
OBJECT_MISPLACED 2282635/254779209 objects misplaced (0.896%)
PG_AVAILABILITY Reduced data availability: 3 pgs inactive, 1 pg down
pg 14.12a is down, acting [28,13,19]
pg 14.15d is stuck inactive for 5344.345563, current state unknown,
last acting []
pg 14.1d7 is stuck inactive for 4306.284248, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]
PG_DEGRADED Degraded data redundancy: 2837613/254779209 objects degraded
(1.114%), 93 pgs unclean, 70 pgs degraded, 64 pgs undersized
pg 10.3 is stuck unclean for 5483.175862, current state
active+remapped+backfill_wait, last acting [35,44,30]
pg 10.1f is active+recovery_wait+degraded, acting [56,8,52]
pg 14.0 is stuck undersized for 6003.911469, current state
active+undersized+degraded+remapped+backfilling, last acting [13,42]
pg 14.21 is stuck undersized for 437.855288, current state
active+undersized+degraded+remapped+backfilling, last acting [40,59]
pg 14.2b is stuck unclean for 123.787607, current state
active+remapped+backfill_wait, last acting [62,30,24]
pg 14.4a is stuck undersized for 723.893114, current state
active+undersized+degraded+remapped+backfill_wait, last acting [43,22]
pg 14.56 is stuck unclean for 123.821351, current state
active+remapped+backfill_wait, last acting [56,43,63]
pg 14.1fe is stuck undersized for 123.800787, current state
active+undersized+degraded+remapped+backfill_wait, last acting [63,8]
pg 14.20a is stuck unclean for 24341.489625, current state
active+remapped+backfill_wait, last acting [20,28,37]
pg 14.20b is stuck unclean for 24351.403819, current state
active+remapped+backfill_wait, last acting [60,6,57]
pg 14.21d is stuck unclean for 24345.292525, current state
active+remapped+backfill_wait, last acting [59,62,10]
pg 14.226 is stuck undersized for 363.681151, current state
active+undersized+degraded+remapped+backfilling, last acting [44,19]
pg 14.22c is stuck unclean for 123.793121, current state
active+remapped+backfill_wait, last acting [16,40,9]
pg 14.236 is stuck undersized for 163.374339, current state
active+undersized+degraded+remapped+backfill_wait, last acting [61,6]
pg 14.240 is stuck undersized for 437.857887, current state
active+undersized+degraded+remapped+backfilling, last acting [57,27]
pg 14.24d is stuck undersized for 115.191726, current state
active+undersized+degraded+remapped+backfilling, last acting [19,27]
pg 14.268 is stuck undersized for 7932.097742, current state
active+undersized+degraded+remapped+backfilling, last acting [12,58]
pg 14.27d is stuck unclean for 7935.169818, current state
active+remapped+backfilling, last acting [12,47,8]
pg 14.290 is stuck undersized for 437.855071, current state
active+undersized+degraded+remapped+backfilling, last acting [29,3]
pg 14.2aa is stuck undersized for 114.181416, current state
active+undersized+degraded+remapped+backfill_wait, last acting [3,46]
pg 14.2ac is stuck undersized for 123.821179, current state
active+undersized+degraded+remapped+backfill_wait, last acting [47,18]
pg 14.2b9 is stuck undersized for 3704.234924, current state
active+undersized+degraded+remapped+backfilling, last acting [13,38]
pg 14.2c4 is stuck undersized for 123.824405, current state
active+undersized+degraded+remapped+backfill_wait, last acting [15,36]
pg 14.2c5 is stuck undersized for 161.266102, current state
active+undersized+degraded+remapped+backfill_wait, last acting [63,44]
pg 14.2e0 is stuck undersized for 438.862093, current state
active+undersized+degraded+remapped+backfilling, last acting [9,21]
pg 14.2eb is stuck undersized for 437.860653, current state
active+undersized+degraded+remapped+backfilling, last acting [8,34]
pg 14.2f8 is stuck undersized for 163.373209, current state
active+undersized+degraded+remapped+backfill_wait, last acting [61,28]
pg 14.305 is stuck undersized for 723.892233, current state
active+undersized+degraded+remapped+backfill_wait, last acting [9,40]
pg 14.320 is stuck unclean for 123.788128, current state
active+remapped+backfill_wait, last acting [62,6,5]
pg 14.322 is stuck undersized for 437.856055, current state
active+undersized+degraded+remapped+backfilling, last acting [59,20]
pg 14.32c is stuck undersized for 3703.227571, current state
active+undersized+degraded+remapped+backfilling, last acting [8,35]
pg 14.34c is stuck undersized for 161.271281, current state
active+undersized+degraded+remapped+backfilling, last acting [15,63]
pg 14.350 is stuck undersized for 437.860280, current state
active+undersized+degraded+remapped+backfilling, last acting [14,5]
pg 14.397 is stuck undersized for 7932.112171, current state
active+undersized+degraded+remapped+backfilling, last acting [12,36]
pg 14.398 is stuck undersized for 3703.121001, current state
active+undersized+degraded+remapped+backfilling, last acting [9,60]
pg 14.399 is stuck undersized for 593.828981, current state
active+undersized+degraded+remapped+backfilling, last acting [8,56]
pg 14.39e is stuck unclean for 138.073532, current state
active+remapped+backfill_wait, last acting [44,3,60]
pg 14.3a5 is stuck undersized for 161.266621, current state
active+undersized+degraded+remapped+backfill_wait, last acting [63,28]
pg 14.3a8 is stuck undersized for 161.269743, current state
active+undersized+degraded+remapped+backfilling, last acting [46,59]
pg 14.3b2 is stuck undersized for 7932.093694, current state
active+undersized+degraded+remapped+backfilling, last acting [12,1]
pg 14.3ca is stuck undersized for 724.899933, current state
active+undersized+degraded+remapped+backfilling, last acting [9,31]
pg 14.3cc is stuck undersized for 115.185775, current state
active+undersized+degraded+remapped+backfill_wait, last acting [42,9]
pg 14.3ea is stuck unclean for 8143.713642, current state
active+remapped+backfilling, last acting [13,57,62]
pg 14.3ed is stuck undersized for 361.684445, current state
active+undersized+degraded+remapped+backfilling, last acting [13,5]
pg 14.3f2 is stuck undersized for 437.859470, current state
active+undersized+degraded+remapped+backfilling, last acting [11,31]
pg 14.3f3 is stuck undersized for 363.686095, current state
active+undersized+degraded+remapped+backfilling, last acting [12,44]
pg 14.3fd is stuck undersized for 437.859446, current state
active+undersized+degraded+remapped+backfill_wait, last acting [19,57]
pg 35.1e is active+recovery_wait+degraded, acting [2,52,41]
pg 39.11 is active+recovery_wait+degraded, acting [15,19,53]
pg 40.1 is active+recovery_wait+degraded, acting [13,41,52]
pg 41.b is active+recovery_wait+degraded, acting [56,52,29]
REQUEST_STUCK 4017 stuck requests are blocked > 4096 sec
207 ops are blocked > 33554.4 sec
3769 ops are blocked > 16777.2 sec
41 ops are blocked > 8388.61 sec
osd.21 has stuck requests > 33554.4 sec
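Since the listing above is long, here is a rough Python sketch for
tallying it by PG state and by acting OSD. It assumes the text of
`ceph health detail` has been saved somewhere; the name `summarize_pgs`,
the `SAMPLE` constant and the regular expression are only illustrative
and cover just the line shapes that appear above (including the mail
wrapping), not every form Ceph can print.

```python
import re
from collections import Counter

# Line shapes handled (assumption: only those seen in the listing above):
#   pg X is <state>, acting [...]
#   pg X is stuck <what> for <secs>, current state <state>, last acting [...]
PG_LINE = re.compile(
    r"\bpg (\S+) is "
    r"(?:stuck \w+ for [\d.]+, current state )?"
    r"([\w+]+), (?:last )?acting \[([\d,]*)\]"
)

# A few lines copied from the listing, wrapped as in the mail.
SAMPLE = """\
pg 14.12a is down, acting [28,13,19]
pg 14.15d is stuck inactive for 5344.345563, current state unknown,
last acting []
pg 14.1d7 is stuck inactive for 4306.284248, current state
undersized+degraded+remapped+backfilling+peered, last acting [13]
pg 10.1f is active+recovery_wait+degraded, acting [56,8,52]
"""


def summarize_pgs(text):
    """Return (PG count per state, PG count per acting OSD id)."""
    # Collapse all whitespace so entries wrapped over two lines match too.
    flat = " ".join(text.split())
    states = Counter()
    osds = Counter()
    for _pgid, state, acting in PG_LINE.findall(flat):
        states[state] += 1
        # An empty acting set ("[]") contributes no OSD ids.
        osds.update(int(o) for o in acting.split(",") if o)
    return states, osds


if __name__ == "__main__":
    states, osds = summarize_pgs(SAMPLE)
    for state, count in states.most_common():
        print(count, state)
    for osd, count in osds.most_common():
        print("osd.%d appears in %d acting sets" % (osd, count))
```

Run against the full output instead of `SAMPLE`, this would show which
OSDs appear most often in the acting sets of the stuck PGs.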
The down OSDs do not start, and we observe the following errors in the
logs:
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
luminous (rc)
1: (()+0xa07bb4) [0x561641bdebb4]
2: (()+0x110c0) [0x7f108f4c30c0]
3: (gsignal()+0xcf) [0x7f108e48afcf]
4: (abort()+0x16a) [0x7f108e48c3fa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x561641c2652e]
6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
const&)+0x434) [0x5616418a5964]
7: (PastIntervals::check_new_interval(int, int, std::vector<int,
std::allocator<int> > const&, std::vector<int, std::allocator<int> >
const&, int, int, std::vector<int, std::allocator<int> > const&,
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x395)
[0x561641882ae5]
8: (OSD::build_past_intervals_parallel()+0xc59) [0x56164163b9e9]
9: (OSD::load_pgs()+0x147b) [0x56164163e27b]
10: (OSD::init()+0x2227) [0x5616416565b7]
11: (main()+0x2eb8) [0x561641568d38]
12: (__libc_start_main()+0xf1) [0x7f108e4782b1]
13: (_start()+0x2a) [0x5616415f2a0a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- begin dump of recent events ---
0> 2017-11-10 15:14:26.032876 7f1091ee2e40 -1 *** Caught signal
(Aborted) **
in thread 7f1091ee2e40 thread_name:ceph-osd
ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
luminous (rc)
1: (()+0xa07bb4) [0x561641bdebb4]
2: (()+0x110c0) [0x7f108f4c30c0]
3: (gsignal()+0xcf) [0x7f108e48afcf]
4: (abort()+0x16a) [0x7f108e48c3fa]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x561641c2652e]
6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
const&)+0x434) [0x5616418a5964]
7: (PastIntervals::check_new_interval(int, int, std::vector<int,
std::allocator<int> > const&, std::vector<int, std::allocator<int> >
const&, int, int, std::vector<int, std::allocator<int> > const&,
std::vector<int, std::allocator<int> > const&, unsigned int, unsigned
int, std::shared_ptr<OSDMap const>, std::shared_ptr<OSDMap const>, pg_t,
IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x395)
[0x561641882ae5]
8: (OSD::build_past_intervals_parallel()+0xc59) [0x56164163b9e9]
9: (OSD::load_pgs()+0x147b) [0x56164163e27b]
10: (OSD::init()+0x2227) [0x5616416565b7]
11: (main()+0x2eb8) [0x561641568d38]
12: (__libc_start_main()+0xf1) [0x7f108e4782b1]
13: (_start()+0x2a) [0x5616415f2a0a]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 0 none
0/ 0 lockdep
0/ 0 context
0/ 0 crush
0/ 0 mds
0/ 0 mds_balancer
0/ 0 mds_locker
0/ 0 mds_log
0/ 0 mds_log_expire
0/ 0 mds_migrator
0/ 0 buffer
0/ 0 timer
0/ 0 filer
0/ 0 striper
0/ 0 objecter
0/ 0 rados
0/ 0 rbd
0/ 5 rbd_mirror
0/ 0 rbd_replay
0/ 0 journaler
0/ 0 objectcacher
0/ 0 client
0/ 0 osd
0/ 0 optracker
0/ 0 objclass
0/ 0 filestore
0/ 0 journal
0/ 0 ms
0/ 0 mon
0/ 0 monc
0/ 0 paxos
0/ 0 tp
0/ 0 auth
0/ 0 crypto
0/ 0 finisher
0/ 0 heartbeatmap
0/ 0 perfcounter
0/ 0 rgw
0/ 0 civetweb
0/ 0 javaclient
0/ 0 asok
0/ 0 throttle
0/ 0 refs
0/ 0 xio
0/ 0 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.26.log
--- end dump of recent events ---
This seems to be exactly the same bug as
http://tracker.ceph.com/issues/21142.
Could somebody help me, please?
Thanks in advance :-)
Rémi