-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 load_pgs: have pgid 9.268 at epoch 196874, but missing map. Crashing.
0> 2016-11-23 15:50:49.509473 7f5f5b7a5800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f5f5b7a5800 time 2016-11-23 15:50:49.507597
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f5f5c1d35b5]
2: (OSD::load_pgs()+0x1f07) [0x7f5f5bb53b57]
3: (OSD::init()+0x2086) [0x7f5f5bb64e56]
4: (main()+0x2c55) [0x7f5f5bac8be5]
5: (__libc_start_main()+0xf5) [0x7f5f586b7b15]
6: (()+0x353009) [0x7f5f5bb13009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
We finally managed to start them by using ceph-objectstore-tool to remove, from the OSD, the PG whose map was missing, once that PG was "active+clean" elsewhere on the cluster.
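For reference, the invocation was roughly the following (the pgid 9.268 is the one from the crash above; the data path follows the usual /var/lib/ceph/osd/ceph-N layout seen in the ceph-32 logs below, so adjust it for your OSD, and run it only while the OSD daemon is stopped):

  # remove the PG whose osdmap is missing, with the OSD daemon stopped
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 \
      --journal-path /var/lib/ceph/osd/ceph-27/journal \
      --pgid 9.268 --op remove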
Some OSDs start but hit the suicide timeout after 3 minutes:
-5> 2016-11-23 15:32:28.488489 7fbe411ff700 5 osd.24 197525 heartbeat: osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])
-4> 2016-11-23 15:32:30.188632 7fbe411ff700 5 osd.24 197525 heartbeat: osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])
-3> 2016-11-23 15:32:32.678977 7fbe67ce3700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60
-2> 2016-11-23 15:32:32.679010 7fbe67ce3700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60
-1> 2016-11-23 15:32:32.679016 7fbe67ce3700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180
0> 2016-11-23 15:32:32.680982 7fbe67ce3700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fbe67ce3700 time 2016-11-23 15:32:32.679038
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
We have no explanation for why these FileStore op threads time out.
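As far as we understand, the two timeouts in the log correspond to filestore_op_thread_timeout (default 60 s) and filestore_op_thread_suicide_timeout (default 180 s); as a temporary workaround (our assumption, not a fix) we could presumably raise them in ceph.conf, something like:

  [osd]
      # assumed workaround only: give the slow FileStore op threads more headroom
      filestore op thread timeout = 180
      filestore op thread suicide timeout = 600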
Some deadlocks:
osd.32 refuses to start because PG 9.72 has no map:
-1> 2016-11-23 15:02:32.675283 7f2b74492800 -1 osd.32 196921 load_pgs: have pgid 9.72 at epoch 196975, but missing map. Crashing.
0> 2016-11-23 15:02:32.676710 7f2b74492800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f2b74492800 time 2016-11-23 15:02:32.675293
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")
PG 9.72 is in state "down+peering" and is waiting either for osd.32 to start or for it to be marked "lost".
We have to declare the OSD lost to break these deadlocks.
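In practice that means something like the following (osd.32 and PG 9.72 taken from the example above; the pg query should show what the PG is blocked on before marking the OSD lost):

  # inspect what PG 9.72 is waiting for
  ceph pg 9.72 query
  # then give up on the OSD that cannot start
  ceph osd lost 32 --yes-i-really-mean-it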
Some log messages we would like an explanation for:
2016-11-23 15:02:32.202200 7f2b74492800 0 set uid:gid to 167:167 (ceph:ceph)
2016-11-23 15:02:32.202240 7f2b74492800 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1718781
2016-11-23 15:02:32.203557 7f2b74492800 0 pidfile_write: ignore empty --pid-file
2016-11-23 15:02:32.231376 7f2b74492800 0 filestore(/var/lib/ceph/osd/ceph-32) backend xfs (magic 0x58465342)
2016-11-23 15:02:32.231935 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-11-23 15:02:32.231941 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2016-11-23 15:02:32.231961 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: splice is supported
2016-11-23 15:02:32.232777 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-11-23 15:02:32.232824 7f2b74492800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_feature: extsize is disabled by conf
2016-11-23 15:02:32.233704 7f2b74492800 1 leveldb: Recovering log #102027
2016-11-23 15:02:32.234863 7f2b74492800 1 leveldb: Delete type=3 #102026
2016-11-23 15:02:32.234926 7f2b74492800 1 leveldb: Delete type=0 #102027
2016-11-23 15:02:32.235444 7f2b74492800 0 filestore(/var/lib/ceph/osd/ceph-32) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-11-23 15:02:32.237484 7f2b74492800 1 journal _open /var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-11-23 15:02:32.238027 7f2b74492800 1 journal _open /var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-11-23 15:02:32.238992 7f2b74492800 1 filestore(/var/lib/ceph/osd/ceph-32) upgrade
2016-11-23 15:02:32.239727 7f2b74492800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2016-11-23 15:02:32.240153 7f2b74492800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2016-11-23 15:02:32.245427 7f2b74492800 0 osd.32 196921 crush map has features 1107558400, adjusting msgr requires for clients
2016-11-23 15:02:32.245435 7f2b74492800 0 osd.32 196921 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2016-11-23 15:02:32.245439 7f2b74492800 0 osd.32 196921 crush map has features 1107558400, adjusting msgr requires for osds
2016-11-23 15:02:32.639715 7f2b74492800 0 osd.32 196921 load_pgs
If you have any answers ... I'll take them.
Vincent