-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 load_pgs: have pgid 9.268 at epoch 196874, but missing map. Crashing.
0> 2016-11-23 15:50:49.509473 7f5f5b7a5800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f5f5b7a5800 time 2016-11-23 15:50:49.507597
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f5f5c1d35b5]
2: (OSD::load_pgs()+0x1f07) [0x7f5f5bb53b57]
3: (OSD::init()+0x2086) [0x7f5f5bb64e56]
4: (main()+0x2c55) [0x7f5f5bac8be5]
5: (__libc_start_main()+0xf5) [0x7f5f586b7b15]
6: (()+0x353009) [0x7f5f5bb13009]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
We finally managed to start them by using ceph-objectstore-tool to remove, from the OSD, the PG whose map was missing, once that PG was "active+clean" elsewhere on the cluster.
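For reference, the invocation was roughly the following (the pgid 9.268 is the one from the crash above; the data path follows the usual /var/lib/ceph/osd/ceph-N layout seen in the ceph-32 logs below, so adjust it for your OSD, and run it only while the OSD daemon is stopped):

  # remove the PG whose osdmap is missing, with the OSD daemon stopped
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-27 \
      --journal-path /var/lib/ceph/osd/ceph-27/journal \
      --pgid 9.268 --op remove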
Some OSDs start but hit the suicide timeout after 3 minutes:
-5> 2016-11-23 15:32:28.488489 7fbe411ff700 5 osd.24 197525 heartbeat: osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])
-4> 2016-11-23 15:32:30.188632 7fbe411ff700 5 osd.24 197525 heartbeat: osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])
-3> 2016-11-23 15:32:32.678977 7fbe67ce3700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60
-2> 2016-11-23 15:32:32.679010 7fbe67ce3700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60
-1> 2016-11-23 15:32:32.679016 7fbe67ce3700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180
0> 2016-11-23 15:32:32.680982 7fbe67ce3700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fbe67ce3700 time 2016-11-23 15:32:32.679038
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
We have no explanation for why these FileStore op threads time out.
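As far as we understand, the two timeouts in the log correspond to filestore_op_thread_timeout (default 60 s) and filestore_op_thread_suicide_timeout (default 180 s); as a temporary workaround (our assumption, not a fix) we could presumably raise them in ceph.conf, something like:

  [osd]
      # assumed workaround only: give the slow FileStore op threads more headroom
      filestore op thread timeout = 180
      filestore op thread suicide timeout = 600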
Some deadlocks:
osd.32 refuses to start because PG 9.72 has no map:
-1> 2016-11-23 15:02:32.675283 7f2b74492800 -1 osd.32 196921 load_pgs: have pgid 9.72 at epoch 196975, but missing map. Crashing.
0> 2016-11-23 15:02:32.676710 7f2b74492800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f2b74492800 time 2016-11-23 15:02:32.675293
osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")
PG 9.72 is in state "down+peering" and is waiting either for osd.32 to start or for it to be marked "lost".
We have to declare the OSD lost to break these deadlocks.
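In practice that means something like the following (osd.32 and PG 9.72 taken from the example above; the pg query should show what the PG is blocked on before marking the OSD lost):

  # inspect what PG 9.72 is waiting for
  ceph pg 9.72 query
  # then give up on the OSD that cannot start
  ceph osd lost 32 --yes-i-really-mean-it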
Some log messages we would like an explanation for:
2016-11-23 15:02:32.202200 7f2b74492800 0 set uid:gid to 167:167 (ceph:ceph)
2016-11-23 15:02:32.202240 7f2b74492800 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1718781
2016-11-23 15:02:32.203557 7f2b74492800 0 pidfile_write: ignore empty --pid-file
2016-11-23 15:02:32.231376 7f2b74492800 0 filestore(/var/lib/ceph/osd/ceph-32) backend xfs (magic 0x58465342)
2016-11-23 15:02:32.231935 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2016-11-23 15:02:32.231941 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2016-11-23 15:02:32.231961 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: splice is supported
2016-11-23 15:02:32.232777 7f2b74492800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2016-11-23 15:02:32.232824 7f2b74492800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_feature: extsize is disabled by conf
2016-11-23 15:02:32.233704 7f2b74492800 1 leveldb: Recovering log #102027
2016-11-23 15:02:32.234863 7f2b74492800 1 leveldb: Delete type=3 #102026
2016-11-23 15:02:32.234926 7f2b74492800 1 leveldb: Delete type=0 #102027
2016-11-23 15:02:32.235444 7f2b74492800 0 filestore(/var/lib/ceph/osd/ceph-32) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2016-11-23 15:02:32.237484 7f2b74492800 1 journal _open /var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-11-23 15:02:32.238027 7f2b74492800 1 journal _open /var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1
2016-11-23 15:02:32.238992 7f2b74492800 1 filestore(/var/lib/ceph/osd/ceph-32) upgrade
2016-11-23 15:02:32.239727 7f2b74492800 0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello
2016-11-23 15:02:32.240153 7f2b74492800 0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan
2016-11-23 15:02:32.245427 7f2b74492800 0 osd.32 196921 crush map has features 1107558400, adjusting msgr requires for clients
2016-11-23 15:02:32.245435 7f2b74492800 0 osd.32 196921 crush map has features 1107558400 was 8705, adjusting msgr requires for mons
2016-11-23 15:02:32.245439 7f2b74492800 0 osd.32 196921 crush map has features 1107558400, adjusting msgr requires for osds
2016-11-23 15:02:32.639715 7f2b74492800 0 osd.32 196921 load_pgs
If you have any answers ... I'll take them.
Vincent