Problems after upgrade to Jewel

Hello,

Our cluster failed again this morning, and it took almost the whole day to stabilize. Here are some of the problems we found in the OSD logs:

Some OSDs refused to start:

-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 load_pgs: have pgid 9.268 at epoch 196874, but missing map.  Crashing.

0> 2016-11-23 15:50:49.509473 7f5f5b7a5800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f5f5b7a5800 time 2016-11-23 15:50:49.507597 osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")

 

ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f5f5c1d35b5]

2: (OSD::load_pgs()+0x1f07) [0x7f5f5bb53b57]

3: (OSD::init()+0x2086) [0x7f5f5bb64e56]

4: (main()+0x2c55) [0x7f5f5bac8be5]

5: (__libc_start_main()+0xf5) [0x7f5f586b7b15]

6: (()+0x353009) [0x7f5f5bb13009]

NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 

--- logging levels ---

   0/ 5 none

   0/ 1 lockdep

   0/ 1 context

   1/ 1 crush

   1/ 5 mds

 

We finally managed to start them by removing the PG with the missing map from the OSD, once that PG was reported as "active+clean" by the cluster. We used ceph-objectstore-tool for this.
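
The removal described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the exact procedure that was used: it only parses the `load_pgs` crash line and builds `ceph-objectstore-tool` argument lists, without executing anything. The helper names (`parse_missing_map`, `build_commands`), the export directory, and the assumption that the journal sits at `<data path>/journal` are illustrative. Exporting the PG before removing it is an added precaution, and the OSD has to be stopped while the tool runs.

```python
import re

# Matches the crash line emitted by OSD::load_pgs() in the logs above.
MISSING_MAP_RE = re.compile(
    r"osd\.(?P<osd>\d+) \d+ load_pgs: have pgid (?P<pgid>\S+) "
    r"at epoch (?P<epoch>\d+), but missing map"
)


def parse_missing_map(line):
    """Return (osd_id, pgid, epoch) from a 'missing map' log line, or None."""
    m = MISSING_MAP_RE.search(line)
    if m is None:
        return None
    return int(m.group("osd")), m.group("pgid"), int(m.group("epoch"))


def build_commands(osd_id, pgid, export_dir="/root/pg-backup"):
    """Build (export, remove) argument lists; nothing is executed here.

    Assumptions: default FileStore layout under /var/lib/ceph/osd, and a
    hypothetical export_dir chosen only for this illustration.
    """
    data_path = "/var/lib/ceph/osd/ceph-%d" % osd_id
    base = [
        "ceph-objectstore-tool",
        "--data-path", data_path,
        "--journal-path", data_path + "/journal",
        "--pgid", pgid,
    ]
    export = base + ["--op", "export",
                     "--file", "%s/%s.export" % (export_dir, pgid)]
    remove = base + ["--op", "remove"]
    return export, remove


line = ("-1> 2016-11-23 15:50:49.507588 7f5f5b7a5800 -1 osd.27 196774 "
        "load_pgs: have pgid 9.268 at epoch 196874, but missing map.  Crashing.")
osd_id, pgid, epoch = parse_missing_map(line)
export_cmd, remove_cmd = build_commands(osd_id, pgid)
print(" ".join(export_cmd))
print(" ".join(remove_cmd))
```

Printing the commands instead of running them leaves room to check that the PG really is "active+clean" on the cluster before anything is deleted from the OSD.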


Some OSDs start but then hit the suicide timeout after 3 minutes:

 

-5> 2016-11-23 15:32:28.488489 7fbe411ff700  5 osd.24 197525 heartbeat: osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])

-4> 2016-11-23 15:32:30.188632 7fbe411ff700  5 osd.24 197525 heartbeat: osd_stat(1883 GB used, 3703 GB avail, 5587 GB total, peers []/[] op hist [])

-3> 2016-11-23 15:32:32.678977 7fbe67ce3700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60

-2> 2016-11-23 15:32:32.679010 7fbe67ce3700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60

-1> 2016-11-23 15:32:32.679016 7fbe67ce3700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180

 0> 2016-11-23 15:32:32.680982 7fbe67ce3700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fbe67ce3700 time 2016-11-23 15:32:32.679038 common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 

We have no explanation for this.
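
The 3-minute delay at least matches the log: a `FileStore::op_tp` thread is first reported as timed out after 60 seconds, and the assert fires once it passes the 180-second suicide timeout. These values look like the `filestore op thread timeout` and `filestore op thread suicide timeout` settings, although that is an assumption and not something the log states. Below is a small sketch for summarising such lines from an OSD log; the function name and the returned structure are illustrative only.

```python
import re
from collections import defaultdict

# Matches both the plain and the "suicide" variant of the heartbeat_map line.
HEARTBEAT_RE = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<secs>\d+)"
)


def summarize_timeouts(lines):
    """Map (pool, thread) -> {'timeout': secs or None, 'suicide': secs or None}."""
    summary = defaultdict(lambda: {"timeout": None, "suicide": None})
    for line in lines:
        m = HEARTBEAT_RE.search(line)
        if m is None:
            continue
        kind = "suicide" if m.group("suicide") else "timeout"
        key = (m.group("pool"), m.group("thread"))
        summary[key][kind] = int(m.group("secs"))
    return dict(summary)


log = [
    "2016-11-23 15:32:32.678977 7fbe67ce3700  1 heartbeat_map is_healthy "
    "'FileStore::op_tp thread 0x7fbe5ad4b700' had timed out after 60",
    "2016-11-23 15:32:32.679010 7fbe67ce3700  1 heartbeat_map is_healthy "
    "'FileStore::op_tp thread 0x7fbe5b54c700' had timed out after 60",
    "2016-11-23 15:32:32.679016 7fbe67ce3700  1 heartbeat_map is_healthy "
    "'FileStore::op_tp thread 0x7fbe5b54c700' had suicide timed out after 180",
]
summary = summarize_timeouts(log)
for (pool, thread), hit in sorted(summary.items()):
    print(pool, thread, hit)
```

A thread that appears with only the 60-second entry was slow but recovered or had not yet reached the limit; the one with a `suicide` entry is the thread that took the OSD down.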


Some deadlocks:

 

OSD.32 refused to start because PG 9.72 has no map:

-1> 2016-11-23 15:02:32.675283 7f2b74492800 -1 osd.32 196921 load_pgs: have pgid 9.72 at epoch 196975, but missing map.  Crashing.

0> 2016-11-23 15:02:32.676710 7f2b74492800 -1 osd/OSD.cc: In function 'void OSD::load_pgs()' thread 7f2b74492800 time 2016-11-23 15:02:32.675293 osd/OSD.cc: 3186: FAILED assert(0 == "Missing map in load_pgs")

 

PG 9.72 is in state "down+peering" and is waiting for OSD.32 to start or to be marked "lost".

 

We had to declare OSDs lost because of these deadlocks.
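
The deadlock above is a circular wait: the OSD cannot start because of the PG, and the PG cannot peer until the OSD starts or is marked lost. Here is a minimal sketch that flags this pattern from two hand-built inputs. The dictionaries and the function name are hypothetical, and the suggested command is only printed, never run. Marking an OSD lost can discard data, so it remains a last resort.

```python
def find_deadlocks(crashing_osds, blocked_pgs):
    """Return (osd_id, pgid) pairs where each side waits on the other.

    crashing_osds: osd_id -> pgid whose missing map stops the OSD from starting
    blocked_pgs:   pgid -> set of osd_ids that the down PG is waiting for
    """
    return sorted(
        (osd_id, pgid)
        for osd_id, pgid in crashing_osds.items()
        if osd_id in blocked_pgs.get(pgid, set())
    )


# Values taken from the logs in this message; osd.27 is not deadlocked
# because no down PG is waiting for it.
crashing_osds = {32: "9.72", 27: "9.268"}
blocked_pgs = {"9.72": {32}}

deadlocks = find_deadlocks(crashing_osds, blocked_pgs)
for osd_id, pgid in deadlocks:
    print("osd.%d <-> pg %s: last resort is "
          "'ceph osd lost %d --yes-i-really-mean-it'" % (osd_id, pgid, osd_id))
```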

 

Some log messages we would like an explanation for:

 

2016-11-23 15:02:32.202200 7f2b74492800  0 set uid:gid to 167:167 (ceph:ceph)

2016-11-23 15:02:32.202240 7f2b74492800  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-osd, pid 1718781

2016-11-23 15:02:32.203557 7f2b74492800  0 pidfile_write: ignore empty --pid-file

2016-11-23 15:02:32.231376 7f2b74492800  0 filestore(/var/lib/ceph/osd/ceph-32) backend xfs (magic 0x58465342)

2016-11-23 15:02:32.231935 7f2b74492800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option

2016-11-23 15:02:32.231941 7f2b74492800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option

2016-11-23 15:02:32.231961 7f2b74492800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: splice is supported

2016-11-23 15:02:32.232777 7f2b74492800  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)

2016-11-23 15:02:32.232824 7f2b74492800  0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-32) detect_feature: extsize is disabled by conf

2016-11-23 15:02:32.233704 7f2b74492800  1 leveldb: Recovering log #102027

2016-11-23 15:02:32.234863 7f2b74492800  1 leveldb: Delete type=3 #102026

 

2016-11-23 15:02:32.234926 7f2b74492800  1 leveldb: Delete type=0 #102027

 

2016-11-23 15:02:32.235444 7f2b74492800  0 filestore(/var/lib/ceph/osd/ceph-32) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled

2016-11-23 15:02:32.237484 7f2b74492800  1 journal _open /var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1

2016-11-23 15:02:32.238027 7f2b74492800  1 journal _open /var/lib/ceph/osd/ceph-32/journal fd 18: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 1

2016-11-23 15:02:32.238992 7f2b74492800  1 filestore(/var/lib/ceph/osd/ceph-32) upgrade

2016-11-23 15:02:32.239727 7f2b74492800  0 <cls> cls/hello/cls_hello.cc:305: loading cls_hello

2016-11-23 15:02:32.240153 7f2b74492800  0 <cls> cls/cephfs/cls_cephfs.cc:202: loading cephfs_size_scan

2016-11-23 15:02:32.245427 7f2b74492800  0 osd.32 196921 crush map has features 1107558400, adjusting msgr requires for clients

2016-11-23 15:02:32.245435 7f2b74492800  0 osd.32 196921 crush map has features 1107558400 was 8705, adjusting msgr requires for mons

2016-11-23 15:02:32.245439 7f2b74492800  0 osd.32 196921 crush map has features 1107558400, adjusting msgr requires for osds

2016-11-23 15:02:32.639715 7f2b74492800  0 osd.32 196921 load_pgs
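
A few of the numbers in these startup lines can be decoded with plain arithmetic. The sketch below is only a reading aid: it shows that the filestore magic spells `XFSB`, that the journal is 5 GiB, and which bit positions are set in the two crush feature masks. It deliberately does not name the features behind those bits, since that mapping is not in the log. The function names are illustrative.

```python
def magic_to_ascii(magic):
    """Render a 32-bit magic number as its four ASCII bytes (big-endian)."""
    return magic.to_bytes(4, "big").decode("ascii")


def set_bits(mask):
    """Return the positions of the bits set in an integer mask, lowest first."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


xfs_magic = magic_to_ascii(0x58465342)   # from the "backend xfs" line
journal_gib = 5368709120 / 2**30         # from the "journal _open" lines
new_features = set_bits(1107558400)      # "crush map has features ..."
old_features = set_bits(8705)            # "... was 8705"

print(xfs_magic)
print(journal_gib)
print(hex(1107558400), new_features)
print(hex(8705), old_features)
```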


If you have any answers, I would be glad to hear them.


Vincent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
