Help needed! Cluster unstable after upgrade from Hammer to Jewel

Vincent Godin <vince.mlist@xxxxxxxxx> · Wed, 16 Nov 2016 19:01:52 +0100

Hello,

We now have the full cluster (mons, OSDs and clients) on Jewel 10.2.2 (initially Hammer 0.94.5), but we still have some big problems in our production environment:
Some Ceph filesystems are not mounted at startup, and we have to mount them manually with "/bin/sh -c 'flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync /dev/vdX1'".

Some OSDs start but hit timeouts as soon as they start, and this goes on for a long time (more than 5 minutes):
2016-11-15 01:46:26.625945 7f79db91e800  0 osd.32 191438 done with init, starting boot process
2016-11-15 01:47:28.344996 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
2016-11-15 01:47:33.345098 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
...
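
To put a number on these timeouts, something like the following could be run over an OSD log. This is only a minimal sketch, not something we have run in production: the function name count_heartbeat_timeouts and the regular expression are assumptions, derived solely from the two heartbeat_map lines quoted above, and other Ceph versions may format these messages differently.

```python
import re
from collections import Counter

# Pattern inferred only from the heartbeat_map lines quoted in this post
# (an assumption, not an official Ceph log format specification).
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+\s+\d+ "
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def count_heartbeat_timeouts(lines):
    """Count heartbeat timeout messages per (thread pool, thread id)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line.strip())
        if match:
            counts[(match.group("pool"), match.group("thread"))] += 1
    return counts
```

The function accepts any iterable of log lines, so an open log file object can be passed directly. Grouping by thread shows whether a single FileStore::op_tp thread is stuck or whether several are timing out.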

These OSDs also take a very long time to stop.
We just lost one OSD, and the cluster is unable to stabilize: some OSDs keep going up and down. The cluster is in ERR state and cannot serve the production environment.
We are running Jewel 10.2.2 on CentOS 7.2, kernel 3.10.0-327.36.3.el7.x86_64.
Any help will be appreciated!
Vincent

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com