Re: Help needed ! cluster unstable after upgrade from Hammer to Jewel

Nick Fisk <nick@xxxxxxxxxx> · Wed, 16 Nov 2016 20:54:36 -0000

From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Vincent Godin
Sent: 16 November 2016 18:02
To: ceph-users <ceph-users@xxxxxxxxxxxxxx>
Subject:  Help needed ! cluster unstable after upgrade from Hammer to Jewel

Hello,
We now have a full cluster (Mon, OSD & Clients) on Jewel 10.2.2 (initially Hammer 0.94.5), but we still have some big problems in our production environment:
some Ceph filesystems are not mounted at startup, and we have to mount them manually with "/bin/sh -c 'flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --syn /dev/vdX1'"
some OSDs start but hit timeouts as soon as they start, and this lasts for a long time (more than 5 minutes):
2016-11-15 01:46:26.625945 7f79db91e800  0 osd.32 191438 done with init, starting boot process
2016-11-15 01:47:28.344996 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
2016-11-15 01:47:33.345098 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
...
these OSDs take a very long time to stop
we just lost one OSD and the cluster is unable to stabilize: some OSDs keep going up and down. The cluster is in ERR state and cannot serve the production environment
we are on Jewel 10.2.2 on CentOS 7.2, kernel 3.10.0-327.36.3.el7.x86_64
Any help will be appreciated!
Vincent
Can you see anything that might indicate why the OSDs are taking a long time to start up? For example, are there any errors in the kernel log, or do the disks look like they are working very hard when the OSD tries to start?
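To check whether the disks are saturated while an OSD starts, something like the rough Python sketch below could be run on the OSD host. It assumes a Linux host where /proc/diskstats is readable, and it treats the 13th field of each line (milliseconds spent doing I/O) as the busy time. The function names and the 5 second interval are illustrative choices only, not anything that comes from Ceph.

```python
import time


def parse_diskstats(text):
    """Map device name -> total milliseconds spent doing I/O.

    Each /proc/diskstats line is: major, minor, name, then counters.
    The 13th field overall (index 12) is the time spent doing I/O in ms.
    """
    ticks = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 13:
            continue  # skip blank or unexpected lines
        ticks[fields[2]] = int(fields[12])
    return ticks


def busy_percent(before, after, interval_s):
    """Percent of the interval each device spent with I/O in flight."""
    interval_ms = interval_s * 1000.0
    return {
        dev: 100.0 * (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return busy % per device."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return busy_percent(before, after, interval_s)
```

Calling sample() while an affected OSD is starting and printing the result gives a per-device busy percentage; values that stay close to 100 for the OSD's data or journal device would suggest the disk itself is the bottleneck.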
Also, a quick Google search for “heartbeat_map is_healthy 'FileStore::op_tp thread” brings up several past threads; it might be worth seeing if any of them had a solution.
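To get a feel for how long each thread keeps timing out, the heartbeat_map lines quoted above can be summarised with a small script. This is only a sketch: the regular expression is written against the exact line format shown in this thread, the function name is made up for illustration, and the log path mentioned in the usage note is an assumption about the default location.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2016-11-15 01:47:28.344996 7f79d61f7700  1 heartbeat_map is_healthy
#   'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"heartbeat_map is_healthy '(?P<thread>[^']+)' "
    r"had timed out after (?P<limit>\d+)"
)


def summarize_timeouts(lines):
    """Return {thread: (first_seen, last_seen, count)} for timeout lines."""
    seen = {}
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if not m:
            continue  # not a heartbeat timeout line
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        thread = m.group("thread")
        if thread in seen:
            first, _, count = seen[thread]
            seen[thread] = (first, ts, count + 1)
        else:
            seen[thread] = (ts, ts, 1)
    return seen
```

Feeding it the lines of an OSD log (for osd.32 that would presumably be /var/log/ceph/ceph-osd.32.log) shows, per thread, when the timeouts began, when they were last reported, and how many times they were logged.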
