Blocked ops, OSD consuming memory, hammer

Heath Albritton <halbritt@xxxxxxxx> · Tue, 24 May 2016 14:16:33 -0700

Having some problems with my cluster.  Wondering if I could get some
troubleshooting tips:

I'm running hammer 0.94.5 on a small cluster with cache tiering: 3
spinning-disk nodes and 3 SSD nodes.

I'm seeing lots of blocked ops and slow requests.  The OSDs consume the
entirety of the system memory (128GB) and then fall over.  Seeing logs
like this:

2016-05-24 19:30:09.288941 7f63c126b700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60
2016-05-24 19:30:09.503712 7f63c5273700  0 log_channel(cluster) log [WRN] : map e7779 wrongly marked me down
2016-05-24 19:30:11.190178 7f63cabcc700  0 -- 10.164.245.22:6831/5013886 submit_message MOSDPGPushReply(9.10d 7762 [PushReplyOp(3110010d/rbd_data.9647882ae8944a.00000000000026e7/head//9)]) v2 remote, 10.164.245.23:6821/3028423, failed lossy con, dropping message 0xfc21e00
2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780 lsb_release_parse - failed to call lsb_release binary with error: (12) Cannot allocate memory
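
To get a feel for how often each of these symptoms shows up, the excerpt above can be tallied with a short script.  This is only a rough sketch: the category names are made up for illustration, the patterns are taken from the four messages pasted here and nothing else, and join_wrapped() exists only because mail clients wrap long log lines (a log file read from disk would not need it).

```python
import re
from collections import Counter

# Patterns copied from the messages in the excerpt; the category names
# are hypothetical labels chosen for this sketch.
PATTERNS = {
    "op_thread_timeout": re.compile(
        r"heartbeat_map is_healthy '(.+?)' had timed out after (\d+)"),
    "wrongly_marked_down": re.compile(r"map e(\d+) wrongly marked me down"),
    "dropped_message": re.compile(r"failed lossy con, dropping message"),
    "out_of_memory": re.compile(r"\(12\) Cannot allocate memory"),
}

# Every log entry starts with a timestamp such as "2016-05-24 19:30:09.288941 ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ ")

SAMPLE = """
2016-05-24 19:30:09.288941 7f63c126b700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60
2016-05-24 19:30:09.503712 7f63c5273700  0 log_channel(cluster) log
[WRN] : map e7779 wrongly marked me down
2016-05-24 19:30:11.190178 7f63cabcc700  0 --
10.164.245.22:6831/5013886 submit_message MOSDPGPushReply(9.10d 7762
[PushReplyOp(3110010d/rbd_data.9647882ae8944a.00000000000026e7/head//9)])
v2 remote, 10.164.245.23:6821/3028423, failed lossy con, dropping
message 0xfc21e00
2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780
lsb_release_parse - failed to call lsb_release binary with error: (12)
Cannot allocate memory
"""


def join_wrapped(lines):
    """Rejoin entries that were hard-wrapped: a line that does not start
    with a timestamp is treated as a continuation of the previous entry."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if TIMESTAMP.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def classify(entries):
    """Count how many entries match each pattern."""
    counts = Counter()
    for entry in entries:
        for name, pattern in PATTERNS.items():
            if pattern.search(entry):
                counts[name] += 1
    return counts


counts = classify(join_wrapped(SAMPLE.splitlines()))
for name, n in sorted(counts.items()):
    print(name, n)
```

Pointing the same two functions at a full OSD log (passing the open file instead of SAMPLE.splitlines()) would show whether the op thread timeouts start before or after the allocation failures.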

Eventually the OSD fails, leaving the cluster in an unhealthy state.

I can set noup, restart the OSDs and get them on the current map, but
once I put them back into the cluster, they eventually fail.
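
For reference, the noup workaround looks roughly like the sketch below.  It assumes the ceph CLI is on the path, that it runs on the host that owns the OSDs (the admin socket is local), and that the OSDs are restarted by hand, since the restart command depends on the init system.  The helper names, the poll interval and the slack parameter are made up for this sketch.

```python
import json
import subprocess
import time


def ceph(*args):
    """Run a ceph CLI command and return its stdout as text."""
    return subprocess.check_output(("ceph",) + args).decode()


def cluster_epoch():
    """Current OSD map epoch as reported by the monitors."""
    return json.loads(ceph("osd", "dump", "--format", "json"))["epoch"]


def osd_status(osd_id):
    """Admin-socket status of a local OSD (includes newest_map)."""
    return json.loads(ceph("daemon", "osd.%d" % osd_id, "status"))


def caught_up(status, epoch, slack=0):
    """True once the OSD's newest map is within `slack` epochs of the
    cluster's current epoch."""
    return status["newest_map"] >= epoch - slack


def wait_until_current(osd_ids, poll_seconds=10):
    """Poll the given OSDs until all of them hold the current map."""
    pending = set(osd_ids)
    while pending:
        epoch = cluster_epoch()
        pending = {i for i in pending
                   if not caught_up(osd_status(i), epoch)}
        if pending:
            time.sleep(poll_seconds)


def run(osd_ids):
    """Set noup, let the operator restart the OSDs, wait, unset noup."""
    ceph("osd", "set", "noup")      # keep restarted OSDs from being marked up
    input("Restart the OSDs now, then press Enter... ")
    wait_until_current(osd_ids)     # let them catch up on maps first
    ceph("osd", "unset", "noup")    # allow them back into the cluster
```

Calling run([23]) on the node hosting osd.23 would walk through the sequence for that one OSD.  It only automates the steps described above; it does nothing about the memory growth that follows once the OSDs rejoin.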

-H