Having some problems with my cluster. Wondering if I could get some troubleshooting tips.

Running hammer 0.94.5. Small cluster with cache tiering: 3 spinning nodes and 3 SSD nodes.

Lots of blocked ops and slow requests. The OSDs consume all of the system memory (128GB) and then fall over.

Seeing logs like this:

2016-05-24 19:30:09.288941 7f63c126b700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60
2016-05-24 19:30:09.503712 7f63c5273700 0 log_channel(cluster) log [WRN] : map e7779 wrongly marked me down
2016-05-24 19:30:11.190178 7f63cabcc700 0 -- 10.164.245.22:6831/5013886 submit_message MOSDPGPushReply(9.10d 7762 [PushReplyOp(3110010d/rbd_data.9647882ae8944a.00000000000026e7/head//9)]) v2 remote, 10.164.245.23:6821/3028423, failed lossy con, dropping message 0xfc21e00
2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780 lsb_release_parse - failed to call lsb_release binary with error: (12) Cannot allocate memory

Eventually the OSD fails, and the cluster is left in an unhealthy state.

I can set noup, restart the OSDs, and get them caught up to the current map, but once I put them back into the cluster they eventually fail again.

-H

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
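
For anyone triaging logs like the four lines quoted above, here is a minimal sketch of tallying them by symptom, so it is easier to see whether op-thread timeouts, false "down" markings, dropped messages, or allocation failures dominate before an OSD dies. The category names and regular expressions are assumptions derived only from the lines quoted in this post; they are not an official Ceph log format, and real logs will likely need more patterns.

```python
import re
from collections import Counter

# Hypothetical category names; each pattern is taken from one of the
# log lines quoted in the post, not from any Ceph documentation.
PATTERNS = {
    "op_thread_timeout": re.compile(
        r"heartbeat_map is_healthy '.*' had timed out after \d+"
    ),
    "wrongly_marked_down": re.compile(r"wrongly marked me down"),
    "dropped_message": re.compile(r"failed lossy con, dropping message"),
    "out_of_memory": re.compile(r"\(12\) Cannot allocate memory"),
}


def tally(lines):
    """Count how many lines match each symptom pattern."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# The four log lines quoted in the post, used as sample input.
SAMPLE = [
    "2016-05-24 19:30:09.288941 7f63c126b700 1 heartbeat_map is_healthy "
    "'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60",
    "2016-05-24 19:30:09.503712 7f63c5273700 0 log_channel(cluster) log "
    "[WRN] : map e7779 wrongly marked me down",
    "2016-05-24 19:30:11.190178 7f63cabcc700 0 -- 10.164.245.22:6831/5013886 "
    "submit_message MOSDPGPushReply(9.10d 7762 [PushReplyOp(3110010d/"
    "rbd_data.9647882ae8944a.00000000000026e7/head//9)]) v2 remote, "
    "10.164.245.23:6821/3028423, failed lossy con, dropping message 0xfc21e00",
    "2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780 lsb_release_parse "
    "- failed to call lsb_release binary with error: (12) Cannot allocate memory",
]

if __name__ == "__main__":
    for name, count in sorted(tally(SAMPLE).items()):
        print(name, count)
```

To run it against a real log, pass an open file to `tally` instead of `SAMPLE`, for example `tally(open(path))`, where `path` is wherever the OSD log lives on that node.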