I've seen something similar to this when bringing an OSD back into a
cluster that has a lot of I/O that is "close" to the max performance of
the drives.

For Jewel, there is a "mon osd prime pg temp" option [0] which really
helped reduce the huge memory usage when an OSD starts up, and helped a
bit with the slow/blocked I/O too. I created a backport for Hammer that
I didn't have any problems with, but it was rejected to keep new
features out of Hammer. You could patch Hammer yourself; you only have
to run the new code on the monitors to get the benefit [1]. (A couple of
command sketches follow below the quoted message.)

[0] http://docs.ceph.com/docs/jewel/rados/configuration/mon-config-ref/#miscellaneous
[1] https://github.com/ceph/ceph/pull/7848

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Tue, May 24, 2016 at 3:16 PM, Heath Albritton <halbritt@xxxxxxxx> wrote:
> Having some problems with my cluster. Wondering if I could get some
> troubleshooting tips:
>
> Running Hammer 0.94.5. Small cluster with cache tiering: 3 spinning
> nodes and 3 SSD nodes.
>
> Lots of blocked ops and slow requests. OSDs are consuming the entirety
> of the system memory (128GB) and then falling over. Seeing logs like
> this:
>
> 2016-05-24 19:30:09.288941 7f63c126b700  1 heartbeat_map is_healthy
> 'FileStore::op_tp thread 0x7f63cb3cd700' had timed out after 60
> 2016-05-24 19:30:09.503712 7f63c5273700  0 log_channel(cluster) log
> [WRN] : map e7779 wrongly marked me down
> 2016-05-24 19:30:11.190178 7f63cabcc700  0 --
> 10.164.245.22:6831/5013886 submit_message MOSDPGPushReply(9.10d 7762
> [PushReplyOp(3110010d/rbd_data.9647882ae8944a.00000000000026e7/head//9)])
> v2 remote, 10.164.245.23:6821/3028423, failed lossy con, dropping
> message 0xfc21e00
> 2016-05-24 19:30:22.832381 7f63bca62700 -1 osd.23 7780
> lsb_release_parse - failed to call lsb_release binary with error: (12)
> Cannot allocate memory
>
> Eventually the OSD fails and the cluster is left in an unhealthy state.
>
> I can set noup and restart the OSDs to get them on the current map, but
> once I put them back into the cluster, they eventually fail.
>
> -H
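
A minimal sketch of turning on the priming behavior mentioned above,
assuming a Jewel-era cluster (the internal option name is
mon_osd_prime_pg_temp, and only the monitors need it):

    # ceph.conf on the monitor hosts
    [mon]
    mon osd prime pg temp = true

    # or flip it at runtime without restarting the monitors
    ceph tell mon.* injectargs '--mon_osd_prime_pg_temp=true'

With this set, the monitors pre-populate pg_temp mappings when an OSD
comes up, instead of every OSD working them out during peering.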
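
A rough sketch of pulling that PR [1] onto a Hammer tree for a
monitor-only rebuild; the branch name here is made up, and the merge may
need hand-resolving against a Hammer base:

    git clone https://github.com/ceph/ceph.git && cd ceph
    git checkout v0.94.5
    # GitHub exposes every pull request as a fetchable ref
    git fetch origin pull/7848/head:prime-pg-temp
    git merge prime-pg-temp    # or cherry-pick the individual commits

Since only the monitors need the new code, you can deploy the rebuilt
packages to the mon hosts and leave the OSD nodes on stock Hammer.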
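
For the blocked ops Heath reports, the usual first look, assuming
admin-socket access on the OSD hosts (osd.23 just mirrors the log
excerpt above):

    ceph health detail    # lists which OSDs have slow/blocked requests
    # on the OSD's own host, via the admin socket:
    ceph daemon osd.23 dump_ops_in_flight
    ceph daemon osd.23 dump_historic_ops

dump_ops_in_flight shows where the currently stuck requests are waiting;
dump_historic_ops keeps a ring of recent slow ops with per-stage timings.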
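
And the noup dance from the quoted message, spelled out as a sketch;
the restart command varies by init system, sysvinit shown here for a
Hammer-era node:

    ceph osd set noup                  # keep restarted OSDs from being marked up
    /etc/init.d/ceph restart osd.23    # or: service ceph restart osd.23
    # wait for the daemons to catch up to the current map epoch, then:
    ceph osd unset noup

This lets the OSDs chew through the backlog of map epochs quietly before
they take client I/O, which is exactly the window where the memory blowup
tends to happen.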