On 29/11/17 17:24, Matthew Vernon wrote:
> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It will then behave
> for a bit, and then go back to doing this.
>
> It's always the same OSD, and we've tried replacing the underlying disk.
>
> The logs have lots of entries of the form
>
> 2017-11-29 17:18:51.097230 7fcc06919700 1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15

Thanks for the various helpful suggestions in response to this.

In case you're interested (and for the archives), the answer was
Gnocchi: all the slow requests were for one particular pool, which is
where we were sending metrics from an OpenStack instance. Gnocchi before
version 4.0 is, I learn, known to kill Ceph because its use of librados
is rather badly behaved. Newer OpenStack releases (from Pike, I think)
use a newer Gnocchi.

We stopped ceilometer and gnocchi, and the problem went away. Thanks are
due to Red Hat support for finding this for us :)

Regards,

Matthew

-- 
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.