Re: One OSD misbehaving (spinning 100% CPU, delayed ops)

Matthew Vernon <mv3@xxxxxxxxxxxx> · Thu, 14 Dec 2017 13:26:54 +0000

On 29/11/17 17:24, Matthew Vernon wrote:

> We have a 3,060 OSD ceph cluster (running Jewel
> 10.2.7-0ubuntu0.16.04.1), and one OSD on one host keeps misbehaving - by
> which I mean it keeps spinning ~100% CPU (cf ~5% for other OSDs on that
> host), and having ops blocking on it for some time. It will then behave
> for a bit, and then go back to doing this.
> 
> It's always the same OSD, and we've tried replacing the underlying disk.
> 
> The logs have lots of entries of the form
> 
> 2017-11-29 17:18:51.097230 7fcc06919700  1 heartbeat_map is_healthy
> 'OSD::osd_op_tp thread 0x7fcc29fec700' had timed out after 15

Thanks for the various helpful suggestions in response to this. In case
you're interested (and for the archives), the answer was Gnocchi - all
the slow requests were for a particular pool, which is where we were
sending metrics from an OpenStack instance. Gnocchi versions before 4.0
are, I learn, known to kill Ceph because their use of librados is rather
badly behaved. Newer OpenStack releases (from Pike, I think) use a newer
Gnocchi. We stopped ceilometer and gnocchi, and the problem went away.
Thanks are due to Red Hat support for finding this for us :)
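
A minimal sketch of one way to spot this kind of pattern (not necessarily
how it was found here): tally the in-flight ops on the busy OSD by pool.
It assumes the input is the parsed JSON from
"ceph daemon osd.<id> dump_ops_in_flight", with an "ops" list whose
entries carry a "description" string containing a PG id of the form
<pool>.<hex>. The exact description layout varies between Ceph releases,
so the token match below is an assumption, and tally_pools is a
hypothetical helper name.

```python
import re
from collections import Counter

# A PG id looks like "<pool>.<hex>", e.g. "7.1a"; the pool id is the
# part before the dot. Assumption: it appears as a whitespace-separated
# token in each op's description.
PGID_TOKEN = re.compile(r"^(\d+)\.[0-9a-f]+$")


def tally_pools(dump):
    """Count ops per pool id in a parsed ops dump (hypothetical helper)."""
    counts = Counter()
    for op in dump.get("ops", []):
        for token in op.get("description", "").split():
            match = PGID_TOKEN.match(token)
            if match:
                counts[int(match.group(1))] += 1
                break  # count each op at most once
    return counts
```

Feeding it json.load() of the dump and printing counts.most_common()
shows whether one pool dominates; "ceph osd lspools" maps the pool ids
back to names.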

Regards,

Matthew

-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 