On 11/30/2015 08:56 PM, Tom Christensen wrote:
> We recently upgraded to 0.94.3 from Firefly and for the last week have
> had intermittent slow requests and flapping OSDs. We have been unable
> to nail down the cause, but it's feeling like it may be related to our
> osdmaps not getting deleted properly. Most of our OSDs are now storing
> over 100GB of data in the meta directory, almost all of it historical
> osdmaps more than 7 days old.
>

That is odd. Do you have anything special in ceph.conf regarding the
OSDs and how many maps they store? I guess not, but I just wanted to
check.

There are some settings you might want to play with on such a large OSD
cluster. Looking at src/common/config_opts.h:

OPTION(osd_map_dedup, OPT_BOOL, true)
OPTION(osd_map_max_advance, OPT_INT, 150) // make this < cache_size!
OPTION(osd_map_cache_size, OPT_INT, 200)
OPTION(osd_map_message_max, OPT_INT, 100) // max maps per MOSDMap message
OPTION(osd_map_share_max_epochs, OPT_INT, 100) // cap on # of inc maps we send to peers, clients

You also might want to take a look at this PDF from CERN:
https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf

> We did do a small cluster change (we added 35 OSDs to a 1445-OSD
> cluster); the rebalance took about 36 hours and completed 10 days ago.
> Since that time the cluster has been HEALTH_OK and all PGs have been
> active+clean, except for when we have an OSD flap.
>
> When the OSDs flap they do not crash and restart, they just go
> unresponsive for 1-3 minutes and then come back alive all on their
> own. They get marked down by peers, cause some peering, and then they
> just come back, rejoin the cluster, and continue on their merry way.
>

Do you see any high CPU or memory usage at that point?

> We see a bunch of this in the logs while the OSD is catatonic:
>
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>
> I have a chunk of logs at debug 20/5, not sure if I should have done
> just 20... It's pretty hard to catch; we basically have to see the
> slow requests and get debug logging set in about a 5-10 second window
> before the OSD stops responding to the admin socket.
>
> As networking is almost always the cause of flapping OSDs, we have
> tested the network quite extensively. It hasn't changed physically
> since before the Hammer upgrade and was performing well. We have done
> a large number of ping tests and have not seen a single dropped packet
> between OSD nodes or between OSD nodes and mons.
>
> I don't see any error packets or drops on the switches either.
>
> Ideas?
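For reference, the osdmap options above can be overridden in the [osd]
section of ceph.conf. A minimal sketch is below; the values are purely
illustrative, not a tested recommendation for your cluster, and per the
comment in config_opts.h osd_map_max_advance must stay below
osd_map_cache_size:

    [osd]
    # Illustrative overrides only -- tune for your own cluster.
    # Keep osd_map_max_advance below osd_map_cache_size.
    osd_map_cache_size = 500
    osd_map_max_advance = 400
    osd_map_message_max = 100
    osd_map_share_max_epochs = 100

These can also be changed at runtime with something like
ceph tell osd.* injectargs '--osd-map-cache-size 500 --osd-map-max-advance 400',
though an OSD restart may still be needed for some of them to fully take
effect.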
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on