On 11/30/2015 08:56 PM, Tom Christensen wrote:
> We recently upgraded to 0.94.3 from Firefly and for the last week have
> had intermittent slow requests and flapping OSDs. We have been unable
> to nail down the cause, but it's feeling like it may be related to our
> osdmaps not getting deleted properly. Most of our OSDs are now storing
> over 100GB of data in the meta directory, almost all of it historical
> osdmaps more than 7 days old.
>

That is odd. Do you have anything special in ceph.conf regarding the
OSDs and how many maps they store? I guess not, but I just wanted to
check.

There are some settings you might want to play with on such a large OSD
cluster. Looking at src/common/config_opts.h:

OPTION(osd_map_dedup, OPT_BOOL, true)
OPTION(osd_map_max_advance, OPT_INT, 150) // make this < cache_size!
OPTION(osd_map_cache_size, OPT_INT, 200)
OPTION(osd_map_message_max, OPT_INT, 100) // max maps per MOSDMap message
OPTION(osd_map_share_max_epochs, OPT_INT, 100) // cap on # of inc maps we send to peers, clients

You also might want to take a look at this PDF from CERN:
https://cds.cern.ch/record/2015206/files/CephScaleTestMarch2015.pdf

> We did do a small cluster change (we added 35 OSDs to a 1445-OSD
> cluster); the rebalance took about 36 hours and completed 10 days ago.
> Since that time the cluster has been HEALTH_OK and all PGs have been
> active+clean, except for when we have an OSD flap.
>
> When the OSDs flap they do not crash and restart, they just go
> unresponsive for 1-3 minutes and then come back alive all on their
> own. They get marked down by peers, cause some peering, and then they
> just come back, rejoin the cluster, and continue on their merry way.
>

Do you see any high CPU or memory usage at that point?

> We see a bunch of this in the logs while the OSD is catatonic:
>
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
> Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>
> I have a chunk of logs at debug 20/5, not sure if I should have done
> just 20... It's pretty hard to catch; we basically have to see the
> slow requests and get debug logging set in about a 5-10 second window
> before the OSD stops responding to the admin socket.
>
> As networking is almost always the cause of flapping OSDs, we have
> tested the network quite extensively. It hasn't changed physically
> since before the Hammer upgrade and was performing well. We have done
> a large number of ping tests and have not seen a single dropped packet
> between OSD nodes or between OSD nodes and mons.
>
> I don't see any error packets or drops on the switches either.
>
> Ideas?
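For reference, the osdmap options above can be overridden in the [osd]
section of ceph.conf. A minimal sketch is below; the values are purely
illustrative, not a tested recommendation for your cluster, and per the
comment in config_opts.h osd_map_max_advance must stay below
osd_map_cache_size:

    [osd]
    # Illustrative overrides only -- tune for your own cluster.
    # Keep osd_map_max_advance below osd_map_cache_size.
    osd_map_cache_size = 500
    osd_map_max_advance = 400
    osd_map_message_max = 100
    osd_map_share_max_epochs = 100

These can also be changed at runtime with something like
ceph tell osd.* injectargs '--osd-map-cache-size 500 --osd-map-max-advance 400',
though an OSD restart may still be needed for some of them to fully take
effect.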
--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on