On Tue, Dec 1, 2015 at 10:02 AM, Tom Christensen <pavera@xxxxxxxxx> wrote:
> Another thing that we don't quite grasp is that when we see slow requests
> now they almost always, probably 95% of the time, have the
> "known_if_redirected" state set. What does this state mean? Does it
> indicate we have OSD maps that are lagging and the cluster isn't really
> in sync? Could this be the cause of our growing osdmaps?

This is just a flag set on operations by new clients to let the OSD perform
more effectively; you don't need to worry about it.

I'm not sure why you're getting a bunch of client blacklist operations, but
each one will generate a new OSDMap (if nothing else prompts one), yes.
-Greg

>
> -Tom
>
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul)
> <paul.hewlett@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> I believe that 'filestore xattr use omap' is no longer used in Ceph – can
>> anybody confirm this? I could not find any usage in the Ceph source code
>> except that the value is set in some of the test software…
>>
>> Paul
>>
>>
>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Tom Christensen <pavera@xxxxxxxxx>
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re: Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient? Concurrent with our hammer upgrade we went from
>> 3.16 to 3.19 on Ubuntu 14.04. We are looking to revert to the 3.16 kernel
>> we'd been running, because we're also seeing an intermittent (it's
>> happened twice in 2 weeks) massive load spike that completely hangs the
>> OSD node (we're talking about load averages that hit 20k+ before the box
>> becomes completely unresponsive). We saw similar behavior on a 3.13
>> kernel, which was resolved by moving to the 3.16 kernel we had before.
>> I'll try to catch one with debug_ms=1 and see whether we're hitting a
>> similar hang.
>>
>> To your comment about omap: we do have "filestore xattr use omap = true"
>> in our conf, which we believe was placed there by ceph-deploy (which we
>> used to deploy this cluster). We are on xfs, but we do take tons of RBD
>> snapshots. If either of these use cases causes a lot of osdmap growth,
>> then we may just be exceeding the limits of the number of RBD snapshots
>> Ceph can handle (we take about 4,000-5,000/day, one per RBD in the
>> cluster).
>>
>> An interesting note: we had an OSD flap earlier this morning, and
>> immediately after it came back I checked its meta directory size with
>> du -sh. This returned immediately and showed a size of 107GB. The fact
>> that it returned immediately indicated to me that something had just
>> recently read through that whole directory and it was all cached in the
>> FS cache; normally a du -sh on the meta directory takes a good 5 minutes
>> to return. Anyway, since it dropped this morning its meta directory size
>> has continued to shrink and is down to 93GB. So it feels like something
>> happens that makes the OSD read all of its historical maps, which results
>> in the OSD hanging because there are a ton of them, and then it wakes up
>> and realizes it can delete a bunch of them...
>>
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvanders@xxxxxxxxx> wrote:
>>>
>>> The trick with debugging heartbeat problems is to grep back through the
>>> log to find the last thing the affected thread was doing, e.g. is
>>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through
>>> the omap, etc.
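A minimal sketch of that grep, assuming the default log location and borrowing osd.1191 and the stuck thread id from the log excerpt further down (both are stand-ins; use whichever OSD and thread your own heartbeat warnings name):

    # Show the last things the stuck thread logged, minus the heartbeat noise;
    # the log path and osd id here are assumptions, not taken from the cluster:
    grep 7f5affe72700 /var/log/ceph/ceph-osd.1191.log | grep -v heartbeat_map | tail -n 50

Whatever operation appears last before the timeouts started is usually where the thread is stuck (messaging, disk I/O, omap reads, and so on). This presumes logging was already turned up, since at default levels the worker threads log very little.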
>>>
>>> I agree this doesn't look to be network related, but if you want to rule
>>> it out you should use debug_ms=1.
>>>
>>> Last week we upgraded a 1200 OSD cluster from firefly to 0.94.5 and
>>> similarly started getting slow requests. To make a long story short, our
>>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>>> ancient el6 kernel (these OSD servers had ~800 days' uptime). The
>>> signature of this was 900s of slow requests, then an ms log showing
>>> "initiating reconnect". Until we got the kernel upgraded everywhere, we
>>> used a workaround of "ms tcp read timeout = 60".
>>>
>>> So, check your kernels, and upgrade if they're ancient. The latest el6
>>> kernels work for us.
>>>
>>> Otherwise, those huge OSD leveldbs don't look right (unless you're using
>>> tons and tons of omap...). It kinda reminds me of the other problem we
>>> hit after the hammer upgrade, namely the return of the ever-growing mon
>>> leveldb issue. The solution was to recreate the mons one by one; perhaps
>>> you've hit something similar with the OSDs. debug_osd=10 might be good
>>> enough to see what the OSD is doing, and maybe you need
>>> debug_filestore=10 as well. If that doesn't show the problem, bump those
>>> up to 20.
>>>
>>> Good luck,
>>>
>>> Dan
>>>
>>> On 30 Nov 2015 20:56, "Tom Christensen" <pavera@xxxxxxxxx> wrote:
>>> >
>>> > We recently upgraded to 0.94.3 from firefly and for the last week have
>>> > had intermittent slow requests and flapping OSDs. We have been unable
>>> > to nail down the cause, but it's feeling like it may be related to our
>>> > osdmaps not getting deleted properly. Most of our OSDs are now storing
>>> > over 100GB of data in the meta directory; almost all of that is
>>> > historical osdmaps more than 7 days old.
>>> >
>>> > We did make a small cluster change (we added 35 OSDs to a 1445 OSD
>>> > cluster); the rebalance took about 36 hours and completed 10 days ago.
>>> > Since then the cluster has been HEALTH_OK and all pgs have been
>>> > active+clean, except for when we have an OSD flap.
>>> >
>>> > When the OSDs flap they do not crash and restart; they just go
>>> > unresponsive for 1-3 minutes and then come back alive all on their own.
>>> > They get marked down by peers, cause some peering, and then simply
>>> > rejoin the cluster and continue on their merry way.
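As a concrete illustration of the knobs Dan mentions above (a sketch only; osd.1191 is borrowed from the log excerpt below, and the values are the ones named in his reply):

    # ceph.conf, [osd] section - the workaround for the rare sendmsg hang on
    # old kernels:
    #   ms tcp read timeout = 60
    #
    # Turn up logging on a running OSD without restarting it; set the values
    # back to defaults once a slow-request episode has been captured:
    ceph tell osd.1191 injectargs '--debug-osd 10 --debug-filestore 10 --debug-ms 1'

injectargs takes effect immediately but does not persist across a restart, which is convenient when you only want verbose logs for a short window.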
>>> >
>>> > We see a bunch of this in the logs while the OSD is catatonic:
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210 7f5b04e7c700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288 7f5b03679700 1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700' had timed out after 15
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping ping request
>>> >
>>> > I have a chunk of logs at debug 20/5; I'm not sure if I should have
>>> > done just 20... It's pretty hard to catch: we basically have to see the
>>> > slow requests and get debug logging set in about a 5-10 second window
>>> > before the OSD stops responding to the admin socket...
>>> >
>>> > As networking is almost always the cause of flapping OSDs, we have
>>> > tested the network quite extensively. It hasn't changed physically
>>> > since before the hammer upgrade, and it was performing well. We have
>>> > done large amounts of ping tests and have not seen a single dropped
>>> > packet between OSD nodes or between OSD nodes and mons.
>>> >
>>> > I don't see any error packets or drops on the switches either.
>>> >
>>> > Ideas?
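Since the thread keeps coming back to OSDs holding a long tail of historical maps, one quick way to check that directly; a sketch assuming a default filestore layout, with osd.1191 again as a stand-in id:

    # oldest_map / newest_map in the status output give the span of osdmap
    # epochs the OSD still has; a very large gap suggests old maps are not
    # being trimmed:
    ceph daemon osd.1191 status
    # Size of the on-disk map store for the same OSD (the "meta" directory
    # discussed above); the path assumes the default data directory:
    du -sh /var/lib/ceph/osd/ceph-1191/current/meta

The admin socket command has to run locally on the node hosting that OSD, and it won't answer while the daemon is in one of its catatonic spells, so compare against a healthy OSD as well.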