Re: Flapping OSDs, Large meta directories in OSDs

On Tue, Dec 1, 2015 at 10:02 AM, Tom Christensen <pavera@xxxxxxxxx> wrote:
> Another thing that we don't quite grasp is that when we see slow requests
> now they almost always, probably 95% have the "known_if_redirected" state
> set.  What does this state mean?  Does it indicate we have OSD maps that are
> lagging and the cluster isn't really in sync?  Could this be the cause of
> our growing osdmaps?

This is just a flag set on operations by new clients to let the OSD
perform more effectively — you don't need to worry about it.

I'm not sure why you're getting a bunch of client blacklist
operations, but each one will generate a new OSDMap (if nothing else
prompts one), yes.
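
If you want to sanity-check whether those maps are actually being
trimmed, something like this should show it (a sketch; double-check the
report field names against your own output):

    # how many clients are currently blacklisted (each add/expiry touches the osdmap)
    ceph osd blacklist ls | wc -l

    # range of osdmap epochs the monitors are still holding
    ceph report 2>/dev/null | grep -E 'osdmap_(first|last)_committed'

If last_committed keeps racing ahead of first_committed, the old maps
aren't being trimmed, which would match the huge meta directories.
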
-Greg

>
> -Tom
>
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul)
> <paul.hewlett@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can
>> anybody confirm this?
>> I could not find any usage in the Ceph source code except that the value
>> is set in some of the test software…
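>>
>> One quick way to check against a live daemon (just a sketch, assuming
>> osd.0 is up and its admin socket is reachable):
>>
>>     ceph daemon osd.0 config show | grep filestore_xattr_use_omap
>>
>> If the option really has been dropped from the code, that grep should
>> come back empty even when the setting is still in ceph.conf.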
>>
>> Paul
>>
>>
>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Tom
>> Christensen <pavera@xxxxxxxxx>
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re:  Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient?  Concurrent with our hammer upgrade we went from
>> 3.16->3.19 on ubuntu 14.04.  We are looking to revert to the 3.16 kernel
>> we'd been running because we're also seeing an intermittent (it's happened
>> twice in 2 weeks) massive load spike that completely hangs the osd node
>> (we're talking about load averages that hit 20k+ before the box becomes
>> completely unresponsive).  We saw similar behavior on a 3.13 kernel, which
>> we resolved by moving to the 3.16 kernel we had before.  I'll try to catch
>> one with debug_ms=1 and see if I can tell whether we're hitting a similar hang.
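>>
>> The plan is to flip it on the fly rather than restart anything, roughly
>> like this (a sketch; adjust the target to the suspect OSD):
>>
>>     ceph tell osd.* injectargs '--debug_ms 1'       # everywhere, or
>>     ceph tell osd.1191 injectargs '--debug_ms 1'    # just the flapping OSD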
>>
>> To your comment about omap, we do have filestore xattr use omap = true in
>> our conf... which we believe was placed there by ceph-deploy (which we used
>> to deploy this cluster).  We are on xfs, but we do take tons of RBD
>> snapshots.  If either of these use cases causes the osdmaps to grow large,
>> then we may just be exceeding the limit on the number of rbd snapshots ceph
>> can handle (we take about 4-5000/day, 1 per RBD in the cluster).
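>>
>> For scale, this is roughly how we count them (a sketch; "rbd" stands in
>> for whichever pool the images live in):
>>
>>     pool=rbd
>>     for img in $(rbd ls -p $pool); do
>>         rbd snap ls $pool/$img | tail -n +2 | wc -l    # strip the header line
>>     done | awk '{s+=$1} END {print "total snapshots:", s}'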
>>
>> An interesting note, we had an OSD flap earlier this morning, and when it
>> did, immediately after it came back I checked its meta directory size with
>> du -sh, this returned immediately, and showed a size of 107GB.  The fact
>> that it returned immediately indicated to me that something had just
>> recently read through that whole directory and it was all cached in the FS
>> cache.  Normally a du -sh on the meta directory takes a good 5 minutes to
>> return.  Anyway, since it dropped this morning, its meta directory size has
>> continued to shrink and is down to 93GB.  So it feels like something happens
>> that makes the OSD read all its historical maps, which results in the OSD
>> hanging because there are a ton of them, and then it wakes up and realizes
>> it can delete a bunch of them...
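>>
>> In case anyone wants to watch the same thing, this is roughly what we're
>> doing (a sketch, assuming default filestore paths and our osd id):
>>
>>     # what the OSD itself thinks it is holding (compare oldest_map vs newest_map)
>>     ceph daemon osd.1191 status
>>
>>     # what is actually on disk
>>     du -sh /var/lib/ceph/osd/ceph-1191/current/meta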
>>
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvanders@xxxxxxxxx>
>> wrote:
>>>
>>> The trick with debugging heartbeat problems is to grep back through the
>>> log to find the last thing the affected thread was doing, e.g. is
>>> 0x7f5affe72700 stuck in messaging, writing to the disk, reading through the
>>> omap, etc..
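>>>
>>> Something as simple as this is usually enough (a sketch; substitute your
>>> osd id and log path):
>>>
>>>     grep 7f5affe72700 /var/log/ceph/ceph-osd.1191.log | tail -40
>>>
>>> The last few lines before the heartbeat timeouts tell you what that
>>> thread was blocked on.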
>>>
>>> I agree this doesn't look to be network related, but if you want to rule
>>> it out you should use debug_ms=1.
>>>
>>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>>> similarly started getting slow requests. To make a long story short, our
>>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>>> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
>>> of this was 900s of slow requests, then an ms log showing "initiating
>>> reconnect". Until we got the kernel upgraded everywhere, we used a
>>> workaround of ms tcp read timeout = 60.
>>> So, check your kernels, and upgrade if they're ancient. Latest el6
>>> kernels work for us.
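>>>
>>> For reference, the workaround was just this on the osd hosts (a sketch;
>>> the default is 900s, which is why the slow requests lasted ~900s):
>>>
>>>     [osd]
>>>         ms tcp read timeout = 60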
>>>
>>> Otherwise, those huge osd leveldbs don't look right. (Unless you're
>>> using tons and tons of omap...) And it kinda reminds me of the other problem
>>> we hit after the hammer upgrade, namely the return of the ever-growing mon
>>> leveldb issue. The solution was to recreate the mons one by one. Perhaps
>>> you've hit something similar with the OSDs. debug_osd=10 might be good
>>> enough to see what the osd is doing, maybe you need debug_filestore=10 also.
>>> If that doesn't show the problem, bump those up to 20.
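>>>
>>> E.g. per-OSD via the admin socket, so you don't flood every node (a
>>> sketch):
>>>
>>>     ceph daemon osd.1191 config set debug_osd 10
>>>     ceph daemon osd.1191 config set debug_filestore 10
>>>     # remember to drop them back to the defaults afterwards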
>>>
>>> Good luck,
>>>
>>> Dan
>>>
>>> On 30 Nov 2015 20:56, "Tom Christensen" <pavera@xxxxxxxxx> wrote:
>>> >
>>> > We recently upgraded to 0.94.3 from firefly and now for the last week
>>> > have had intermittent slow requests and flapping OSDs.  We have been unable
>>> > to nail down the cause, but it's feeling like it may be related to our
>>> > osdmaps not getting deleted properly.  Most of our osds are now storing over
>>> > 100GB of data in the meta directory, almost all of that is historical osd
>>> > maps going back over 7 days old.
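>>> >
>>> > This is roughly how we are measuring it (a sketch; default filestore
>>> > paths, and the on-disk osdmap object names may vary by release):
>>> >
>>> >     du -sh /var/lib/ceph/osd/ceph-*/current/meta
>>> >     find /var/lib/ceph/osd/ceph-1191/current/meta -name 'osdmap*' | wc -l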
>>> >
>>> > We did do a small cluster change (We added 35 OSDs to a 1445 OSD
>>> > cluster), the rebalance took about 36 hours, and it completed 10 days ago.
>>> > Since that time the cluster has been HEALTH_OK and all pgs have been
>>> > active+clean except for when we have an OSD flap.
>>> >
>>> > When the OSDs flap they do not crash and restart, they just go
>>> > unresponsive for 1-3 minutes, and then come back alive all on their own.
>>> > They get marked down by peers, which causes some peering, and then they
>>> > just come back, rejoin the cluster, and continue on their merry way.
>>> >
>>> > We see a bunch of this in the logs while the OSD is catatonic:
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166
>>> > 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700'
>>> > had timed out after 15
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176
>>> > 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping
>>> > ping request
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210
>>> > 7f5b04e7c700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700'
>>> > had timed out after 15
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218
>>> > 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping
>>> > ping request
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288
>>> > 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700'
>>> > had timed out after 15
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293
>>> > 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping
>>> > ping request
>>> >
>>> >
>>> > I have a chunk of logs at debug 20/5; not sure if I should have done
>>> > just 20... It's pretty hard to catch: we basically have to see the slow
>>> > requests and get debug logging set in about a 5-10 second window before
>>> > the OSD stops responding to the admin socket...
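>>> >
>>> > The hack we are using to catch that window looks roughly like this (a
>>> > sketch; the health-detail pattern is a guess against our hammer output,
>>> > so verify it before relying on it):
>>> >
>>> >     while sleep 2; do
>>> >         # pull the osd id out of the first "ops are blocked ... on osd.N" line
>>> >         osd=$(ceph health detail 2>/dev/null | awk '/blocked.*osd\./ {print $NF; exit}')
>>> >         if [ -n "$osd" ]; then
>>> >             ceph tell "$osd" injectargs '--debug_osd 20 --debug_ms 1'
>>> >             echo "bumped debug on $osd"
>>> >             break
>>> >         fi
>>> >     done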
>>> >
>>> > As networking is almost always the cause of flapping OSDs, we have
>>> > tested the network quite extensively.  It hasn't changed physically since
>>> > before the hammer upgrade, and it was performing well.  We have run large
>>> > numbers of ping tests and have not seen a single dropped packet between
>>> > osd nodes or between osd nodes and mons.
>>> >
>>> > I don't see any error packets or drops on switches either.
>>> >
>>> > Ideas?
>>> >
>>> >
>>
>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



