Re: Flapping OSDs, Large meta directories in OSDs


 



We were able to prevent the blacklist operations, and now the cluster is much happier; however, the OSDs still have not started cleaning up old osd maps after 48 hours.  Is there anything we can do to poke them into cleaning up the old maps?
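
One way to watch whether the maps are actually being trimmed (osd id is just
an example) is to compare the oldest and newest osdmap epochs an OSD reports
on its admin socket:

ceph daemon osd.1191 status | grep -E 'oldest_map|newest_map'

Once trimming kicks in, the oldest_map epoch should start moving forward.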



On Wed, Dec 2, 2015 at 11:25 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Tue, Dec 1, 2015 at 10:02 AM, Tom Christensen <pavera@xxxxxxxxx> wrote:
> Another thing that we don't quite grasp is that when we see slow requests
> now, they almost always (probably 95% of the time) have the "known_if_redirected"
> state set.  What does this state mean?  Does it indicate we have OSD maps that are
> lagging and the cluster isn't really in sync?  Could this be the cause of
> our growing osdmaps?

This is just a flag set on operations by new clients to let the OSD
perform more effectively — you don't need to worry about it.

I'm not sure why you're getting a bunch of client blacklist
operations, but each one will generate a new OSDMap (if nothing else
prompts one), yes.
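
If you want to see what's actually on the blacklist at the moment (and when
each entry expires), this should list the current entries:

ceph osd blacklist ls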
-Greg

>
> -Tom
>
>
> On Tue, Dec 1, 2015 at 2:35 AM, HEWLETT, Paul (Paul)
> <paul.hewlett@xxxxxxxxxxxxxxxxxx> wrote:
>>
>> I believe that ‘filestore xattr use omap’ is no longer used in Ceph – can
>> anybody confirm this?
>> I could not find any usage in the Ceph source code except that the value
>> is set in some of the test software…
>>
>> Paul
>>
>>
>> From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Tom
>> Christensen <pavera@xxxxxxxxx>
>> Date: Monday, 30 November 2015 at 23:20
>> To: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re: Flapping OSDs, Large meta directories in OSDs
>>
>> What counts as ancient?  Concurrent with our hammer upgrade we went from
>> 3.16 to 3.19 on Ubuntu 14.04.  We are looking to revert to the 3.16 kernel
>> we'd been running, because we're also seeing an intermittent (it's happened
>> twice in 2 weeks) massive load spike that completely hangs the osd node
>> (we're talking about load averages that hit 20k+ before the box becomes
>> completely unresponsive).  We saw similar behavior on a 3.13 kernel, which
>> was resolved by moving to the 3.16 kernel we had before.  I'll try to catch
>> one with debug_ms=1 and see whether we're hitting a similar hang.
>>
>> To your comment about omap, we do have filestore xattr use omap = true in
>> our conf... which we believe was placed there by ceph-deploy (which we used
>> to deploy this cluster).  We are on xfs, but we do take tons of RBD
>> snapshots.  If either of these use cases makes the osd maps grow large,
>> then we may just be exceeding the practical limit on the number of rbd
>> snapshots ceph can handle (we take about 4-5000/day, 1 per RBD in the cluster).
>>
>> An interesting note: we had an OSD flap earlier this morning, and
>> immediately after it came back I checked its meta directory size with
>> du -sh.  It returned immediately and showed a size of 107GB.  The fact
>> that it returned immediately tells me something had just recently read
>> through that whole directory, so it was all sitting in the FS cache;
>> normally a du -sh on the meta directory takes a good 5 minutes to
>> return.  Anyway, since the OSD dropped this morning its meta directory size
>> has continued to shrink and is down to 93GB.  So it feels like something
>> happens that makes the OSD read all of its historical maps, which hangs the
>> OSD because there are a ton of them, and then it wakes up and realizes it
>> can delete a bunch of them...
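>>
>> For concreteness, the du in question is just this, run against the default
>> filestore layout (osd id is just an example):
>>
>> du -sh /var/lib/ceph/osd/ceph-1191/current/meta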
>>
>> On Mon, Nov 30, 2015 at 2:11 PM, Dan van der Ster <dvanders@xxxxxxxxx>
>> wrote:
>>>
>>> The trick with debugging heartbeat problems is to grep back through the
>>> log to find the last thing the affected thread was doing, e.g. is
>>> 0x7f5affe72700 stuck in messaging, writing to disk, reading through the
>>> omap, etc.
>>>
>>> I agree this doesn't look to be network related, but if you want to rule
>>> it out you should use debug_ms=1.
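>>>
>>> Concretely, something like this, taking the thread id straight from your
>>> heartbeat_map warnings (osd id and log path are just examples):
>>>
>>> grep 7f5affe72700 /var/log/ceph/ceph-osd.1191.log | tail -n 50
>>> ceph tell osd.1191 injectargs '--debug_ms 1'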
>>>
>>> Last week we upgraded a 1200 osd cluster from firefly to 0.94.5 and
>>> similarly started getting slow requests. To make a long story short, our
>>> issue turned out to be sendmsg blocking (very rarely), probably due to an
>>> ancient el6 kernel (these osd servers had ~800 days' uptime). The signature
>>> of this was 900s of slow requests, then an ms log showing "initiating
>>> reconnect". Until we got the kernel upgraded everywhere, we used a
>>> workaround of ms tcp read timeout = 60.
>>> So, check your kernels, and upgrade if they're ancient. Latest el6
>>> kernels work for us.
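>>>
>>> The workaround was just a line like this in ceph.conf (value in seconds;
>>> the default is 900 if I remember right):
>>>
>>> ms tcp read timeout = 60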
>>>
>>> Otherwise, those huge osd leveldbs don't look right (unless you're
>>> using tons and tons of omap...). And it kinda reminds me of the other problem
>>> we hit after the hammer upgrade, namely the return of the ever-growing mon
>>> leveldb issue. The solution was to recreate the mons one by one. Perhaps
>>> you've hit something similar with the OSDs. debug_osd=10 might be good
>>> enough to see what the osd is doing; maybe you need debug_filestore=10 also.
>>> If that doesn't show the problem, bump those up to 20.
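>>>
>>> E.g. to bump those on a live OSD without restarting it (osd id is just an
>>> example):
>>>
>>> ceph tell osd.1191 injectargs '--debug_osd 10 --debug_filestore 10'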
>>>
>>> Good luck,
>>>
>>> Dan
>>>
>>> On 30 Nov 2015 20:56, "Tom Christensen" <pavera@xxxxxxxxx> wrote:
>>> >
>>> > We recently upgraded to 0.94.3 from firefly and for the last week have
>>> > had intermittent slow requests and flapping OSDs.  We have been unable
>>> > to nail down the cause, but it's feeling like it may be related to our
>>> > osdmaps not getting deleted properly.  Most of our osds are now storing over
>>> > 100GB of data in the meta directory; almost all of that is historical osd
>>> > maps more than 7 days old.
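>>> >
>>> > A rough way to count how many of those maps are more than a week old,
>>> > assuming the default filestore layout and that the full map objects still
>>> > start with "osdmap." on disk (from memory):
>>> >
>>> > find /var/lib/ceph/osd/ceph-1191/current/meta -name 'osdmap*' -mtime +7 | wc -l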
>>> >
>>> > We did make a small cluster change (we added 35 OSDs to a 1445-OSD
>>> > cluster); the rebalance took about 36 hours and completed 10 days ago.
>>> > Since then the cluster has been HEALTH_OK and all pgs have been
>>> > active+clean, except when we have an OSD flap.
>>> >
>>> > When the OSDs flap they do not crash and restart; they just go
>>> > unresponsive for 1-3 minutes and then come back alive all on their own.
>>> > They get marked down by peers, which causes some peering, and then they
>>> > just rejoin the cluster and continue on their merry way.
>>> >
>>> > We see a bunch of this in the logs while the OSD is catatonic:
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143166
>>> > 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700'
>>> > had timed out after 15
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143176
>>> > 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping
>>> > ping request
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143210
>>> > 7f5b04e7c700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700'
>>> > had timed out after 15
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143218
>>> > 7f5b04e7c700 10 osd.1191 1203850 internal heartbeat not healthy, dropping
>>> > ping request
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143288
>>> > 7f5b03679700  1 heartbeat_map is_healthy 'OSD::osd_tp thread 0x7f5affe72700'
>>> > had timed out after 15
>>> >
>>> > Nov 30 11:23:38 osd-10 ceph-osd: 2015-11-30 11:22:32.143293
>>> > 7f5b03679700 10 osd.1191 1203850 internal heartbeat not healthy, dropping
>>> > ping request
>>> >
>>> >
>>> > I have a chunk of logs at debug 20/5; not sure if I should have done
>>> > just 20... It's pretty hard to catch: we basically have to see the slow
>>> > requests and get debug logging set in about a 5-10 second window before the
>>> > OSD stops responding to the admin socket...
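>>> >
>>> > For the record, what we run by hand when we do catch it in time is more
>>> > or less this (osd id is just an example):
>>> >
>>> > ceph daemon osd.1191 config set debug_osd 20/5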
>>> >
>>> > As networking is almost always the cause of flapping OSDs, we have
>>> > tested the network quite extensively.  It hasn't changed physically since
>>> > before the hammer upgrade, and it was performing well.  We have run
>>> > extensive ping tests and have not seen a single dropped packet between osd
>>> > nodes or between osd nodes and mons.
>>> >
>>> > I don't see any error packets or drops on switches either.
>>> >
>>> > Ideas?
>>> >
>>> >
>>> >
>>
>>
>
>
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
