Re: More than 50% osds down, CPUs still busy; will the cluster recover without help?

Thanks, Greg

After seeing some recommendations in another thread, my impatience got the better of me and I've started the process again, but there is some logic to it, I promise :-)
I've copied the process from Michael Kidd, I believe, and it goes along these lines (the matching commands are sketched just after this list):

setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
stopping all OSDs
setting all OSDs down & out
setting various options in ceph.conf to limit backfill activity etc
starting all OSDs
wait until all CPU settles to 0%  <-- I am here
unset the noup flag
wait until all CPU settles to 0%
unset the noin flag
wait until all CPU settles to 0%
unset the nobackfill flag
wait until all CPU settles to 0%
unset the norecover flag
remove options from ceph.conf
unset the noscrub flag
unset the nodeep-scrub flag
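
For the archives, the commands behind those steps look roughly like this. It's a sketch rather than exactly what I ran: the init invocation is sysvinit-style, and the ceph.conf values are just the sort of throttles people recommend, so adjust for your own cluster.

    ceph osd set noup
    ceph osd set noin
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    ceph osd set norecover
    ceph osd set nobackfill

    # stop every OSD in the cluster (sysvinit syntax; adjust for your init system)
    service ceph -a stop osd

    # mark them all down and out
    for id in $(ceph osd ls); do
        ceph osd down $id
        ceph osd out $id
    done

    # throttle recovery via the [osd] section of ceph.conf, e.g.:
    #   osd max backfills = 1
    #   osd recovery max active = 1
    #   osd recovery op priority = 1

    service ceph -a start osd

    # then, pausing until CPU settles after each step:
    ceph osd unset noup
    ceph osd unset noin        # OSDs marked out by hand may also need 'ceph osd in <id>'
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub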


Currently, per-host CPU usage is approximately as follows, so something has changed, and I'm tempted to leave things a little longer before my next step, just in case the CPUs do eventually stop spinning. I've read reports of this taking "a while" even on modern Xeons, so I suppose it's not outside the realms of possibility that an AMD Neo might take days to work things out. We're up to 24.5 hours now:

192.168.12.25		20%
192.168.12.26		1%
192.168.12.27		15%
192.168.12.28		1%
192.168.12.29		12%

Interestingly, 192.168.12.26 and .28 are also the two which stopped spinning before I restarted this process.
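
In case it's useful, I'm watching the OSD CPU with something like the below (assuming passwordless ssh as root to each host; summing ps output with awk is just one way to do it):

    for h in 192.168.12.25 192.168.12.26 192.168.12.27 192.168.12.28 192.168.12.29; do
        printf '%s\t' "$h"
        # sum %CPU across all ceph-osd processes on the host
        ssh root@"$h" "ps -C ceph-osd -o %cpu= | awk '{s+=\$1} END {print s+0}'"
    done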

The mix of PG states is slightly less confusing now, but not by much :-)

788386/2591752 objects degraded (30.419%)
                  90 stale+active+clean
                   2 stale+down+remapped+peering
                   2 stale+incomplete
                   1 stale+active+degraded+remapped+wait_backfill+backfill_toofull
                   1 stale+degraded
                1255 stale+active+degraded
                  32 stale+remapped+peering
                 773 stale+active+remapped
                   4 stale+active+degraded+remapped+backfill_toofull
                1254 stale+down+peering
                 278 stale+peering
                  33 stale+active+remapped+backfill_toofull
                 563 stale+active+degraded+remapped
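
(That breakdown is from ceph status; if anyone wants the same summary directly, something like this should produce it. I'm assuming the pgs_brief dump puts the state in column two, with a header line to skip:

    ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $2}' | sort | uniq -c | sort -rn

Individual stuck PGs can then be inspected with ceph pg <pgid> query.)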

> Well, you below indicate that osd.14's log says it crashed on an internal heartbeat timeout (usually, it got stuck waiting for disk IO or the kernel/btrfs hung), so that's why. The osd.12 process exists but isn't "up"; osd.14 doesn't even have a socket to connect to.

Ah, that does make sense, thank you.
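
For what it's worth, this is roughly how I've been checking whether each OSD still has a socket to talk to (default asok paths assumed):

    for id in 12 14; do
        sock=/var/run/ceph/ceph-osd.$id.asok
        if [ -S "$sock" ]; then
            ceph --admin-daemon "$sock" version
        else
            echo "osd.$id: no admin socket"
        fi
    done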

> That's not what I'd expect to see (it appears to have timed out and not be recognizing it?) but I don't look at these things too often so maybe that's the normal indication that heartbeats are failing.

I'm not sure what this means either. A Google search for "heartbeat_map is_healthy  FileStore::op_tp thread had timed out after" doesn't return much (the grep I'm using against my own logs is below), but I did see this quote from Sage on what looks like a similar matter:

> - the filestore op_queue is blocked on the throttler (too much io queued)
> - the commit thread is also waiting for ops to finish
> - i see no actual thread processing the op_queue
> Usually that's because it hit a kernel bug and got killed.  Not sure what 
> else would make that thread disappear...
> sage

Oh!
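
In case anyone wants to check their own logs for the same thing, a straightforward grep finds it (default log location assumed):

    grep -l 'heartbeat_map is_healthy.*had timed out after' /var/log/ceph/ceph-osd.*.log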

> Also, you want to find out why they're dying. That's probably the root cause of your issues

I have a sneaking suspicion it's BTRFS, but I don't have the evidence, or perhaps the knowledge, to prove it. If XFS supported compression I'd go with that, but at the moment I need to rely on compression to solve the problem of reclaiming space *within* files which reside on Ceph. As far as I remember, overwriting with zeros didn't re-sparse the files on XFS, if that makes sense.
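
(For reference, the kind of thing I mean: util-linux fallocate can dig zeroed ranges back out into holes after the fact, without relying on filesystem compression. The path here is purely illustrative:

    fallocate --dig-holes /mnt/ceph/some-large-image

I haven't tested how well that works for the files in question, though.)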

Thanks again,
Chris
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



