Re: More than 50% osds down, CPUs still busy; will the cluster recover without help?

"Chris Murray" <chrismurray84@xxxxxxxxx> · Thu, 26 Feb 2015 21:56:06 -0000

Tackling this on a more piecemeal basis, I've stopped all OSDs, and
started just the three which exist on the first host. 

osd.0 comes up without complaint: "osd.0 63675 done with init, starting
boot process"
osd.3 comes up without complaint: "osd.3 63675 done with init, starting
boot process 
osd.11 is a problematic one.

It does something like this ...

2015-02-26 10:44:50.260593 7f7e23551780  0 ceph version 0.80.8
(69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 305080
2015-02-26 10:44:50.265525 7f7e23551780  0
filestore(/var/lib/ceph/osd/ceph-11) mount detected btrfs
2015-02-26 10:44:51.155501 7f7e23551780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
FIEMAP ioctl is supported and appears to work
2015-02-26 10:44:51.155536 7f7e23551780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-26 10:44:51.433239 7f7e23551780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-02-26 10:44:51.433467 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
CLONE_RANGE ioctl is supported
2015-02-26 10:44:51.644373 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
SNAP_CREATE is supported
2015-02-26 10:44:51.668424 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
SNAP_DESTROY is supported
2015-02-26 10:44:51.668741 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
START_SYNC is supported (transid 43205)
2015-02-26 10:44:51.766577 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
WAIT_SYNC is supported
2015-02-26 10:44:51.814761 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
SNAP_CREATE_V2 is supported
2015-02-26 10:44:52.181382 7f7e23551780  0
filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2015-02-26 10:44:52.221994 7f7e23551780  1 journal _open
/var/lib/ceph/osd/ceph-11/journal fd 22: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2015-02-26 10:44:53.073348 7f7e23551780  1 journal _open
/var/lib/ceph/osd/ceph-11/journal fd 22: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2015-02-26 10:44:53.300055 7f7e23551780  1 journal close
/var/lib/ceph/osd/ceph-11/journal
2015-02-26 10:44:53.388716 7f7e23551780  0
filestore(/var/lib/ceph/osd/ceph-11) mount detected btrfs
2015-02-26 10:44:53.522010 7f7e23551780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
FIEMAP ioctl is supported and appears to work
2015-02-26 10:44:53.522043 7f7e23551780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-26 10:44:53.898121 7f7e23551780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-02-26 10:44:53.898273 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
CLONE_RANGE ioctl is supported
2015-02-26 10:44:54.122056 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
SNAP_CREATE is supported
2015-02-26 10:44:54.142426 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
SNAP_DESTROY is supported
2015-02-26 10:44:54.142722 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
START_SYNC is supported (transid 43212)
2015-02-26 10:44:54.233152 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
WAIT_SYNC is supported
2015-02-26 10:44:54.276165 7f7e23551780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_feature:
SNAP_CREATE_V2 is supported
2015-02-26 10:44:54.550810 7f7e23551780  0
filestore(/var/lib/ceph/osd/ceph-11) mount: WRITEAHEAD journal mode
explicitly enabled in conf
2015-02-26 10:44:54.585626 7f7e23551780  1 journal _open
/var/lib/ceph/osd/ceph-11/journal fd 23: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2015-02-26 10:44:54.589469 7f7e23551780  1 journal _open
/var/lib/ceph/osd/ceph-11/journal fd 23: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1
2015-02-26 10:44:54.590413 7f7e23551780  0 <cls>
cls/hello/cls_hello.cc:271: loading cls_hello
2015-02-26 10:44:54.604303 7f7e23551780  0 osd.11 61425 crush map has
features 1107558400, adjusting msgr requires for clients
2015-02-26 10:44:54.604323 7f7e23551780  0 osd.11 61425 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
2015-02-26 10:44:54.604329 7f7e23551780  0 osd.11 61425 crush map has
features 1107558400, adjusting msgr requires for osds
2015-02-26 10:44:54.604343 7f7e23551780  0 osd.11 61425 load_pgs

... then pauses for a couple of minutes ...

2015-02-26 10:46:26.313449 7f7e23551780  0 osd.11 61425 load_pgs opened
2426 pgs
2015-02-26 10:46:26.316307 7f7e23551780 -1 osd.11 61425
set_disk_tp_priority(22) Invalid argument: osd_disk_thread_ioprio_class
is  but only the following values are allowed: idle, be or rt
2015-02-26 10:46:26.320354 7f7df1017700  0 osd.11 61425 ignoring osdmap
until we have initialized
2015-02-26 10:46:26.320431 7f7df1017700  0 osd.11 61425 ignoring osdmap
until we have initialized
2015-02-26 10:46:26.334867 7f7df544e700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.27:6801/3135 pipe(0x2c9e500 sd=39 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d1e0).fault with nothing to send, going to standby
2015-02-26 10:46:26.334985 7f7df3378700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.28:6811/3824 pipe(0x2c9ec80 sd=41 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d4a0).fault with nothing to send, going to standby
2015-02-26 10:46:26.335074 7f7df9a90700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.26:6806/3574 pipe(0x2c9f400 sd=43 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d340).fault with nothing to send, going to standby
2015-02-26 10:46:26.335152 7f7df9b91700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.26:6801/3188 pipe(0x2c9f180 sd=44 :0 s=1 pgs=0 cs=0 l=0
c=0xe248d080).fault with nothing to send, going to standby
2015-02-26 10:46:26.341356 7f7df534d700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.25:6801/3318 pipe(0x2c9ef00 sd=40 :52612 s=1 pgs=0 cs=0 l=0
c=0xe248d600).connect claims to be 0.0.0.0:6801/305080 not
192.168.12.25:6801/3318 - wrong node!
2015-02-26 10:46:26.341522 7f7df534d700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.25:6801/3318 pipe(0x2c9ef00 sd=40 :52612 s=1 pgs=0 cs=0 l=0
c=0xe248d600).fault with nothing to send, going to standby
2015-02-26 10:46:26.467208 7f7df2233700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.27:6806/3582 pipe(0xe3566000 sd=148 :0 s=1 pgs=0 cs=0 l=0
c=0xe355ac60).fault with nothing to send, going to standby
2015-02-26 10:46:26.467400 7f7df1ef4700  0 -- 0.0.0.0:6801/305080 >>
192.168.12.29:6801/3169 pipe(0xe3566500 sd=149 :0 s=1 pgs=0 cs=0 l=0
c=0xe355ab00).fault with nothing to send, going to standby

At this point, it's writing to /dev/sdb at up to about 100 IOPs, which
would be sensible for the drive in question: a WD Red 3TB. Over 60
second intervals, iostat looks a bit like this:

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdb               5.17         0.00       660.40          0      39624
sdb               5.40         0.00       688.27          0      41296
sdb               4.62         0.00       564.47          0      33868
sdb               4.95         0.00       655.13          0      39308
sdb               5.33         0.00       688.87          0      41332
sdb               5.05         0.00       663.47          0      39808

Every now and then there'll be a few kb of reads, but the rest of the
time it's just writing. /dev/sdc and /dev/sdd don't appear to be doing
that, despite also being formatted with BTRFS.

Then, after whatever I've set "filestore op thread timeout" to, these
messages start. I've tried 600, 3600, and even a whopping 32400 (9
hours, to fit troubleshooting into what's left of my day here in the UK)

2015-02-26 21:39:03.215672 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
2015-02-26 21:39:03.215699 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
2015-02-26 21:39:08.215855 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
2015-02-26 21:39:08.215883 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400
2015-02-26 21:39:13.216038 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723d298700' had timed out after 32400
2015-02-26 21:39:13.216066 7f724a58d700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f723dc99700' had timed out after 32400

Note the subtle alternation between two thread numbers. 

Even after the start of these timeouts, /dev/sdb is still having bursts
of write activity, which is interesting.

sdb               5.27         0.00       683.60          0      41016
sdb               5.20         0.00       660.40          0      39624
sdb               5.17         0.00       639.67          0      38380
sdb               5.00         0.53       617.33         32      37040

Can anyone comment on what might be causing this error for this osd?
Many years ago, when ZFS was in its infancy, I had a dedup disaster
which I thought would never end, but that just needed to do its thing
before the pool came back to life. Could this be a similar scenario
perhaps? Is the activity leading up to something, and BTRFS is slowly
doing what Ceph is asking of it, or is it just going round and round in
circles and I just can't see? :-)

-----Original Message-----
From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
Chris Murray
Sent: 25 February 2015 12:58
To: Gregory Farnum
Cc: ceph-users
Subject: Re:  More than 50% osds down, CPUs still busy;will
the cluster recover without help?

Thanks Greg

After seeing some recommendations I found in another thread, my
impatience got the better of me, and I've start the process again, but
there is some logic, I promise :-) I've copied the process from Michael
Kidd, I believe, and it goes along the lines of:

setting noup, noin, noscrub, nodeep-scrub, norecover, nobackfill
stopping all OSDs setting all OSDs down & out setting various options in
ceph.conf to limit backfill activity etc starting all OSDs wait until
all CPU settles to 0%  <-- I am here unset the noup flag wait until all
CPU settles to 0% unset the noin flag wait until all CPU settles to 0%
unset the nobackfill flag wait until all CPU settles to 0% unset the
norecover flag remove options from ceph.conf unset the noscrub flag
unset the nodeep-scrub flag

Currently, host CPU usage is approx the following, so something's
changed, and I'm tempted to leave things a little longer before my next
step, just in case CPU does eventually stop spinning. I read reports of
things taking "a while" even with modern Xeons, so I suppose it's not
outside the realms of possibility that an AMD Neo might take days to
work things out. We're up to 24.5 hours now:

192.168.12.25		20%
192.168.12.26		1%
192.168.12.27		15%
192.168.12.28		1%
192.168.12.29		12%

Interesting, as 192.168.12.26 and .28 are the two which stopped spinning
before I restarted this process too.

The number of different states is slightly less confusing now, but not
by much: :-)

788386/2591752 objects degraded (30.419%)
                  90 stale+active+clean
                   2 stale+down+remapped+peering
                   2 stale+incomplete
                   1
stale+active+degraded+remapped+wait_backfill+backfill_toofull
                   1 stale+degraded
                1255 stale+active+degraded
                  32 stale+remapped+peering
                 773 stale+active+remapped
                   4 stale+active+degraded+remapped+backfill_toofull
                1254 stale+down+peering
                 278 stale+peering
                  33 stale+active+remapped+backfill_toofull
                 563 stale+active+degraded+remapped

> Well, you below indicate that osd.14's log says it crashed on an
internal heartbeat timeout (usually, it got stuck waiting for disk IO or
the kernel/btrfs hung), so that's why. The osd.12 process exists but
isn't "up"; osd.14 doesn't even have a socket to connect to.

Ah, that does make sense, thank you.

> That's not what I'd expect to see (it appears to have timed out and
not be recognizing it?) but I don't look at these things too often so
maybe that's the normal indication that heartbeats are failing.

I'm not sure what this means either. A google for "heartbeat_map
is_healthy  FileStore::op_tp thread had timed out after" doesn't return
much, but I did see this quote from Sage on what looks like a similar
matter:

> - the filestore op_queue is blocked on the throttler (too much io 
> queued)
> - the commit thread is also waiting for ops to finish
> - i see no actual thread processing the op_queue Usually that's 
> because it hit a kernel bug and got killed.  Not sure what else would 
> make that thread disappear...
> sage

Oh!

> Also, you want to find out why they're dying. That's probably the root

> cause of your issues

I have a sneaking suspicion it's BTRFS, but don't have the evidence or
perhaps the knowledge to prove it. If XFS did compression, I'd go with
that, but at the moment I need to rely on compression to solve the
problem of reclaiming space *within* files which reside on ceph. As far
as I remember, overwriting with zeros didn't re-do the thin provisioning
on XFS, if that makes sense.

Thanks again,
Chris
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com