More than 50% osds down, CPUs still busy; will the cluster recover without help?

... Trying to send again after reporting bounce backs to dreamhost ...
... Trying to send one more time after seeing mails come through the
list today ...

Hi all,

First off, I should point out that this is a 'small cluster' issue and
may well be down to the stretched resources. If I'm doomed to destroying
it and starting again, so be it, but I'm interested to see if things can
be brought back up and running again.

My experimental Ceph cluster now has 5 nodes with 3 osds each. Some
drives are big, some are small. Most are formatted with BTRFS and
two are still formatted with XFS, which I intend to remove and recreate
with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but
compression suits my use case, so I'm prepared to stick with it while it
matures. I had to set the following to stop osds dying as the IO was
consumed by the snapshot creation and deletion process (as I understand
it):

    filestore btrfs snap = false

and the mount options look like this:

    osd mount options btrfs = rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo
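
For completeness, both of those sit in my ceph.conf; I believe either the
[osd] or [global] section works, and mine looks roughly like this (other
settings omitted):

    [osd]
        filestore btrfs snap = false
        osd mount options btrfs = rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo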

Each node is an HP Microserver n36l or n54l with 8 GB of memory, so CPU
horsepower is somewhat lacking. Ceph is version 0.80.8, and each node is
also a mon.

My issue is this: after adding the 15th osd, the cluster went into a
spiral of destruction, with osds going down one after another. In the
past an osd might go down occasionally, and restarting it would usually
remedy things. This time it hasn't, and the problem has become
progressively worse. I've tried starting osds and restarting whole
hosts, to no avail. I've brought all osds back 'in' and set the noup,
nodown and noout flags (the commands I used are shown after the status
output below). I've ceased rbd activity since it was getting blocked
anyway. The cluster now appears to be 'stuck' in this state:

    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969 pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445 pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
     monmap e5: 5 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618, quorum 0,1,2,3,4 0,1,2,3,4
     osdmap e63276: 15 osds: 7 up, 15 in
            flags noup,nodown,noout
      pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
            4611 GB used, 871 GB / 5563 GB avail
            838948/2578420 objects degraded (32.537%)
                   3 down+remapped+peering
                   8 stale+active+degraded+remapped
                  85 active+clean
                   1 stale+incomplete
                1088 stale+down+peering
                 642 active+degraded+remapped
                   1 incomplete
                  33 stale+remapped+peering
                 135 down+peering
                   1 stale+degraded
                    1 stale+active+degraded+remapped+wait_backfill+backfill_toofull
                 854 active+remapped
                 234 stale+active+degraded
                   4 active+degraded+remapped+backfill_toofull
                  40 active+remapped+backfill_toofull
                1079 active+degraded
                   5 stale+active+clean
                  74 stale+peering
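
For reference, the flags were set with the usual commands, along these
lines:

# ceph osd set noup
# ceph osd set nodown
# ceph osd set noout

(and they can be cleared again later with 'ceph osd unset <flag>' once
things look healthier).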

Take one of the nodes. It holds osds 12 (down & in), 13 (up & in) and 14
(down & in).

# ceph osd stat
     osdmap e63276: 15 osds: 7 up, 15 in
            flags noup,nodown,noout

# ceph daemon osd.12 status
no valid command found; 10 closest matches:
config show
help
log dump
get_command_descriptions
git_version
config set <var> <val> [<val>...]
version
2
config get <var>
0
admin_socket: invalid command

# ceph daemon osd.13 status
{ "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
  "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
  "whoami": 13,
  "state": "active",
  "oldest_map": 48214,
  "newest_map": 63276,
  "num_pgs": 790}

# ceph daemon osd.14 status
admin_socket: exception getting command descriptions: [Errno 111]
Connection refused

I'm assuming osds 12 and 14 are acting that way because they're not up,
but why are they different?
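
My guess at the difference: osd.12's process is still alive but seems to
be stuck partway through start-up (its log below stops right after the
journal open), so only the generic admin-socket commands are registered,
whereas osd.14's process has presumably died altogether, hence the
connection refused. I suppose I could sanity-check that guess by seeing
whether the processes and their admin sockets are actually there
(assuming the default socket paths):

# ps aux | grep ceph-osd
# ls -l /var/run/ceph/ceph-osd.12.asok /var/run/ceph/ceph-osd.14.asok
# ceph daemon osd.12 version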

In terms of logs, ceph-osd.12.log doesn't go beyond this:
2015-02-22 10:38:29.629407 7fd24952c780  0 ceph version 0.80.8
(69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
2015-02-22 10:38:29.639802 7fd24952c780  0
filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
2015-02-22 10:38:29.876993 7fd24952c780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
FIEMAP ioctl is supported and appears to work
2015-02-22 10:38:29.877028 7fd24952c780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-22 10:38:31.487359 7fd24952c780  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features:
syscall(SYS_syncfs, fd) fully supported
2015-02-22 10:38:31.487538 7fd24952c780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
CLONE_RANGE ioctl is supported
2015-02-22 10:38:31.996129 7fd24952c780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
SNAP_CREATE is supported
2015-02-22 10:38:32.016044 7fd24952c780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
SNAP_DESTROY is supported
2015-02-22 10:38:32.016447 7fd24952c780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
START_SYNC is supported (transid 388132)
2015-02-22 10:38:32.228338 7fd24952c780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
WAIT_SYNC is supported
2015-02-22 10:38:32.315391 7fd24952c780  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature:
SNAP_CREATE_V2 is supported
2015-02-22 10:38:40.051466 7fd24952c780  0
filestore(/var/lib/ceph/osd/ceph-12) mount: enabling WRITEAHEAD journal
mode: checkpoint is not enabled
2015-02-22 10:38:48.814208 7fd24952c780  1 journal _open
/var/lib/ceph/osd/ceph-12/journal fd 22: 5367660544 bytes, block size
4096 bytes, directio = 1, aio = 1

ceph-osd.13.log seems to have stopped being written to earlier today.
This osd is up, though, so I'm guessing that's ok.

...
2015-02-22 10:39:59.743880 7f1a37eb3700 -1 osd.13 63267 heartbeat_check:
no reply from osd.9 ever on either front or back, first ping sent
2015-02-22 10:39:12.431674 (cutoff 2015-02-22 10:39:39.743877)
2015-02-22 10:56:28.549016 7f1a139a0700  0 -- 192.168.12.29:6806/3462 >>
192.168.12.26:6806/3610 pipe(0x38d2f00 sd=86 :43837 s=2 pgs=15 cs=1 l=0
c=0x45de89a0).fault with nothing to send, going to standby
2015-02-22 10:56:28.612828 7f1a1379e700  0 -- 192.168.12.29:6806/3462 >>
192.168.12.25:6801/3290 pipe(0x38d2a00 sd=83 :49115 s=2 pgs=73 cs=1 l=0
c=0x45de82c0).fault with nothing to send, going to standby
2015-02-22 10:58:56.937399 7f1a197bd700  0 log [INF] : 4.4f1 deep-scrub
ok
2015-02-22 10:59:31.122396 7f1a0f869700  0 -- 192.168.12.29:6806/3462 >>
192.168.12.26:6806/3610 pipe(0x449d6c80 sd=109 :6806 s=0 pgs=0 cs=0 l=0
c=0x4610e160).accept connect_seq 2 vs existing 1 state standby
2015-02-22 11:00:46.476679 7f1a197bd700  0 log [INF] : 4.753 scrub ok
2015-02-22 11:13:03.189352 7f1a197bd700  0 log [INF] : 3.c2 scrub ok
2015-02-22 11:16:43.317444 7f1a0f869700  0 -- 192.168.12.29:6806/3462 >>
192.168.12.26:6806/3610 pipe(0x449d6c80 sd=109 :6806 s=2 pgs=93 cs=3 l=0
c=0x45de89a0).fault with nothing to send, going to standby
2015-02-22 11:16:43.317591 7f1a1379e700  0 -- 192.168.12.29:6806/3462 >>
192.168.12.25:6801/3290 pipe(0x38d2a00 sd=83 :49210 s=2 pgs=93 cs=3 l=0
c=0x45de82c0).fault with nothing to send, going to standby
2015-02-22 11:24:18.421633 7f1a197bd700  0 log [INF] : 3.12b scrub ok

In ceph-osd.14.log, there were a lot of these earlier:

2015-02-22 10:41:00.643830 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.27:6801/3471 pipe(0xa0822500 sd=102 :58233 s=1 pgs=0 cs=0 l=0
c=0x2acf6f20).connect claims to be 192.168.12.27:6801/3457 not
192.168.12.27:6801/3471 - wrong node!
2015-02-22 10:41:00.643924 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.27:6801/3471 pipe(0xa0822500 sd=102 :58233 s=1 pgs=0 cs=0 l=0
c=0x2acf6f20).fault with nothing to send, going to standby
2015-02-22 10:41:03.680471 7fc4a26ca700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.27:6806/3653 pipe(0xa0822500 sd=29 :44145 s=1 pgs=0 cs=0 l=0
c=0x58cdb600).connect claims to be 192.168.12.27:6806/3656 not
192.168.12.27:6806/3653 - wrong node!
2015-02-22 10:41:03.680631 7fc4a26ca700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.27:6806/3653 pipe(0xa0822500 sd=29 :44145 s=1 pgs=0 cs=0 l=0
c=0x58cdb600).fault with nothing to send, going to standby
2015-02-22 10:41:03.683499 7fc4b3030700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.26:6801/3331 pipe(0xa0822280 sd=31 :55061 s=1 pgs=0 cs=0 l=0
c=0x4ebc71e0).connect claims to be 192.168.12.26:6801/3345 not
192.168.12.26:6801/3331 - wrong node!
2015-02-22 10:41:03.683576 7fc4b3030700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.26:6801/3331 pipe(0xa0822280 sd=31 :55061 s=1 pgs=0 cs=0 l=0
c=0x4ebc71e0).fault with nothing to send, going to standby
2015-02-22 10:41:03.685207 7fc4a25c9700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.25:6801/3249 pipe(0xa0822c80 sd=32 :49182 s=1 pgs=0 cs=0 l=0
c=0xa0589e40).connect claims to be 192.168.12.25:6801/3290 not
192.168.12.25:6801/3249 - wrong node!
2015-02-22 10:41:03.685246 7fc4a25c9700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.25:6801/3249 pipe(0xa0822c80 sd=32 :49182 s=1 pgs=0 cs=0 l=0
c=0xa0589e40).fault with nothing to send, going to standby
2015-02-22 10:41:04.089452 7fc4a17c1700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.27:6801/3442 pipe(0x3b5f400 sd=110 :58240 s=1 pgs=0 cs=0 l=0
c=0x69922b00).connect claims to be 192.168.12.27:6801/3457 not
192.168.12.27:6801/3442 - wrong node!
2015-02-22 10:41:04.090755 7fc4a17c1700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.27:6801/3442 pipe(0x3b5f400 sd=110 :58240 s=1 pgs=0 cs=0 l=0
c=0x69922b00).fault with nothing to send, going to standby
2015-02-22 10:41:16.373495 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.28:6811/3880 pipe(0x3b5f680 sd=28 :43896 s=1 pgs=0 cs=0 l=0
c=0x59128580).connect claims to be 0.0.0.0:6811/3825 not
192.168.12.28:6811/3880 - wrong node!
2015-02-22 10:41:16.373584 7fc4b2f2f700  0 -- 0.0.0.0:6801/3175 >>
192.168.12.28:6811/3880 pipe(0x3b5f680 sd=28 :43896 s=1 pgs=0 cs=0 l=0
c=0x59128580).fault with nothing to send, going to standby

Then, more recently, this:

    -3> 2015-02-22 15:41:09.738800 7fc4a6dd2700  5 osd.14 60983
heartbeat: osd_stat(997 GB used, 826 GB avail, 1858 GB total, peers
[]/[] op hist [])
    -2> 2015-02-22 15:41:09.959586 7fc4dc4b5700  1 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7fc4cbc6f700' had timed out after
600
    -1> 2015-02-22 15:41:09.959619 7fc4dc4b5700  1 heartbeat_map
is_healthy 'FileStore::op_tp thread 0x7fc4cbc6f700' had suicide timed
out after 18000
     0> 2015-02-22 15:41:09.979187 7fc4dc4b5700 -1
common/HeartbeatMap.cc: In function 'bool
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
time_t)' thread 7fc4dc4b5700 time 2015-02-22 15:41:09.960169
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

I've now raised 'filestore op thread suicide timeout' even higher and
rebooted all nodes to try to get things moving again.
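
For reference, I believe these are the settings involved (the values
here are what the log messages above suggest I'm currently running with;
the defaults are far lower):

    [osd]
        filestore op thread timeout = 600
        filestore op thread suicide timeout = 18000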

After a good 90 minutes, I'm seeing the likes of these on another node:

osd 0: alternating between two messages?

2015-02-22 17:36:49.885502 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:36:49.885532 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:36:54.885663 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:36:54.885698 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:36:59.885766 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:36:59.885787 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:37:04.885940 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:37:04.885968 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:37:09.886072 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:37:09.886103 7fafdf885700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600

osd 11: alternating between two messages?

2015-02-22 17:36:08.298415 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:13.298487 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:13.298512 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:18.298653 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:18.298740 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:23.298808 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:23.298830 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:28.298921 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:28.298949 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:33.299003 7f983d949700  1 heartbeat_map is_healthy
'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600

And osd 3:

2015-02-22 16:05:35.309159 7fe6d4912700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.29:6806/3462 pipe(0x3d1eef00 sd=112 :60063 s=1 pgs=0 cs=0 l=0
c=0x5b689a20).connect claims to be 0.0.0.0:6806/3359 not
192.168.12.29:6806/3462 - wrong node!
2015-02-22 16:05:35.309245 7fe6d4912700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.29:6806/3462 pipe(0x3d1eef00 sd=112 :60063 s=1 pgs=0 cs=0 l=0
c=0x5b689a20).fault with nothing to send, going to standby
2015-02-22 16:05:40.045763 7fe6d4d26700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.26:6806/3610 pipe(0x3d1eea00 sd=85 :58392 s=1 pgs=0 cs=0 l=0
c=0x5b150dc0).connect claims to be 0.0.0.0:6806/3563 not
192.168.12.26:6806/3610 - wrong node!
2015-02-22 16:05:40.045857 7fe6d4d26700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.26:6806/3610 pipe(0x3d1eea00 sd=85 :58392 s=1 pgs=0 cs=0 l=0
c=0x5b150dc0).fault with nothing to send, going to standby
2015-02-22 16:05:43.003097 7fe6d4e27700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.27:6801/3457 pipe(0x3d1ee500 sd=84 :45966 s=1 pgs=0 cs=0 l=0
c=0x5b150b00).connect claims to be 0.0.0.0:6801/3047 not
192.168.12.27:6801/3457 - wrong node!
2015-02-22 16:05:43.003187 7fe6d4e27700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.27:6801/3457 pipe(0x3d1ee500 sd=84 :45966 s=1 pgs=0 cs=0 l=0
c=0x5b150b00).fault with nothing to send, going to standby
2015-02-22 16:06:12.223369 7fe6d4b24700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.27:6806/3656 pipe(0x3d1eec80 sd=73 :41657 s=1 pgs=0 cs=0 l=0
c=0x5b62eb00).connect claims to be 0.0.0.0:6806/3251 not
192.168.12.27:6806/3656 - wrong node!
2015-02-22 16:06:12.223455 7fe6d4b24700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.27:6806/3656 pipe(0x3d1eec80 sd=73 :41657 s=1 pgs=0 cs=0 l=0
c=0x5b62eb00).fault with nothing to send, going to standby
2015-02-22 16:07:50.036067 7fe6d4c25700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.28:6811/3825 pipe(0x3d1ee780 sd=83 :40423 s=1 pgs=0 cs=0 l=0
c=0x5b1511e0).connect claims to be 0.0.0.0:6811/3803 not
192.168.12.28:6811/3825 - wrong node!
2015-02-22 16:07:50.036124 7fe6d4c25700  0 -- 0.0.0.0:6801/3007 >>
192.168.12.28:6811/3825 pipe(0x3d1ee780 sd=83 :40423 s=1 pgs=0 cs=0 l=0
c=0x5b1511e0).fault with nothing to send, going to standby

----

Currently the nodes' CPUs certainly seem to be busy, but I'm not seeing
any change in the status of the cluster. I read about a similar problem
a while back where the osds were lagging behind in their processing of
maps and simply needed to catch up. Is it likely that this is the same
thing? Is there much I can do to help things along, or do I just need to
be patient? CPU is still being used, after all, even if the output of
'ceph status' looks the same as it did an hour and a half ago when the
machines were restarted.
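
One check I've thought of (just my guess at something useful) is to
watch whether the osds' map epochs are actually advancing towards the
cluster's current epoch, using the admin socket as earlier, e.g.:

# ceph osd stat        # current epoch, e.g. "osdmap e63276"
# while true; do ceph daemon osd.13 status | grep -E 'newest_map|oldest_map'; sleep 60; done

If 'newest_map' is creeping upwards, then presumably the osds really are
chewing through maps and it's just a matter of patience.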

Update: since I'm having to try to send this email again, some time has
passed and the cluster nodes have now been up for about 21.5 hours. CPU
usage is approximately as follows:
192.168.12.25		n54l	50%
192.168.12.26		n54l	1%
192.168.12.27		n36l	10%
192.168.12.28		n36l	1%
192.168.12.29		n54l	10%

So one of the nodes is much busier than the others. But is it actually
doing anything?
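
If it helps with diagnosis, these are the sorts of things I can look at
on the busy node (again, just my guesses at useful commands; osd.N here
stands for whichever osds live on that host):

# top -H                                  # which ceph-osd threads are actually busy
# ceph daemon osd.N dump_ops_in_flight    # any ops currently being worked on?
# ceph daemon osd.N perf dump             # raw perf counters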

Thanks again,
Chris
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



