... Trying to send again after reporting bounce backs to dreamhost ...
... Trying to send one more time after seeing mails come through the list today ...

Hi all,

First off, I should point out that this is a 'small cluster' issue and may well be due to the stretched resources. If I'm doomed to destroying it all and starting again, so be it, but I'm interested to see if things can get up and running again.

My experimental ceph cluster now has 5 nodes with 3 osds each. Some drives are big, some drives are small. Most are formatted with BTRFS and two are still formatted with XFS, which I intend to remove and recreate with BTRFS at some point. I gather BTRFS isn't entirely stable yet, but compression suits my use case, so I'm prepared to stick with it while it matures.

I had to set the following to avoid osds dying as the IO was consumed by the snapshot creation and deletion process (as I understand it):

filestore btrfs snap = false

and the mount options look like this:

osd mount options btrfs = rw,noatime,space_cache,user_subvol_rm_allowed,compress-force=lzo

Each node is an HP Microserver n36l or n54l with 8GB of memory, so CPU horsepower is lacking somewhat. Ceph is version 0.80.8, and each node is also a mon.

My issue is: after adding the 15th osd, the cluster went into a spiral of destruction, with osds going down one after another. An osd might go down on occasion, and usually restarting the osd in question will remedy things. This time, though, it hasn't, and the problem appears to have become worse and worse. I've tried starting osds and restarting whole hosts, to no avail. I've brought all osds back 'in' and set noup, nodown and noout. I've ceased rbd activity since it was getting blocked anyway.

The cluster now appears to be 'stuck' in this state:

    cluster e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a
     health HEALTH_WARN 1 pgs backfill; 45 pgs backfill_toofull; 1969 pgs degraded; 1226 pgs down; 2 pgs incomplete; 1333 pgs peering; 1445 pgs stale; 1336 pgs stuck inactive; 1445 pgs stuck stale; 4198 pgs stuck unclean; recovery 838948/2578420 objects degraded (32.537%); 2 near full osd(s); 8/15 in osds are down; noup,nodown,noout flag(s) set
     monmap e5: 5 mons at {0=192.168.12.25:6789/0,1=192.168.12.26:6789/0,2=192.168.12.27:6789/0,3=192.168.12.28:6789/0,4=192.168.12.29:6789/0}, election epoch 2618, quorum 0,1,2,3,4 0,1,2,3,4
     osdmap e63276: 15 osds: 7 up, 15 in
            flags noup,nodown,noout
      pgmap v3371280: 4288 pgs, 5 pools, 3322 GB data, 835 kobjects
            4611 GB used, 871 GB / 5563 GB avail
            838948/2578420 objects degraded (32.537%)
                   3 down+remapped+peering
                   8 stale+active+degraded+remapped
                  85 active+clean
                   1 stale+incomplete
                1088 stale+down+peering
                 642 active+degraded+remapped
                   1 incomplete
                  33 stale+remapped+peering
                 135 down+peering
                   1 stale+degraded
                   1 stale+active+degraded+remapped+wait_backfill+backfill_toofull
                 854 active+remapped
                 234 stale+active+degraded
                   4 active+degraded+remapped+backfill_toofull
                  40 active+remapped+backfill_toofull
                1079 active+degraded
                   5 stale+active+clean
                  74 stale+peering

Take one of the nodes. It holds osds 12 (down & in), 13 (up & in) and 14 (down & in).

# ceph osd stat
     osdmap e63276: 15 osds: 7 up, 15 in
            flags noup,nodown,noout
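For reference, I set those flags and brought the osds back 'in' with the standard commands, roughly like this (the osd id is just an example):

# ceph osd set noup
# ceph osd set nodown
# ceph osd set noout
# ceph osd in 12     # and likewise for each osd that had been marked out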
# ceph daemon osd.12 status
no valid command found; 10 closest matches:
config show
help
log dump
get_command_descriptions
git_version
config set <var> <val> [<val>...]
version
2
config get <var>
0
admin_socket: invalid command

# ceph daemon osd.13 status
{ "cluster_fsid": "e3dd7a1a-bd5f-43fe-a06f-58e830b93b7a",
  "osd_fsid": "d7794b10-2366-4c4f-bb4d-5f11098429b6",
  "whoami": 13,
  "state": "active",
  "oldest_map": 48214,
  "newest_map": 63276,
  "num_pgs": 790}

# ceph daemon osd.14 status
admin_socket: exception getting command descriptions: [Errno 111] Connection refused

I'm assuming osds 12 and 14 are acting that way because they're not up, but why are they different?

In terms of logs, ceph-osd.12.log doesn't go beyond this:

2015-02-22 10:38:29.629407 7fd24952c780 0 ceph version 0.80.8 (69eaad7f8308f21573c604f121956e64679a52a7), process ceph-osd, pid 3813
2015-02-22 10:38:29.639802 7fd24952c780 0 filestore(/var/lib/ceph/osd/ceph-12) mount detected btrfs
2015-02-22 10:38:29.876993 7fd24952c780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features: FIEMAP ioctl is supported and appears to work
2015-02-22 10:38:29.877028 7fd24952c780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2015-02-22 10:38:31.487359 7fd24952c780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_features: syscall(SYS_syncfs, fd) fully supported
2015-02-22 10:38:31.487538 7fd24952c780 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature: CLONE_RANGE ioctl is supported
2015-02-22 10:38:31.996129 7fd24952c780 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature: SNAP_CREATE is supported
2015-02-22 10:38:32.016044 7fd24952c780 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature: SNAP_DESTROY is supported
2015-02-22 10:38:32.016447 7fd24952c780 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature: START_SYNC is supported (transid 388132)
2015-02-22 10:38:32.228338 7fd24952c780 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature: WAIT_SYNC is supported
2015-02-22 10:38:32.315391 7fd24952c780 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-12) detect_feature: SNAP_CREATE_V2 is supported
2015-02-22 10:38:40.051466 7fd24952c780 0 filestore(/var/lib/ceph/osd/ceph-12) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2015-02-22 10:38:48.814208 7fd24952c780 1 journal _open /var/lib/ceph/osd/ceph-12/journal fd 22: 5367660544 bytes, block size 4096 bytes, directio = 1, aio = 1

ceph-osd.13.log seemed to cease earlier today. This osd is up though, so I'm guessing that's ok.

...
2015-02-22 10:39:59.743880 7f1a37eb3700 -1 osd.13 63267 heartbeat_check: no reply from osd.9 ever on either front or back, first ping sent 2015-02-22 10:39:12.431674 (cutoff 2015-02-22 10:39:39.743877)
2015-02-22 10:56:28.549016 7f1a139a0700 0 -- 192.168.12.29:6806/3462 >> 192.168.12.26:6806/3610 pipe(0x38d2f00 sd=86 :43837 s=2 pgs=15 cs=1 l=0 c=0x45de89a0).fault with nothing to send, going to standby
2015-02-22 10:56:28.612828 7f1a1379e700 0 -- 192.168.12.29:6806/3462 >> 192.168.12.25:6801/3290 pipe(0x38d2a00 sd=83 :49115 s=2 pgs=73 cs=1 l=0 c=0x45de82c0).fault with nothing to send, going to standby
2015-02-22 10:58:56.937399 7f1a197bd700 0 log [INF] : 4.4f1 deep-scrub ok
2015-02-22 10:59:31.122396 7f1a0f869700 0 -- 192.168.12.29:6806/3462 >> 192.168.12.26:6806/3610 pipe(0x449d6c80 sd=109 :6806 s=0 pgs=0 cs=0 l=0 c=0x4610e160).accept connect_seq 2 vs existing 1 state standby
2015-02-22 11:00:46.476679 7f1a197bd700 0 log [INF] : 4.753 scrub ok
2015-02-22 11:13:03.189352 7f1a197bd700 0 log [INF] : 3.c2 scrub ok
2015-02-22 11:16:43.317444 7f1a0f869700 0 -- 192.168.12.29:6806/3462 >> 192.168.12.26:6806/3610 pipe(0x449d6c80 sd=109 :6806 s=2 pgs=93 cs=3 l=0 c=0x45de89a0).fault with nothing to send, going to standby
2015-02-22 11:16:43.317591 7f1a1379e700 0 -- 192.168.12.29:6806/3462 >> 192.168.12.25:6801/3290 pipe(0x38d2a00 sd=83 :49210 s=2 pgs=93 cs=3 l=0 c=0x45de82c0).fault with nothing to send, going to standby
2015-02-22 11:24:18.421633 7f1a197bd700 0 log [INF] : 3.12b scrub ok

In ceph-osd.14.log, there were a lot of these earlier:

2015-02-22 10:41:00.643830 7fc4b2f2f700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.27:6801/3471 pipe(0xa0822500 sd=102 :58233 s=1 pgs=0 cs=0 l=0 c=0x2acf6f20).connect claims to be 192.168.12.27:6801/3457 not 192.168.12.27:6801/3471 - wrong node!
2015-02-22 10:41:00.643924 7fc4b2f2f700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.27:6801/3471 pipe(0xa0822500 sd=102 :58233 s=1 pgs=0 cs=0 l=0 c=0x2acf6f20).fault with nothing to send, going to standby
2015-02-22 10:41:03.680471 7fc4a26ca700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.27:6806/3653 pipe(0xa0822500 sd=29 :44145 s=1 pgs=0 cs=0 l=0 c=0x58cdb600).connect claims to be 192.168.12.27:6806/3656 not 192.168.12.27:6806/3653 - wrong node!
2015-02-22 10:41:03.680631 7fc4a26ca700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.27:6806/3653 pipe(0xa0822500 sd=29 :44145 s=1 pgs=0 cs=0 l=0 c=0x58cdb600).fault with nothing to send, going to standby
2015-02-22 10:41:03.683499 7fc4b3030700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.26:6801/3331 pipe(0xa0822280 sd=31 :55061 s=1 pgs=0 cs=0 l=0 c=0x4ebc71e0).connect claims to be 192.168.12.26:6801/3345 not 192.168.12.26:6801/3331 - wrong node!
2015-02-22 10:41:03.683576 7fc4b3030700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.26:6801/3331 pipe(0xa0822280 sd=31 :55061 s=1 pgs=0 cs=0 l=0 c=0x4ebc71e0).fault with nothing to send, going to standby
2015-02-22 10:41:03.685207 7fc4a25c9700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.25:6801/3249 pipe(0xa0822c80 sd=32 :49182 s=1 pgs=0 cs=0 l=0 c=0xa0589e40).connect claims to be 192.168.12.25:6801/3290 not 192.168.12.25:6801/3249 - wrong node!
2015-02-22 10:41:03.685246 7fc4a25c9700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.25:6801/3249 pipe(0xa0822c80 sd=32 :49182 s=1 pgs=0 cs=0 l=0 c=0xa0589e40).fault with nothing to send, going to standby
2015-02-22 10:41:04.089452 7fc4a17c1700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.27:6801/3442 pipe(0x3b5f400 sd=110 :58240 s=1 pgs=0 cs=0 l=0 c=0x69922b00).connect claims to be 192.168.12.27:6801/3457 not 192.168.12.27:6801/3442 - wrong node!
2015-02-22 10:41:04.090755 7fc4a17c1700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.27:6801/3442 pipe(0x3b5f400 sd=110 :58240 s=1 pgs=0 cs=0 l=0 c=0x69922b00).fault with nothing to send, going to standby
2015-02-22 10:41:16.373495 7fc4b2f2f700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.28:6811/3880 pipe(0x3b5f680 sd=28 :43896 s=1 pgs=0 cs=0 l=0 c=0x59128580).connect claims to be 0.0.0.0:6811/3825 not 192.168.12.28:6811/3880 - wrong node!
2015-02-22 10:41:16.373584 7fc4b2f2f700 0 -- 0.0.0.0:6801/3175 >> 192.168.12.28:6811/3880 pipe(0x3b5f680 sd=28 :43896 s=1 pgs=0 cs=0 l=0 c=0x59128580).fault with nothing to send, going to standby

Then this, more recently:

    -3> 2015-02-22 15:41:09.738800 7fc4a6dd2700 5 osd.14 60983 heartbeat: osd_stat(997 GB used, 826 GB avail, 1858 GB total, peers []/[] op hist [])
    -2> 2015-02-22 15:41:09.959586 7fc4dc4b5700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fc4cbc6f700' had timed out after 600
    -1> 2015-02-22 15:41:09.959619 7fc4dc4b5700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fc4cbc6f700' had suicide timed out after 18000
     0> 2015-02-22 15:41:09.979187 7fc4dc4b5700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fc4dc4b5700 time 2015-02-22 15:41:09.960169
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

I've upped 'filestore op thread suicide timeout' even higher now and rebooted all nodes to try to start the process again.
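For completeness, the timeouts in question live in the [osd] section of ceph.conf; a sketch with values matching what the log lines above report (600s before an op thread is flagged as timed out, 18000s before the suicide timeout fired), rather than necessarily what I have set right now:

[osd]
    filestore op thread timeout = 600
    filestore op thread suicide timeout = 18000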
After a good 90 minutes, I'm seeing the likes of these on another node:

osd 0: alternating between two messages?

2015-02-22 17:36:49.885502 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:36:49.885532 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:36:54.885663 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:36:54.885698 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:36:59.885766 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:36:59.885787 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:37:04.885940 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:37:04.885968 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600
2015-02-22 17:37:09.886072 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcf303700' had timed out after 600
2015-02-22 17:37:09.886103 7fafdf885700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7fafcfd04700' had timed out after 600

osd 11: alternating between two messages?

2015-02-22 17:36:08.298415 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:13.298487 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:13.298512 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:18.298653 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:18.298740 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:23.298808 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:23.298830 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:28.298921 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600
2015-02-22 17:36:28.298949 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f983098a700' had timed out after 600
2015-02-22 17:36:33.299003 7f983d949700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f982ff89700' had timed out after 600

And osd 3:

2015-02-22 16:05:35.309159 7fe6d4912700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.29:6806/3462 pipe(0x3d1eef00 sd=112 :60063 s=1 pgs=0 cs=0 l=0 c=0x5b689a20).connect claims to be 0.0.0.0:6806/3359 not 192.168.12.29:6806/3462 - wrong node!
2015-02-22 16:05:35.309245 7fe6d4912700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.29:6806/3462 pipe(0x3d1eef00 sd=112 :60063 s=1 pgs=0 cs=0 l=0 c=0x5b689a20).fault with nothing to send, going to standby
2015-02-22 16:05:40.045763 7fe6d4d26700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.26:6806/3610 pipe(0x3d1eea00 sd=85 :58392 s=1 pgs=0 cs=0 l=0 c=0x5b150dc0).connect claims to be 0.0.0.0:6806/3563 not 192.168.12.26:6806/3610 - wrong node!
2015-02-22 16:05:40.045857 7fe6d4d26700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.26:6806/3610 pipe(0x3d1eea00 sd=85 :58392 s=1 pgs=0 cs=0 l=0 c=0x5b150dc0).fault with nothing to send, going to standby
2015-02-22 16:05:43.003097 7fe6d4e27700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.27:6801/3457 pipe(0x3d1ee500 sd=84 :45966 s=1 pgs=0 cs=0 l=0 c=0x5b150b00).connect claims to be 0.0.0.0:6801/3047 not 192.168.12.27:6801/3457 - wrong node!
2015-02-22 16:05:43.003187 7fe6d4e27700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.27:6801/3457 pipe(0x3d1ee500 sd=84 :45966 s=1 pgs=0 cs=0 l=0 c=0x5b150b00).fault with nothing to send, going to standby
2015-02-22 16:06:12.223369 7fe6d4b24700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.27:6806/3656 pipe(0x3d1eec80 sd=73 :41657 s=1 pgs=0 cs=0 l=0 c=0x5b62eb00).connect claims to be 0.0.0.0:6806/3251 not 192.168.12.27:6806/3656 - wrong node!
2015-02-22 16:06:12.223455 7fe6d4b24700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.27:6806/3656 pipe(0x3d1eec80 sd=73 :41657 s=1 pgs=0 cs=0 l=0 c=0x5b62eb00).fault with nothing to send, going to standby
2015-02-22 16:07:50.036067 7fe6d4c25700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.28:6811/3825 pipe(0x3d1ee780 sd=83 :40423 s=1 pgs=0 cs=0 l=0 c=0x5b1511e0).connect claims to be 0.0.0.0:6811/3803 not 192.168.12.28:6811/3825 - wrong node!
2015-02-22 16:07:50.036124 7fe6d4c25700 0 -- 0.0.0.0:6801/3007 >> 192.168.12.28:6811/3825 pipe(0x3d1ee780 sd=83 :40423 s=1 pgs=0 cs=0 l=0 c=0x5b1511e0).fault with nothing to send, going to standby

----

Currently the nodes' CPUs certainly seem to be busy, but I don't appear to be seeing any changes in the status of the cluster. I've read about a similar problem a while back where the osds were lagging in their processing of maps and simply needed to catch up. Is it likely that this is the same? Is there much I can do to help things along, or do I just need to be patient? CPU is still being used, after all, even if the output of 'ceph status' looks the same as it did an hour and a half ago when the machines were restarted.

Update: since I'm having to try to send this email again, some time has passed, and the cluster nodes have now been up for about 21.5 hours. CPU usage is approximately as follows:

192.168.12.25  n54l  50%
192.168.12.26  n54l   1%
192.168.12.27  n36l  10%
192.168.12.28  n36l   1%
192.168.12.29  n54l  10%

So one of the nodes is much busier than the others. But is it actually doing anything?

Thanks again,
Chris
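P.S. On the map catch-up theory: as far as I understand it, you can tell whether an osd is behind on maps by comparing its admin socket status against the current osdmap epoch, e.g.

# ceph daemon osd.13 status    # reports oldest_map / newest_map
# ceph osd stat                # reports the current epoch

In the output above, osd.13's newest_map (63276) matches the cluster's current epoch (e63276), so that osd at least looks caught up; I can't get the same output from the down osds, as shown earlier.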