mds replay forever after a power failure

Hi,

I have a single-node Ceph cluster. After a power failure, the MDS is stuck replaying its journal and CephFS has stopped working.

ceph -s gives:

health HEALTH_WARN
            mds cluster is degraded
            noscrub,nodeep-scrub flag(s) set
     monmap e4: 1 mons at {redbox=10.1.0.194:6789/0}
            election epoch 41, quorum 0 redbox
      fsmap e68: 1/1/1 up {0=redbox=up:replay}
        mgr no daemons active 
     osdmap e369: 10 osds: 10 up, 10 in
            flags noscrub,nodeep-scrub,sortbitwise,require_jewel_osds,require_kraken_osds
      pgmap v528504: 870 pgs, 11 pools, 24050 GB data, 6030 kobjects
            48200 GB used, 44938 GB / 93139 GB avail
                 870 active+clean


The MDS log shows:

2017-07-12 16:29:15.763089 7fe48be74380  0 ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7), process ceph-mds, pid 6925
2017-07-12 16:29:15.764181 7fe48be74380  0 pidfile_write: ignore empty --pid-file
2017-07-12 16:29:15.766205 7fe48be74380 -1 WARNING: the following dangerous and experimental features are enabled: bluestore
2017-07-12 16:29:20.243839 7fe48534e700  1 mds.0.68 handle_mds_map i am now mds.0.68
2017-07-12 16:29:20.243852 7fe48534e700  1 mds.0.68 handle_mds_map state change up:boot --> up:replay
2017-07-12 16:29:20.243880 7fe48534e700  1 mds.0.68 replay_start
2017-07-12 16:29:20.243886 7fe48534e700  1 mds.0.68  recovery set is
2017-07-12 16:29:20.243892 7fe48534e700  1 mds.0.68  waiting for osdmap 369 (which blacklists prior instance)
2017-07-12 16:31:10.333118 7fe4872ef700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d6000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:10.533666 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d7800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:10.934493 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d9000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:11.244383 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d9000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:11.445010 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d6000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:11.848886 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d7800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:12.650186 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af3675800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:31:14.252245 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - 

The line above just keeps repeating.

I tried to bump the logging levels with:

ceph tell mds.redbox injectargs --debug_journaler 20
ceph tell mds.redbox injectargs --debug_mds 20
ceph tell mds.redbox injectargs --debug_mds_log 20
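
To verify the injected values, I believe the running config can also be queried over the daemon's admin socket (assuming the socket is enabled), with something like:

ceph daemon mds.redbox config show | grep debug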

With the higher debug levels, the log shows:

2017-07-12 16:33:36.675521 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d9000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:33:51.676004 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d7800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:34:06.679509 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d6000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:34:21.680076 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af3678800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:34:36.684580 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af36d7800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:34:40.401344 7fe48534e700  5 mds.redbox ms_handle_reset on 10.1.0.194:0/193586809
2017-07-12 16:34:40.401354 7fe48534e700  3 mds.redbox ms_handle_reset closing connection for session client.144148 10.1.0.194:0/193586809
2017-07-12 16:34:40.776533 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:34:40.776631 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:34:40.776703 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:34:43.777983 7fe482348700 10 mds.beacon.redbox _send up:replay seq 83
2017-07-12 16:34:43.778910 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 83 rtt 0.000903
2017-07-12 16:34:45.776623 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:34:45.776693 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:34:45.776744 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:34:47.778093 7fe482348700 10 mds.beacon.redbox _send up:replay seq 84
2017-07-12 16:34:47.779079 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 84 rtt 0.000962
2017-07-12 16:34:50.776727 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:34:50.776801 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:34:50.776850 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:34:51.685148 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af3745000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:34:51.778187 7fe482348700 10 mds.beacon.redbox _send up:replay seq 85
2017-07-12 16:34:51.778863 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 85 rtt 0.000657
2017-07-12 16:34:55.776822 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:34:55.776884 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:34:55.776923 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:34:55.778278 7fe482348700 10 mds.beacon.redbox _send up:replay seq 86
2017-07-12 16:34:55.778949 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 86 rtt 0.000654
2017-07-12 16:34:59.778370 7fe482348700 10 mds.beacon.redbox _send up:replay seq 87
2017-07-12 16:34:59.779040 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 87 rtt 0.000651
2017-07-12 16:35:00.776923 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:35:00.776993 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:35:00.777036 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:35:03.778470 7fe482348700 10 mds.beacon.redbox _send up:replay seq 88
2017-07-12 16:35:03.779226 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 88 rtt 0.000736
2017-07-12 16:35:05.777045 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:35:05.777104 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:35:05.777142 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:35:06.686551 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af3746800 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:35:07.778569 7fe482348700 10 mds.beacon.redbox _send up:replay seq 89
2017-07-12 16:35:07.779191 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 89 rtt 0.000604
2017-07-12 16:35:10.777127 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:35:10.777183 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:35:10.777220 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:35:11.778672 7fe482348700 10 mds.beacon.redbox _send up:replay seq 90
2017-07-12 16:35:11.779282 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 90 rtt 0.000592
2017-07-12 16:35:15.777223 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:35:15.777278 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:35:15.777316 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:35:15.778789 7fe482348700 10 mds.beacon.redbox _send up:replay seq 91
2017-07-12 16:35:15.779710 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 91 rtt 0.000905
2017-07-12 16:35:19.778907 7fe482348700 10 mds.beacon.redbox _send up:replay seq 92
2017-07-12 16:35:19.779649 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 92 rtt 0.000706
2017-07-12 16:35:20.777333 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:35:20.777418 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2017-07-12 16:35:20.777463 7fe482b49700 20 mds.beacon.redbox 0 slow request found
2017-07-12 16:35:21.687091 7fe487af0700  0 -- 10.1.0.194:6800/422236473 >> - conn(0x559af374f000 :6800 s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to send and in the half  accept state just closed
2017-07-12 16:35:23.778991 7fe482348700 10 mds.beacon.redbox _send up:replay seq 93
2017-07-12 16:35:23.779626 7fe48534e700 10 mds.beacon.redbox handle_mds_beacon up:replay seq 93 rtt 0.000618
2017-07-12 16:35:25.777436 7fe482b49700 20 mds.0.bal get_load no root, no load
2017-07-12 16:35:25.777500 7fe482b49700 15 mds.0.bal get_load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0.08>
2017-07-12 16:35:25.777542 7fe482b49700 20 mds.beacon.redbox 0 slow request found

I cannot see what is going wrong here.

I looked at http://docs.ceph.com/docs/hammer/cephfs/disaster-recovery/, and the first two commands on that page simply hang. I have not tried truncating the journal just yet.
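
For reference, the commands I mean from that page (as best I can tell; exact invocations may vary by release) are:

cephfs-journal-tool journal export backup.bin
cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset

The first two hang as described; the last one is the journal truncation step I have been avoiding.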

Any help would be appreciated. Thanks!

Regards,
Zhan
