mds stuck in clientreplay state after failover

Tren Blackburn <tren@xxxxxxxxxxxxxxx> · Tue, 25 Sep 2012 10:22:33 -0700

Hi List;

I'm having an issue where the mds failed over between two nodes, and
now is stuck in "clientreplay" state. It's been like this for several
hours. Here are some details about the environment:

mds/mon server sap:

sap ceph # emerge --info
Portage 2.1.10.65 (default/linux/amd64/10.0, gcc-4.5.3,
glibc-2.14.1-r3, 3.5.3-gentoo x86_64)
=================================================================
System uname: Linux-3.5.3-gentoo-x86_64-Intel-R-_Xeon-R-_CPU_E5-2620_0_@_2.00GHz-with-gentoo-2.1
Timestamp of tree: Sun, 23 Sep 2012 18:00:01 +0000
app-shells/bash:          4.2_p20
dev-lang/python:          2.7.3-r2, 3.1.3-r1, 3.2.3
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.1-r1
sys-apps/openrc:          0.9.8.4
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.65-r1
sys-devel/automake:       1.11.1
sys-devel/binutils:       2.21.1-r1
sys-devel/gcc:            4.4.5, 4.5.3-r2
sys-devel/gcc-config:     1.6
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r1
sys-kernel/linux-headers: 3.4 (virtual/os-headers)
sys-libs/glibc:           2.14.1-r3

mds/mon server fern:

fern ceph # emerge --info
Portage 2.1.10.65 (default/linux/amd64/10.0, gcc-4.5.3,
glibc-2.14.1-r3, 3.5.3-gentoo x86_64)
=================================================================
System uname: Linux-3.5.3-gentoo-x86_64-Intel-R-_Xeon-R-_CPU_E5-2620_0_@_2.00GHz-with-gentoo-2.1
Timestamp of tree: Mon, 24 Sep 2012 18:00:01 +0000
app-shells/bash:          4.2_p20
dev-lang/python:          2.7.3-r2, 3.1.3-r1, 3.2.3
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.1-r1
sys-apps/openrc:          0.9.8.4
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.65-r1
sys-devel/automake:       1.11.1
sys-devel/binutils:       2.21.1-r1
sys-devel/gcc:            4.4.5, 4.5.3-r2
sys-devel/gcc-config:     1.6
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r1
sys-kernel/linux-headers: 3.4 (virtual/os-headers)
sys-libs/glibc:           2.14.1-r3

mds/mon/osd server ocr46-ire

ocr46-ire ceph # emerge --info
Portage 2.1.10.65 (default/linux/amd64/10.0, gcc-4.4.5,
glibc-2.14.1-r3, 3.5.3-gentoo x86_64)
=================================================================
System uname: Linux-3.5.3-gentoo-x86_64-Intel-R-_Xeon-R-_CPU_E5-2620_0_@_2.00GHz-with-gentoo-2.1
Timestamp of tree: Mon, 24 Sep 2012 18:00:01 +0000
app-shells/bash:          4.2_p20
dev-lang/python:          2.7.3-r2, 3.1.3-r1, 3.2.3
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.1-r1
sys-apps/openrc:          0.9.8.4
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.65-r1
sys-devel/automake:       1.11.1
sys-devel/binutils:       2.21.1-r1
sys-devel/gcc:            4.4.5, 4.5.3-r2
sys-devel/gcc-config:     1.6
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r1
sys-kernel/linux-headers: 3.4 (virtual/os-headers)
sys-libs/glibc:           2.14.1-r3

osd servers ocr31-ire - ocr46-ire

Portage 2.1.10.65 (default/linux/amd64/10.0, gcc-4.4.5,
glibc-2.14.1-r3, 3.5.3-gentoo x86_64)
=================================================================
System uname: Linux-3.5.3-gentoo-x86_64-Intel-R-_Xeon-R-_CPU_E5-2620_0_@_2.00GHz-with-gentoo-2.1
Timestamp of tree: Mon, 24 Sep 2012 18:00:01 +0000
app-shells/bash:          4.2_p20
dev-lang/python:          2.7.3-r2, 3.2.3
dev-util/pkgconfig:       0.26
sys-apps/baselayout:      2.1-r1
sys-apps/openrc:          0.9.8.4
sys-apps/sandbox:         2.5
sys-devel/autoconf:       2.65-r1
sys-devel/automake:       1.11.1
sys-devel/binutils:       2.21.1-r1
sys-devel/gcc:            4.5.3-r2
sys-devel/gcc-config:     1.6
sys-devel/libtool:        2.4-r1
sys-devel/make:           3.82-r1
sys-kernel/linux-headers: 3.4 (virtual/os-headers)
sys-libs/glibc:           2.14.1-r3

All ceph servers are running ceph-0.51. Here is the output of ceph -s:

ocr31-ire ~ # ceph -s
   health HEALTH_OK
   monmap e1: 3 mons at
{fern=10.87.1.88:6789/0,ocr46=10.87.1.104:6789/0,sap=10.87.1.87:6789/0},
election epoch 92, quorum 0,1,2 fern,ocr46,sap
   osdmap e60: 192 osds: 192 up, 192 in
    pgmap v47728: 73728 pgs: 73728 active+clean; 290 GB data, 2794 GB
used, 283 TB / 286 TB avail
   mdsmap e19: 1/1/1 up {0=fern=up:clientreplay}, 2 up:standby

Here are the logs from sap, which was the mds master before it was
told to respawn:

2012-09-25 04:45:32.588374 7f2a34a40700  0 mds.0.3 ms_handle_reset on
10.87.1.100:6832/32257
2012-09-25 04:45:32.589064 7f2a34a40700  0 mds.0.3 ms_handle_connect
on 10.87.1.100:6832/32257
2012-09-25 04:45:57.787416 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 04:55:21.718357 7f2a34a40700  0 mds.0.3 ms_handle_reset on
10.87.1.89:6815/8101
2012-09-25 04:55:21.719044 7f2a34a40700  0 mds.0.3 ms_handle_connect
on 10.87.1.89:6815/8101
2012-09-25 04:55:26.758359 7f2a34a40700  0 mds.0.3 ms_handle_reset on
10.87.1.96:6800/6628
2012-09-25 04:55:26.759415 7f2a34a40700  0 mds.0.3 ms_handle_connect
on 10.87.1.96:6800/6628
2012-09-25 05:13:16.367476 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:18:45.177585 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:19:44.911831 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:30:38.178449 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:37:26.597832 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:37:34.088781 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:38:37.548132 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:39:21.528884 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:39:57.791457 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:40:13.579926 7f2a34a40700  0 mds.0.3 handle_mds_beacon
no longer laggy
2012-09-25 05:41:07.598457 7f2a2f709700  0 -- 10.87.1.87:6801/27351 >>
10.87.1.89:0/30567 pipe(0x1db2fc0 sd=100 pgs=2 cs=1 l=0).fault with
nothing to send, going to standby
2012-09-25 05:41:07.603802 7f2a30517700  0 -- 10.87.1.87:6801/27351 >>
10.87.1.94:0/9358 pipe(0x1db2b40 sd=99 pgs=2 cs=1 l=0).fault with
nothing to send, going to standby
2012-09-25 05:41:07.969148 7f2a34a40700  1 mds.-1.-1 handle_mds_map i
(10.87.1.87:6801/27351) dne in the mdsmap, respawning myself
2012-09-25 05:41:07.969154 7f2a34a40700  1 mds.-1.-1 respawn
2012-09-25 05:41:07.969155 7f2a34a40700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
2012-09-25 05:41:07.969157 7f2a34a40700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
2012-09-25 05:41:07.969159 7f2a34a40700  1 mds.-1.-1  1: '-i'
2012-09-25 05:41:07.969160 7f2a34a40700  1 mds.-1.-1  2: 'sap'
2012-09-25 05:41:07.969161 7f2a34a40700  1 mds.-1.-1  3: '--pid-file'
2012-09-25 05:41:07.969162 7f2a34a40700  1 mds.-1.-1  4:
'/var/run/ceph/mds.sap.pid'
2012-09-25 05:41:07.969163 7f2a34a40700  1 mds.-1.-1  5: '-c'
2012-09-25 05:41:07.969164 7f2a34a40700  1 mds.-1.-1  6: '/etc/ceph/ceph.conf'
2012-09-25 05:41:07.969165 7f2a34a40700  1 mds.-1.-1  cwd /
2012-09-25 05:41:08.003262 7fae819af780  0 ceph version 0.51
(commit:c03ca95d235c9a072dcd8a77ad5274a52e93ae30), process ceph-mds,
pid 1173
2012-09-25 05:41:08.005237 7fae7c9bd700  0 mds.-1.0 ms_handle_connect
on 10.87.1.88:6789/0
2012-09-25 05:41:08.802610 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
2012-09-25 05:41:09.602141 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
2012-09-25 05:41:23.772891 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
2012-09-25 05:41:25.273745 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
2012-09-25 05:41:27.994344 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
2012-09-25 05:41:28.588681 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby
2012-09-25 05:41:42.282588 7fae7c9bd700  1 mds.-1.0 handle_mds_map standby

Why did sap get marked as down in the mdsmap?

Here is the mds logs from fern, the node that took over:

2012-09-25 05:41:09.482638 7f6a921c5700  1 mds.0.4 replay_done
2012-09-25 05:41:09.482652 7f6a921c5700  1 mds.0.4 making mds journal writeable
2012-09-25 05:41:09.483589 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.94:6800/19702
2012-09-25 05:41:09.483602 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.101:6818/7441
2012-09-25 05:41:09.483606 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.90:6812/13521
2012-09-25 05:41:09.483623 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.104:6825/30845
2012-09-25 05:41:09.602861 7f6a95ae2700  1 mds.0.4 handle_mds_map i am
now mds.0.4
2012-09-25 05:41:09.602865 7f6a95ae2700  1 mds.0.4 handle_mds_map
state change up:replay --> up:reconnect
2012-09-25 05:41:09.602867 7f6a95ae2700  1 mds.0.4 reconnect_start
2012-09-25 05:41:09.602869 7f6a95ae2700  1 mds.0.4 reopen_log
2012-09-25 05:41:09.602879 7f6a95ae2700  1 mds.0.server
reconnect_clients -- 2 sessions
2012-09-25 05:41:10.793524 7f6a95ae2700  0 log [DBG] : reconnect by
client.4755 10.87.1.89:0/30567 after 1.190611
2012-09-25 05:41:10.793568 7f6a95ae2700  0 mds.0.server non-auth
10000000004 #10000000004, will pass off to authority
2012-09-25 05:41:10.793594 7f6a95ae2700  0 mds.0.server non-auth
10000000005 #10000000005, will pass off to authority
2012-09-25 05:41:10.793609 7f6a95ae2700  0 mds.0.server non-auth
10000000006 #10000000006, will pass off to authority
2012-09-25 05:41:10.793625 7f6a95ae2700  0 mds.0.server non-auth
100000000d7 #100000000d7, will pass off to authority
2012-09-25 05:41:10.793636 7f6a95ae2700  0 mds.0.server non-auth
100000000d8 #100000000d8, will pass off to authority
2012-09-25 05:41:10.793653 7f6a95ae2700  0 mds.0.server non-auth
100000000d9 #100000000d9, will pass off to authority
2012-09-25 05:41:10.793664 7f6a95ae2700  0 mds.0.server non-auth
100000000da #100000000da, will pass off to authority
2012-09-25 05:41:10.793675 7f6a95ae2700  0 mds.0.server non-auth
100000000db #100000000db, will pass off to authority
...
Goes on for a few hundred thousand lines. It seems like there's a race
here somewhere as it made about 860,000 lines of this "server
non-auth"
...
2012-09-25 05:41:19.474557 7f6a95ae2700  0 mds.0.server non-auth
100002b8a83 #100002b8a83, will pass off to authority
2012-09-25 05:41:19.474568 7f6a95ae2700  0 mds.0.server non-auth
100002b8a84 #100002b8a84, will pass off to authority
2012-09-25 05:41:19.474579 7f6a95ae2700  0 mds.0.server non-auth
100002b8a87 #100002b8a87, will pass off to authority
2012-09-25 05:41:19.474588 7f6a95ae2700  0 mds.0.server non-auth
100002b8a89 #100002b8a89, will pass off to authority
2012-09-25 05:41:19.474605 7f6a95ae2700  0 mds.0.server non-auth
100002b8a8a #100002b8a8a, will pass off to authority
2012-09-25 05:41:19.474626 7f6a95ae2700  0 mds.0.server non-auth
100002b8a8b #100002b8a8b, will pass off to authority
2012-09-25 05:41:19.474637 7f6a95ae2700  0 mds.0.server non-auth
100002b8a8c #100002b8a8c, will pass off to authority
2012-09-25 05:41:19.474652 7f6a95ae2700  0 mds.0.server non-auth
100002b8a8d #100002b8a8d, will pass off to authority
2012-09-25 05:41:19.474667 7f6a95ae2700  0 mds.0.server non-auth
100002b8a8f #100002b8a8f, will pass off to authority
2012-09-25 05:41:19.474689 7f6a95ae2700  0 mds.0.server non-auth
100002b8a96 #100002b8a96, will pass off to authority
2012-09-25 05:41:19.484342 7f6a95ae2700  0 mds.0.server missing
100002ba853 #10000000002/ArchiveOutbound/111/1111968/20110701/385439240.tif
(mine), will load later
2012-09-25 05:41:19.484361 7f6a95ae2700  0 mds.0.server missing
100002ba854 #10000000002/ArchiveOutbound/111/1111968/20110701/385439251.tif
(mine), will load later
2012-09-25 05:41:19.484367 7f6a95ae2700  0 mds.0.server missing
100002ba855 #10000000002/ArchiveOutbound/111/1111968/20110701/385439350.tif
(mine), will load later
2012-09-25 05:41:19.484378 7f6a95ae2700  0 mds.0.server missing
100002ba856 #10000000002/ArchiveOutbound/111/1111968/20110701/385439352.tif
(mine), will load later
2012-09-25 05:41:19.484382 7f6a95ae2700  0 mds.0.server missing
100002ba857 #10000000002/ArchiveOutbound/111/1111968/20110701/385439354.tif
(mine), will load later
2012-09-25 05:41:19.484387 7f6a95ae2700  0 mds.0.server missing
100002ba858 #10000000002/ArchiveOutbound/111/1111968/20110701/385439430.tif
(mine), will load later
2012-09-25 05:41:19.484393 7f6a95ae2700  0 mds.0.server missing
100002ba859 #10000000002/ArchiveOutbound/111/1111968/20110701/.385439444.tif.RxChkh
(mine), will load later
2012-09-25 05:41:19.484401 7f6a95ae2700  1 mds.0.4 reconnect_done
2012-09-25 05:41:23.835106 7f6a95ae2700  0 mds.0.4 handle_mds_beacon
no longer laggy
2012-09-25 05:41:27.994961 7f6a95ae2700  1 mds.0.4 handle_mds_map i am
now mds.0.4
2012-09-25 05:41:27.994965 7f6a95ae2700  1 mds.0.4 handle_mds_map
state change up:reconnect --> up:rejoin
2012-09-25 05:41:27.994968 7f6a95ae2700  1 mds.0.4 rejoin_joint_start
2012-09-25 05:41:28.424292 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.98:6827/21424
2012-09-25 05:41:28.441796 7f6a95ae2700  1 mds.0.4 rejoin_done
2012-09-25 05:41:28.589425 7f6a95ae2700  1 mds.0.4 handle_mds_map i am
now mds.0.4
2012-09-25 05:41:28.589428 7f6a95ae2700  1 mds.0.4 handle_mds_map
state change up:rejoin --> up:clientreplay
2012-09-25 05:41:28.589430 7f6a95ae2700  1 mds.0.4 recovery_done --
successful recovery!
2012-09-25 05:41:28.598110 7f6a95ae2700  1 mds.0.4 clientreplay_start
2012-09-25 05:41:28.613610 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.103:6833/9804
2012-09-25 05:41:28.615111 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.95:6818/8725
2012-09-25 05:41:28.616342 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.98:6812/20834
2012-09-25 05:41:28.618528 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.102:6809/21045
2012-09-25 05:41:28.619851 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.98:6821/21183
2012-09-25 05:41:28.622614 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.89:6809/7883
2012-09-25 05:56:07.824706 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.103:6818/9242
2012-09-25 05:56:07.824827 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.100:6816/31393
2012-09-25 05:56:07.825437 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.103:6818/9242
2012-09-25 05:56:07.834669 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.96:6809/6968
2012-09-25 05:56:07.835271 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.96:6809/6968
2012-09-25 05:56:08.094667 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.101:6824/7669
2012-09-25 05:56:08.095320 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.101:6824/7669
2012-09-25 05:56:08.104670 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.94:6803/19811
2012-09-25 05:56:08.105254 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.94:6803/19811
2012-09-25 05:56:08.464684 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.99:6815/4360
2012-09-25 05:56:08.465283 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.99:6815/4360
2012-09-25 05:56:08.554740 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.101:6830/7898
2012-09-25 05:56:08.554815 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.96:6827/7710
2012-09-25 05:56:08.555387 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.96:6827/7710
2012-09-25 05:56:08.555395 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.101:6830/7898
2012-09-25 05:56:08.564681 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.104:6816/30466
2012-09-25 05:56:08.564759 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.93:6833/28547
2012-09-25 05:56:08.565390 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.104:6816/30466
2012-09-25 05:56:08.565415 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.93:6833/28547
2012-09-25 05:56:08.574702 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.98:6830/21540
2012-09-25 05:56:08.574746 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.104:6804/29957
2012-09-25 05:56:08.575357 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.104:6804/29957
2012-09-25 05:56:08.575366 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.98:6830/21540
2012-09-25 05:56:08.584683 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.101:6806/6990
2012-09-25 05:56:08.585261 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.101:6806/6990
2012-09-25 05:56:08.594677 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.100:6828/31704
2012-09-25 05:56:08.654687 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.102:6833/21939
2012-09-25 05:56:08.655264 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.102:6833/21939
2012-09-25 05:56:08.694689 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.90:6833/14293
2012-09-25 05:56:08.695442 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.90:6833/14293
...
2012-09-25 09:41:30.183380 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.95:6818/8725
2012-09-25 09:41:30.202467 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.89:6809/7883
2012-09-25 09:41:30.203014 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.89:6809/7883
2012-09-25 09:41:30.224668 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.98:6812/20834
2012-09-25 09:41:30.254959 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.98:6812/20834
2012-09-25 09:41:30.274665 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.98:6821/21183
2012-09-25 09:41:30.286990 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.93:6827/28326
2012-09-25 09:41:30.287631 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.93:6827/28326
2012-09-25 09:41:30.294907 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.98:6821/21183
2012-09-25 09:41:30.353745 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.99:6833/5044
2012-09-25 09:41:30.354364 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.99:6833/5044
2012-09-25 09:41:32.934674 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.104:6819/30603
2012-09-25 09:41:32.935642 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.104:6819/30603
2012-09-25 09:41:33.254681 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.100:6854/32722
2012-09-25 09:41:33.279245 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.100:6854/32722
2012-09-25 09:41:40.012905 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.103:6833/9804
2012-09-25 09:41:40.013620 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.103:6833/9804
2012-09-25 09:41:51.574710 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.93:6800/27331
2012-09-25 09:41:51.581248 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.93:6800/27331
2012-09-25 09:42:01.164453 7f6a95ae2700  0 mds.0.4 ms_handle_reset on
10.87.1.94:6815/20249
2012-09-25 09:42:01.165036 7f6a95ae2700  0 mds.0.4 ms_handle_connect
on 10.87.1.94:6815/20249

I enabled a higher level of debugging from ferns mds server:

ceph mds tell 'fern' injectargs "--debug_mds 20"

Here is what the mds logs show on fern:

2012-09-25 09:54:50.017750 7f6a95ae2700 -1 mds.0.4 applying
configuration change: internal_safe_to_start_threads = 'true'

2012-09-25 09:54:52.770191 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16692 (currently up:clientreplay)
2012-09-25 09:54:52.770761 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16692 rtt 0.000562
2012-09-25 09:54:54.723205 7f6a941de700 15 mds.0.bal get_load
mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2012-09-25 09:54:54.723241 7f6a941de700 10 mds.0.locker scatter_tick
2012-09-25 09:54:54.723244 7f6a941de700 10 mds.0.server
find_idle_sessions.  laggy until 2012-09-25 05:41:23.835106
2012-09-25 09:54:54.723256 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567
2012-09-25 09:54:54.723262 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567 and sufficiently new
(2012-09-25 09:54:45.380736)
2012-09-25 09:54:56.770279 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16693 (currently up:clientreplay)
2012-09-25 09:54:56.770782 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16693 rtt 0.000489
2012-09-25 09:54:59.723269 7f6a941de700 15 mds.0.bal get_load
mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2012-09-25 09:54:59.723304 7f6a941de700 10 mds.0.locker scatter_tick
2012-09-25 09:54:59.723306 7f6a941de700 10 mds.0.server
find_idle_sessions.  laggy until 2012-09-25 05:41:23.835106
2012-09-25 09:54:59.723312 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567
2012-09-25 09:54:59.723318 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567 and sufficiently new
(2012-09-25 09:54:45.380736)
2012-09-25 09:55:00.770376 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16694 (currently up:clientreplay)
2012-09-25 09:55:00.770875 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16694 rtt 0.000491
2012-09-25 09:55:04.723314 7f6a941de700 15 mds.0.bal get_load
mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2012-09-25 09:55:04.723340 7f6a941de700 10 mds.0.locker scatter_tick
2012-09-25 09:55:04.723344 7f6a941de700 10 mds.0.server
find_idle_sessions.  laggy until 2012-09-25 05:41:23.835106
2012-09-25 09:55:04.723349 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567
2012-09-25 09:55:04.723355 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567 and sufficiently new
(2012-09-25 09:54:45.380736)
2012-09-25 09:55:04.770462 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16695 (currently up:clientreplay)
2012-09-25 09:55:04.771033 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16695 rtt 0.000566
2012-09-25 09:55:05.381914 7f6a95ae2700 20 mds.0.server get_session
have 0x3419d40 client.4755 10.87.1.89:0/30567 state open
2012-09-25 09:55:05.381922 7f6a95ae2700  3 mds.0.server
handle_client_session client_session(request_renewcaps seq 3303) v1
from client.4755
2012-09-25 09:55:06.426145 7f6a95ae2700 20 mds.0.server get_session
have 0x302b680 client.4746 10.87.1.94:0/9358 state open
2012-09-25 09:55:06.426153 7f6a95ae2700  3 mds.0.server
handle_client_session client_session(request_renewcaps seq 3304) v1
from client.4746
2012-09-25 09:55:08.770543 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16696 (currently up:clientreplay)
2012-09-25 09:55:08.771030 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16696 rtt 0.000479
2012-09-25 09:55:09.723353 7f6a941de700 15 mds.0.bal get_load
mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2012-09-25 09:55:09.723372 7f6a941de700 10 mds.0.locker scatter_tick
2012-09-25 09:55:09.723374 7f6a941de700 10 mds.0.server
find_idle_sessions.  laggy until 2012-09-25 05:41:23.835106
2012-09-25 09:55:09.723379 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567
2012-09-25 09:55:09.723383 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567 and sufficiently new
(2012-09-25 09:55:05.381926)
2012-09-25 09:55:12.770623 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16697 (currently up:clientreplay)
2012-09-25 09:55:12.771158 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16697 rtt 0.000529
2012-09-25 09:55:14.723412 7f6a941de700 15 mds.0.bal get_load
mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2012-09-25 09:55:14.723431 7f6a941de700 10 mds.0.locker scatter_tick
2012-09-25 09:55:14.723433 7f6a941de700 10 mds.0.server
find_idle_sessions.  laggy until 2012-09-25 05:41:23.835106
2012-09-25 09:55:14.723438 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567
2012-09-25 09:55:14.723442 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567 and sufficiently new
(2012-09-25 09:55:05.381926)
2012-09-25 09:55:16.770700 7f6a941de700 10 mds.0.4 beacon_send
up:clientreplay seq 16698 (currently up:clientreplay)
2012-09-25 09:55:16.771198 7f6a95ae2700 10 mds.0.4 handle_mds_beacon
up:clientreplay seq 16698 rtt 0.000491
2012-09-25 09:55:19.723466 7f6a941de700 15 mds.0.bal get_load
mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
2012-09-25 09:55:19.723485 7f6a941de700 10 mds.0.locker scatter_tick
2012-09-25 09:55:19.723487 7f6a941de700 10 mds.0.server
find_idle_sessions.  laggy until 2012-09-25 05:41:23.835106
2012-09-25 09:55:19.723491 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567
2012-09-25 09:55:19.723496 7f6a941de700 20 mds.0.server laggiest
active session is client.4755 10.87.1.89:0/30567 and sufficiently new
(2012-09-25 09:55:05.381926)

The mds is still in clientreplay. I have not attempted to restart the
mds process yet as I wanted to leave it in this state so we can
hopefully provide whatever information is needed to get this resolved.

Please let me know if there is any extra information I can provide
that would be helpful.

Regards,

Tren
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html