I created a single-node Ceph cluster (v0.58) on a VM. Here is my conf file:
[global]
auth client required = none
auth cluster required = none
auth service required = none
[osd]
osd journal data =
filestore xattr use omap = true
# osd data =
[mon.a]
host = varunc3-virtual-machine
mon addr = 10.72.148.201:6789
# mon data =
[mds.a]
host = varunc3-virtual-machine
# mds data =
[osd.0]
host = varunc3-virtual-machine
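For reference, here is roughly how the daemons defined above can be checked from the same host (a sketch assuming the stock sysvinit script from the 0.5x packages; the exact service invocation may differ on your distro):

$ sudo service ceph status    # status of all local daemons listed in ceph.conf
$ ceph mon stat               # should show mon.a in quorum
$ ceph osd stat               # expect "1 osds: 1 up, 1 in"
$ ceph mds stat               # shows the MDS state (up:replay here)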
Here is the output of ceph -s:
varunc@varunc3-virtual-machine:~$ ceph -s
health HEALTH_WARN 392 pgs degraded; 392 pgs stuck unclean; mds a is laggy
monmap e1: 1 mons at {a=10.72.148.201:6789/0}, election epoch 1, quorum 0 a
osdmap e45: 1 osds: 1 up, 1 in
pgmap v177: 392 pgs: 392 active+degraded; 0 bytes data, 13007 MB used, 62744 MB / 79745 MB avail
mdsmap e946: 1/1/1 up {0=a=up:replay(laggy or crashed)}
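As an aside, with a single OSD and the default replica count of 2, all 392 PGs staying active+degraded is expected, so that part of the warning is probably harmless on a one-node test cluster. If needed, the default pools could be dropped to a single replica (a sketch, assuming the default data/metadata/rbd pools are still in place):

$ ceph osd pool set data size 1
$ ceph osd pool set metadata size 1
$ ceph osd pool set rbd size 1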
I believe the laggy MDS is why I am not able to mount the Ceph file system. I tried going through the MDS log but could not understand much. I am pasting the part of it that shows errors (should I paste the whole thing?):
-29> 2013-03-26 19:25:58.301027 b4781b40 1 -- 10.72.148.201:6800/16609 <== mon.0 10.72.148.201:6789/0 10 ==== mdsbeacon(4897/a up:replay seq 2 v909) v2 ==== 103+0+0 (1650300491 0 0) 0x9e2e380 con 0x9e36200
-28> 2013-03-26 19:26:00.824303 b1d7ab40 0 -- 10.72.148.201:6800/16609 >> 10.72.148.201:6801/16036 pipe(0x9e2e540 sd=17 :49340 s=1 pgs=0 cs=0 l=1).connect claims to be 10.72.148.201:6801/16695 not 10.72.148.201:6801/16036 - wrong node!
-27> 2013-03-26 19:26:00.824384 b1d7ab40 2 -- 10.72.148.201:6800/16609 >> 10.72.148.201:6801/16036 pipe(0x9e2e540 sd=17 :49340 s=1 pgs=0 cs=0 l=1).fault 107: Transport endpoint is not connected
-26> 2013-03-26 19:26:02.300921 b257bb40 10 monclient: _send_mon_message to mon.a at 10.72.148.201:6789/0
-25> 2013-03-26 19:26:02.300954 b257bb40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6789/0 -- mdsbeacon(4897/a up:replay seq 3 v909) v2 -- ?+0 0x9e2e8c0 con 0x9e36200
-24> 2013-03-26 19:26:02.301264 b4781b40 1 -- 10.72.148.201:6800/16609 <== mon.0 10.72.148.201:6789/0 11 ==== mdsbeacon(4897/a up:replay seq 3 v909) v2 ==== 103+0+0 (460647212 0 0) 0x9e2ec40 con 0x9e36200
-23> 2013-03-26 19:26:06.301163 b257bb40 10 monclient: _send_mon_message to mon.a at 10.72.148.201:6789/0
-22> 2013-03-26 19:26:06.301200 b257bb40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6789/0 -- mdsbeacon(4897/a up:replay seq 4 v909) v2 -- ?+0 0x9e2e700 con 0x9e36200
-21> 2013-03-26 19:26:06.301512 b4781b40 1 -- 10.72.148.201:6800/16609 <== mon.0 10.72.148.201:6789/0 12 ==== mdsbeacon(4897/a up:replay seq 4 v909) v2 ==== 103+0+0 (1900474344 0 0) 0x9e2ea80 con 0x9e36200
-20> 2013-03-26 19:26:07.224712 b1d7ab40 0 -- 10.72.148.201:6800/16609 >> 10.72.148.201:6801/16036 pipe(0x9e2e540 sd=17 :49341 s=1 pgs=0 cs=0 l=1).connect claims to be 10.72.148.201:6801/16695 not 10.72.148.201:6801/16036 - wrong node!
-19> 2013-03-26 19:26:07.224782 b1d7ab40 2 -- 10.72.148.201:6800/16609 >> 10.72.148.201:6801/16036 pipe(0x9e2e540 sd=17 :49341 s=1 pgs=0 cs=0 l=1).fault 107: Transport endpoint is not connected
-18> 2013-03-26 19:26:07.299025 b377fb40 10 monclient: tick
-17> 2013-03-26 19:26:07.299047 b377fb40 10 monclient: _check_auth_rotating renewing rotating keys (they expired before 2013-03-26 19:25:37.299046)
-16> 2013-03-26 19:26:07.299072 b377fb40 10 monclient: renew subs? (now: 2013-03-26 19:26:07.299071; renew after: 2013-03-26 19:28:24.298915) -- no
-15> 2013-03-26 19:26:09.300863 b257bb40 10 monclient: renew_subs
-14> 2013-03-26 19:26:09.300892 b257bb40 10 monclient: _send_mon_message to mon.a at 10.72.148.201:6789/0
-13> 2013-03-26 19:26:09.300911 b257bb40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6789/0 -- mon_subscribe({mdsmap=910+,monmap=2+,osdmap=42}) v2 -- ?+0 0x9e35360 con 0x9e36200
-12> 2013-03-26 19:26:09.301011 b257bb40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6801/16036 -- ping v1 -- ?+0 0x9e35d80 con 0x9e36400
-11> 2013-03-26 19:26:09.301341 b1d7ab40 0 -- 10.72.148.201:6800/16609 >> 10.72.148.201:6801/16036 pipe(0x9e2e540 sd=17 :49342 s=1 pgs=0 cs=0 l=1).connect claims to be 10.72.148.201:6801/16695 not 10.72.148.201:6801/16036 - wrong node!
-10> 2013-03-26 19:26:09.301409 b1d7ab40 2 -- 10.72.148.201:6800/16609 >> 10.72.148.201:6801/16036 pipe(0x9e2e540 sd=17 :49342 s=1 pgs=0 cs=0 l=1).fault 107: Transport endpoint is not connected
 -9> 2013-03-26 19:26:09.301812 b4781b40 1 -- 10.72.148.201:6800/16609 <== mon.0 10.72.148.201:6789/0 13 ==== osd_map(42..45 src has 1..45) v3 ==== 1167+0+0 (3338985292 0 0) 0x9e2dc60 con 0x9e36200
 -8> 2013-03-26 19:26:09.301887 b4781b40 1 -- 10.72.148.201:6800/16609 mark_down 0x9e36400 -- 0x9e2e540
 -7> 2013-03-26 19:26:09.302019 b4781b40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6801/16695 -- osd_op(mds.0.14:1 mds0_inotable [read 0~0] 1.b852b893 RETRY) v4 -- ?+0 0x9e26900 con 0x9e36700
 -6> 2013-03-26 19:26:09.302036 b4781b40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6801/16695 -- osd_op(mds.0.14:2 mds0_sessionmap [read 0~0] 1.3270c60b RETRY) v4 -- ?+0 0x9e26d80 con 0x9e36700
 -5> 2013-03-26 19:26:09.302051 b4781b40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6801/16695 -- osd_op(mds.0.14:3 mds_anchortable [read 0~0] 1.a977f6a7 RETRY) v4 -- ?+0 0x9e4d600 con 0x9e36700
 -4> 2013-03-26 19:26:09.302060 b4781b40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6801/16695 -- osd_op(mds.0.14:4 mds_snaptable [read 0~0] 1.d90270ad RETRY) v4 -- ?+0 0x9e4d480 con 0x9e36700
 -3> 2013-03-26 19:26:09.302073 b4781b40 1 -- 10.72.148.201:6800/16609 --> 10.72.148.201:6801/16695 -- osd_op(mds.0.14:5 200.00000000 [read 0~0] 1.844f3494 RETRY) v4 -- ?+0 0x9e4d300 con 0x9e36700
 -2> 2013-03-26 19:26:09.302472 b4781b40 0 mds.0.14 ms_handle_connect on 10.72.148.201:6801/16695
 -1> 2013-03-26 19:26:09.303976 b4781b40 1 -- 10.72.148.201:6800/16609 <== osd.0 10.72.148.201:6801/16695 1 ==== osd_op_reply(1 mds0_inotable [read 0~0] ack = -2 (No such file or directory)) v4 ==== 112+0+0 (3010998831 0 0) 0x9e2d2c0 con 0x9e36700
  0> 2013-03-26 19:26:09.305543 b4781b40 -1 mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread b4781b40 time 2013-03-26 19:26:09.304022
mds/MDSTable.cc: 150: FAILED assert(0)
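If I read the tail of the log correctly, the MDS asserts because its very first read of the mds0_inotable object returns -2 (No such file or directory), i.e. the table objects it expects are missing from the metadata pool. A way to check what is actually stored there (a sketch; "metadata" is the default CephFS metadata pool name in this release):

$ ceph osd lspools                 # list pool ids and names
$ rados -p metadata ls | grep -E 'inotable|sessionmap|anchortable|snaptable'
$ ceph mds dump                    # current mdsmap, including in/failed MDS ranks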
How do I get the MDS running?
Regards,
Varun