Re: MDS crash when client goes to sleep

I tried disconnecting and reconnecting the Ethernet cable; the MDS didn't crash. I rebooted the client; the MDS didn't crash. I let the client go to sleep; the MDS still didn't crash (I checked from a different computer while the client was asleep). When I woke the client up and tried to access the filesystem (a simple ls command), the MDS crashed. Two other MDS daemons are still alive, but failover didn't happen. Below is a log snippet. Let me know if you need more info.

Snip of ceph-mds.MDS1.1.log:

   -64> 2014-03-22 17:18:55.426912 7f000a983700 10 monclient: tick
   -63> 2014-03-22 17:18:55.426948 7f000a983700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-03-22 17:18:25.426945)
   -62> 2014-03-22 17:18:55.426970 7f000a983700 10 monclient: renew subs? (now: 2014-03-22 17:18:55.426969; renew after: 2014-03-22 17:19:05.424545) -- no
   -61> 2014-03-22 17:18:57.735967 7f0009880700  2 mds.0.cache check_memory_usage total 326844, rss 164824, heap 21584, malloc 116104 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 40 inodes have caps, 0 caps, 0 caps per inode
   -60> 2014-03-22 17:18:57.959905 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -59> 2014-03-22 17:18:57.959927 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39763 v6976) v2 -- ?+0 0x308e580 con 0x300c580
   -58> 2014-03-22 17:18:57.960398 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40958 ==== mdsbeacon(6899/MDS1.1 up:active seq 39763 v6976) v2 ==== 108+0+0 (1431336981 0 0) 0x721cb00 con 0x300c580
   -57> 2014-03-22 17:19:01.960055 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -56> 2014-03-22 17:19:01.960083 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39764 v6976) v2 -- ?+0 0x308e2c0 con 0x300c580
   -55> 2014-03-22 17:19:01.960614 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40959 ==== mdsbeacon(6899/MDS1.1 up:active seq 39764 v6976) v2 ==== 108+0+0 (1937787495 0 0) 0x721cdc0 con 0x300c580
   -54> 2014-03-22 17:19:02.736010 7f0009880700  2 mds.0.cache check_memory_usage total 326844, rss 164824, heap 21584, malloc 116104 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 40 inodes have caps, 0 caps, 0 caps per inode
   -53> 2014-03-22 17:19:02.736160 7f0009880700  5 mds.0.bal mds.0 epoch 15900 load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
   -52> 2014-03-22 17:19:03.916566 7f000847a700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
   -51> 2014-03-22 17:19:03.916586 7f000847a700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6805/1823 pipe(0x4cad680 sd=17 :56849 s=1 pgs=0 cs=0 l=1 c=0x4c1f340).failed verifying authorize reply
   -50> 2014-03-22 17:19:03.916613 7f000847a700  2 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6805/1823 pipe(0x4cad680 sd=17 :56849 s=1 pgs=0 cs=0 l=1 c=0x4c1f340).fault 0: Success
   -49> 2014-03-22 17:19:05.427080 7f000a983700 10 monclient: tick
   -48> 2014-03-22 17:19:05.427118 7f000a983700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-03-22 17:18:35.427114)
   -47> 2014-03-22 17:19:05.427140 7f000a983700 10 monclient: renew subs? (now: 2014-03-22 17:19:05.427138; renew after: 2014-03-22 17:19:05.424545) -- yes
   -46> 2014-03-22 17:19:05.427159 7f000a983700 10 monclient: renew_subs
   -45> 2014-03-22 17:19:05.427170 7f000a983700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -44> 2014-03-22 17:19:05.427191 7f000a983700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mon_subscribe({mdsmap=6977+,monmap=2+,osdmap=821}) v2 -- ?+0 0x4cc9340 con 0x300c580
   -43> 2014-03-22 17:19:05.427641 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40960 ==== mon_subscribe_ack(300s) v1 ==== 20+0+0 (2253672535 0 0) 0x4c0d180 con 0x300c580
   -42> 2014-03-22 17:19:05.427682 7f000b985700 10 monclient: handle_subscribe_ack sent 2014-03-22 17:19:05.427164 renew after 2014-03-22 17:21:35.427164
   -41> 2014-03-22 17:19:05.960243 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -40> 2014-03-22 17:19:05.960278 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39765 v6976) v2 -- ?+0 0x308e000 con 0x300c580
   -39> 2014-03-22 17:19:05.960869 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40961 ==== mdsbeacon(6899/MDS1.1 up:active seq 39765 v6976) v2 ==== 108+0+0 (2138063297 0 0) 0x308fb80 con 0x300c580
   -38> 2014-03-22 17:19:07.736214 7f0009880700  2 mds.0.cache check_memory_usage total 326844, rss 164824, heap 21584, malloc 116104 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 40 inodes have caps, 0 caps, 0 caps per inode
   -37> 2014-03-22 17:19:07.797362 7f000867c700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
   -36> 2014-03-22 17:19:07.797381 7f000867c700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6800/1449 pipe(0x4cac000 sd=18 :34866 s=1 pgs=0 cs=0 l=1 c=0x4c1f4a0).failed verifying authorize reply
   -35> 2014-03-22 17:19:07.797415 7f000867c700  2 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6800/1449 pipe(0x4cac000 sd=18 :34866 s=1 pgs=0 cs=0 l=1 c=0x4c1f4a0).fault 0: Success
   -34> 2014-03-22 17:19:07.823928 7f0007572700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
   -33> 2014-03-22 17:19:07.823943 7f0007572700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6802/2258 pipe(0x4cacc80 sd=19 :35286 s=1 pgs=0 cs=0 l=1 c=0x4c1edc0).failed verifying authorize reply
   -32> 2014-03-22 17:19:07.823961 7f0007572700  2 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6802/2258 pipe(0x4cacc80 sd=19 :35286 s=1 pgs=0 cs=0 l=1 c=0x4c1edc0).fault 0: Success
   -31> 2014-03-22 17:19:09.960440 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -30> 2014-03-22 17:19:09.960472 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39766 v6976) v2 -- ?+0 0x3091b80 con 0x300c580
   -29> 2014-03-22 17:19:09.960975 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40962 ==== mdsbeacon(6899/MDS1.1 up:active seq 39766 v6976) v2 ==== 108+0+0 (1784700203 0 0) 0x725f340 con 0x300c580
   -28> 2014-03-22 17:19:12.736339 7f0009880700  2 mds.0.cache check_memory_usage total 326844, rss 164824, heap 21584, malloc 116104 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 40 inodes have caps, 0 caps, 0 caps per inode
   -27> 2014-03-22 17:19:12.736531 7f0009880700  5 mds.0.bal mds.0 epoch 15901 load mdsload<[0,0 0]/[0,0 0], req 0, hr 0, qlen 0, cpu 0>
   -26> 2014-03-22 17:19:13.960622 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -25> 2014-03-22 17:19:13.960647 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39767 v6976) v2 -- ?+0 0x30918c0 con 0x300c580
   -24> 2014-03-22 17:19:13.961136 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40963 ==== mdsbeacon(6899/MDS1.1 up:active seq 39767 v6976) v2 ==== 108+0+0 (1720735373 0 0) 0x71d38c0 con 0x300c580
   -23> 2014-03-22 17:19:15.427313 7f000a983700 10 monclient: tick
   -22> 2014-03-22 17:19:15.427349 7f000a983700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-03-22 17:18:45.427345)
   -21> 2014-03-22 17:19:15.427371 7f000a983700 10 monclient: renew subs? (now: 2014-03-22 17:19:15.427370; renew after: 2014-03-22 17:21:35.427164) -- no
   -20> 2014-03-22 17:19:17.736413 7f0009880700  2 mds.0.cache check_memory_usage total 326844, rss 164824, heap 21584, malloc 116104 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 40 inodes have caps, 0 caps, 0 caps per inode
   -19> 2014-03-22 17:19:17.960791 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -18> 2014-03-22 17:19:17.960818 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39768 v6976) v2 -- ?+0 0x3091600 con 0x300c580
   -17> 2014-03-22 17:19:17.961453 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40964 ==== mdsbeacon(6899/MDS1.1 up:active seq 39768 v6976) v2 ==== 108+0+0 (666909135 0 0) 0x71d3b80 con 0x300c580
   -16> 2014-03-22 17:19:18.917895 7f000847a700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
   -15> 2014-03-22 17:19:18.917914 7f000847a700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6805/1823 pipe(0x4cad680 sd=17 :56852 s=1 pgs=0 cs=0 l=1 c=0x4c1f340).failed verifying authorize reply
   -14> 2014-03-22 17:19:18.917939 7f000847a700  2 -- 192.168.1.20:6802/5456 >> 192.168.1.30:6805/1823 pipe(0x4cad680 sd=17 :56852 s=1 pgs=0 cs=0 l=1 c=0x4c1f340).fault 0: Success
   -13> 2014-03-22 17:19:21.960981 7f0009880700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
   -12> 2014-03-22 17:19:21.961021 7f0009880700  1 -- 192.168.1.20:6802/5456 --> 192.168.1.20:6789/0 -- mdsbeacon(6899/MDS1.1 up:active seq 39769 v6976) v2 -- ?+0 0x3091340 con 0x300c580
   -11> 2014-03-22 17:19:21.961630 7f000b985700  1 -- 192.168.1.20:6802/5456 <== mon.0 192.168.1.20:6789/0 40965 ==== mdsbeacon(6899/MDS1.1 up:active seq 39769 v6976) v2 ==== 108+0+0 (724578921 0 0) 0x4c2e000 con 0x300c580
   -10> 2014-03-22 17:19:22.015268 7f0008177700  1 -- 192.168.1.20:6802/5456 >> :/0 pipe(0x303b900 sd=21 :6802 s=0 pgs=0 cs=0 l=0 c=0x300d4a0).accept sd=21 192.168.1.101:39423/0
    -9> 2014-03-22 17:19:22.015348 7f0008177700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.101:0/2977169062 pipe(0x303b900 sd=21 :6802 s=0 pgs=0 cs=0 l=0 c=0x300d4a0).accept peer addr is really 192.168.1.101:0/2977169062 (socket is 192.168.1.101:39423/0)
    -8> 2014-03-22 17:19:22.015812 7f0008177700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.101:0/2977169062 pipe(0x303b900 sd=21 :6802 s=0 pgs=0 cs=0 l=0 c=0x300d4a0).accept connect_seq 0 vs existing 1 state standby
    -7> 2014-03-22 17:19:22.015850 7f0008177700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.101:0/2977169062 pipe(0x303b900 sd=21 :6802 s=0 pgs=0 cs=0 l=0 c=0x300d4a0).accept peer reset, then tried to connect to us, replacing
    -6> 2014-03-22 17:19:22.015904 7f000b985700  5 mds.0.37 ms_handle_remote_reset on 192.168.1.101:0/2977169062
    -5> 2014-03-22 17:19:22.015916 7f000b985700  3 mds.0.37 ms_handle_remote_reset closing connection for session client.6920 192.168.1.101:0/2977169062
    -4> 2014-03-22 17:19:22.015975 7f000b985700  1 -- 192.168.1.20:6802/5456 mark_down 0x300cdc0 -- 0x303b900
    -3> 2014-03-22 17:19:22.016023 7f000b985700  5 mds.0.37 ms_handle_reset on 192.168.1.101:0/2977169062
    -2> 2014-03-22 17:19:22.016035 7f000b985700  3 mds.0.37 ms_handle_reset closing connection for session client.6920 192.168.1.101:0/2977169062
    -1> 2014-03-22 17:19:22.016045 7f000b985700  1 -- 192.168.1.20:6802/5456 mark_down 0x300d4a0 -- 0x303b900
     0> 2014-03-22 17:19:22.017243 7f000b985700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f000b985700

 ceph version 0.77 (1bca9c5c412b3af722d5250f07fd562a23cf35ff)
 1: /usr/bin/ceph-mds() [0x8baf22]
 2: (()+0xf880) [0x7f00100a0880]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
   0/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 keyvaluestore
   1/ 3 journal
   0/ 5 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.MDS1.1.log
--- end dump of recent events ---
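
For anyone reading a dump like the one above, filtering it down to the crashing thread can make the final sequence easier to follow. The following is a minimal Python sketch, not part of Ceph: the function names and the regular expression are hypothetical and assume only the line layout visible in this dump (index, timestamp, thread id, level, message). It keeps the events logged by one thread, here 7f000b985700, the thread named in the segfault report.

```python
import re

# Layout assumed from the dump above, for example:
#   "    -6> 2014-03-22 17:19:22.015904 7f000b985700  5 mds.0.37 ms_handle_remote_reset on ..."
EVENT_RE = re.compile(
    r"^\s*(?P<index>-?\d+)>\s+"
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>[0-9a-f]+)\s+"
    r"(?P<level>-?\d+)\s+"
    r"(?P<message>.*)$"
)


def parse_event(line):
    """Return a dict for one 'recent events' line, or None if it does not match."""
    match = EVENT_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()
    event["index"] = int(event["index"])
    event["level"] = int(event["level"])
    return event


def events_for_thread(lines, thread_id):
    """Keep only the events logged by the given thread, in dump order."""
    events = (parse_event(line) for line in lines)
    return [e for e in events if e is not None and e["thread"] == thread_id]


# Three lines copied from the dump above; only two belong to the crashing thread.
sample = [
    "    -7> 2014-03-22 17:19:22.015850 7f0008177700  0 -- 192.168.1.20:6802/5456 >> 192.168.1.101:0/2977169062 pipe(0x303b900 sd=21 :6802 s=0 pgs=0 cs=0 l=0 c=0x300d4a0).accept peer reset, then tried to connect to us, replacing",
    "    -6> 2014-03-22 17:19:22.015904 7f000b985700  5 mds.0.37 ms_handle_remote_reset on 192.168.1.101:0/2977169062",
    "     0> 2014-03-22 17:19:22.017243 7f000b985700 -1 *** Caught signal (Segmentation fault) **",
]

for event in events_for_thread(sample, "7f000b985700"):
    print(event["index"], event["stamp"], event["message"])
```

Run against the full log file, the same filter would show the ms_handle_remote_reset, ms_handle_reset and mark_down calls that the dispatch thread made immediately before the segfault.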

dmesg shows the following. No respawning.
[373738.961475] ceph-mds[21341]: segfault at 200 ip 00007f36c3d480b8 sp 00007f36c07d3520 error 4 in libgcc_s.so.1[7f36c3d39000+15000]

# ceph -s
    cluster 9b2c9bca-112e-48b0-86fc-587ef9a52948
     health HEALTH_WARN mds cluster is degraded
     monmap e1: 1 mons at {MDS1=192.168.1.20:6789/0}, election epoch 1, quorum 0 MDS1
     mdsmap e6977: 1/1/1 up {0=MDS1.2=up:replay}, 1 up:standby
     osdmap e821: 6 osds: 6 up, 6 in
      pgmap v97110: 192 pgs, 3 pools, 1542 GB data, 1453 kobjects
            3101 GB used, 8071 GB / 11172 GB avail
                 192 active+clean
# ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED 
    11172G     8071G     3101G        27.76     
POOLS:
    NAME         ID     USED      %USED     OBJECTS 
    data         0      1535G     13.74     1374040 
    metadata     1      561M      0         112765  
    rbd          2      6836M     0.06      1742    
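
The mdsmap line in the `ceph -s` output above carries the post-crash state: rank 0 is held by MDS1.2 in up:replay, and one more daemon is on standby. As a rough illustration, the Python sketch below pulls those fields out of that line. The helper name and the regular expressions are hypothetical and match only the text format printed by this cluster; other releases may print the summary differently.

```python
import re


def parse_mdsmap_summary(line):
    """Parse an mdsmap summary line in the format shown in the output above.

    Returns the epoch, a {rank: (daemon_name, state)} mapping and the
    number of standby daemons (0 if none is reported).
    """
    epoch = int(re.search(r"mdsmap e(\d+):", line).group(1))

    ranks = {}
    inside_braces = re.search(r"\{(.*?)\}", line)
    if inside_braces and inside_braces.group(1):
        for entry in inside_braces.group(1).split(","):
            rank, name, state = entry.strip().split("=")
            ranks[int(rank)] = (name, state)

    standby = re.search(r"(\d+) up:standby", line)
    return {
        "epoch": epoch,
        "ranks": ranks,
        "standby": int(standby.group(1)) if standby else 0,
    }


summary = parse_mdsmap_summary(
    "mdsmap e6977: 1/1/1 up {0=MDS1.2=up:replay}, 1 up:standby"
)
print(summary)

# Ranks that are not up:active, e.g. still in up:replay as here.
stuck = {rank: held for rank, held in summary["ranks"].items()
         if held[1] != "up:active"}
print("ranks not active:", stuck)
```

Polling this over time would show whether rank 0 ever leaves up:replay or stays stuck there.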

Regards,
Hong


From: Mohd Bazli Ab Karim <bazli.abkarim@xxxxxxxx>
To: hjcho616 <hjcho616@xxxxxxxxx>; Luke Jing Yuan <jyluke@xxxxxxxx>
Cc: "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Sent: Friday, March 21, 2014 3:16 AM
Subject: RE: MDS crash when client goes to sleep

Hi Hong,
 
How is the client now? Is it able to mount the filesystem? This looks similar to our case: http://www.spinics.net/lists/ceph-devel/msg18395.html
However, you need to collect some logs to confirm this.
 
Thanks.
 
 
From: hjcho616 [mailto:hjcho616@xxxxxxxxx]
Sent: Friday, March 21, 2014 2:30 PM
To: Luke Jing Yuan
Cc: Mohd Bazli Ab Karim; ceph-users@xxxxxxxxxxxxxx
Subject: Re: MDS crash when client goes to sleep
 
Luke,
 
I'm not sure what a flapping ceph-mds daemon means, but when I logged in to the MDS host after this happened, there was no longer any ceph-mds process if I had been running one daemon. When I ran three, one was left, but it wasn't doing much. I didn't record the logs, but the behavior was very similar in 0.72 (Emperor). I am using the Debian packages.
 
The client was asleep for a while (8+ hours). There was no I/O prior to the sleep; CephFS was simply still mounted.
 
Regards,
Hong
 

From: Luke Jing Yuan <jyluke@xxxxxxxx>
To: hjcho616 <hjcho616@xxxxxxxxx>
Cc: Mohd Bazli Ab Karim <bazli.abkarim@xxxxxxxx>; "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Sent: Friday, March 21, 2014 1:17 AM
Subject: RE: MDS crash when client goes to sleep

Hi Hong,

That's interesting. Mr. Bazli and I ended up with the MDS stuck in up:replay and a flapping ceph-mds daemon, but then again we are using version 0.72.2. That said, the trigger point seems similar to ours, namely the following line:

  -38> 2014-03-20 20:08:44.495565 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept we reset (peer sent cseq 2), sending RESETSESSION

So how long was your client asleep? Was there any I/O prior to the sleep?

Regards,
Luke

From: hjcho616 [mailto:hjcho616@xxxxxxxxx]
Sent: Friday, 21 March, 2014 12:09 PM
To: Luke Jing Yuan
Cc: Mohd Bazli Ab Karim; ceph-users@xxxxxxxxxxxxxx
Subject: Re: MDS crash when client goes to sleep

Nope, just these segfaults.

[149884.709608] ceph-mds[17366]: segfault at 200 ip 00007f09de9d60b8 sp 00007f09db461520 error 4 in libgcc_s.so.1[7f09de9c7000+15000]
[211263.265402] ceph-mds[17135]: segfault at 200 ip 00007f59eec280b8 sp 00007f59eb6b3520 error 4 in libgcc_s.so.1[7f59eec19000+15000]
[214638.927759] ceph-mds[16896]: segfault at 200 ip 00007fcb2c89e0b8 sp 00007fcb29329520 error 4 in libgcc_s.so.1[7fcb2c88f000+15000]
[289338.461271] ceph-mds[20878]: segfault at 200 ip 00007f4b7211c0b8 sp 00007f4b6eba7520 error 4 in libgcc_s.so.1[7f4b7210d000+15000]
[373738.961475] ceph-mds[21341]: segfault at 200 ip 00007f36c3d480b8 sp 00007f36c07d3520 error 4 in libgcc_s.so.1[7f36c3d39000+15000]
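
The five dmesg lines above can be compared mechanically. The sketch below is an illustrative Python snippet (the function name and regex are hypothetical, not from any Ceph or kernel tool) that parses each segfault line and subtracts the library's load base from the instruction pointer. It also decodes the x86 page-fault error code; error 4 is a user-mode read of a page that is not present, and with a fault address of 0x200 that looks like a read through a near-null pointer.

```python
import re

# Layout of the kernel's segfault message as it appears in the lines above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>\d+) "
    r"in (?P<object>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Extract the fields of one dmesg segfault line and compute the
    offset of the faulting instruction inside the mapped object."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a segfault line: %r" % line)
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    error = int(match.group("error"))
    return {
        "pid": int(match.group("pid")),
        "fault_addr": int(match.group("addr"), 16),
        "object": match.group("object"),
        "offset": ip - base,
        # x86 page-fault error code bits: 1 = protection fault (else page
        # not present), 2 = write (else read), 4 = user mode.
        "not_present": not (error & 1),
        "write": bool(error & 2),
        "user_mode": bool(error & 4),
    }


lines = [
    "[149884.709608] ceph-mds[17366]: segfault at 200 ip 00007f09de9d60b8 sp 00007f09db461520 error 4 in libgcc_s.so.1[7f09de9c7000+15000]",
    "[211263.265402] ceph-mds[17135]: segfault at 200 ip 00007f59eec280b8 sp 00007f59eb6b3520 error 4 in libgcc_s.so.1[7f59eec19000+15000]",
    "[214638.927759] ceph-mds[16896]: segfault at 200 ip 00007fcb2c89e0b8 sp 00007fcb29329520 error 4 in libgcc_s.so.1[7fcb2c88f000+15000]",
    "[289338.461271] ceph-mds[20878]: segfault at 200 ip 00007f4b7211c0b8 sp 00007f4b6eba7520 error 4 in libgcc_s.so.1[7f4b7210d000+15000]",
    "[373738.961475] ceph-mds[21341]: segfault at 200 ip 00007f36c3d480b8 sp 00007f36c07d3520 error 4 in libgcc_s.so.1[7f36c3d39000+15000]",
]

for line in lines:
    info = decode_segfault(line)
    print(info["pid"], info["object"], hex(info["offset"]), hex(info["fault_addr"]))
```

All five crashes resolve to the same offset inside libgcc_s.so.1, which suggests the same code path is hit each time rather than random memory corruption.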

Regards,
Hong

________________________________________
From: Luke Jing Yuan <jyluke@xxxxxxxx>
To: hjcho616 <hjcho616@xxxxxxxxx>
Cc: Mohd Bazli Ab Karim <bazli.abkarim@xxxxxxxx>; "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, March 20, 2014 10:53 PM
Subject: Re: MDS crash when client goes to sleep

Did you see any messages in dmesg about ceph-mds respawning, or anything like that?

Regards,
Luke

On Mar 21, 2014, at 11:09 AM, "hjcho616" <hjcho616@xxxxxxxxx> wrote:
On the client, I was no longer able to access the filesystem; it would hang, which makes sense since the MDS had crashed. I tried running three MDS daemons on the same machine. Two crashed and one appeared to be hung(?). ceph health reported the MDS cluster as degraded when that happened.

I was able to recover by restarting every node. I currently have three machines: one with the MDS and MON, and two with OSDs.

It fails every time my client machine goes to sleep. If you need me to run something, let me know what and how.

Regards,
Hong

________________________________________
From: Mohd Bazli Ab Karim <bazli.abkarim@xxxxxxxx>
To: hjcho616 <hjcho616@xxxxxxxxx>; "ceph-users@xxxxxxxxxxxxxx" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, March 20, 2014 9:40 PM
Subject: RE: MDS crash when client goes to sleep

Hi Hong,
May I know what happened to your MDS after it crashed? Was it able to recover from replay?
We are also facing this issue, and I am interested to know how to reproduce it.

Thanks.
Bazli

From: ceph-users-bounces@xxxxxxxxxxxxxx [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of hjcho616
Sent: Friday, March 21, 2014 10:29 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: MDS crash when client goes to sleep

When CephFS is mounted on a client and the client goes to sleep, the MDS segfaults. Has anyone seen this? Below is part of the MDS log. This happened in Emperor and in the recent 0.77 release. I am running Debian Wheezy with the 3.13 kernel from testing. What can I do to keep the whole system from crashing when a client goes to sleep (it looks like a disconnect may do the same)? Let me know if you need any more info.

Regards,
Hong

  -43> 2014-03-20 20:08:42.463357 7fee3f0cf700  1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2 -- ?+0 0x1ee9f080 con 0x2e56580
  -42> 2014-03-20 20:08:42.463787 7fee411d4700  1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21764 ==== mdsbeacon(6798/MDS1.2 up:active seq 21120 v6970) v2 ==== 108+0+0 (266728949 0 0) 0x1ee88dc0 con 0x2e56580
  -41> 2014-03-20 20:08:43.373099 7fee3f0cf700  2 mds.0.cache check_memory_usage total 665384, rss 503156, heap 24656, malloc 463874 mmap 0, baseline 16464, buffers 0, max 1048576, 0 / 62380 inodes have caps, 0 caps, 0 caps per inode
  -40> 2014-03-20 20:08:44.494963 7fee3d7c4700  1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept sd=18 192.168.1.101:52026/0
  -39> 2014-03-20 20:08:44.495033 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52026/0)
  -38> 2014-03-20 20:08:44.495565 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e2160).accept we reset (peer sent cseq 2), sending RESETSESSION
  -37> 2014-03-20 20:08:44.496015 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3f03b80 sd=18 :6801 s=4 pgs=0 cs=0 l=0 c=0x1f0e2160).fault 0: Success
  -36> 2014-03-20 20:08:44.496099 7fee411d4700  5 mds.0.35 ms_handle_reset on 192.168.1.101:0/2113152127
  -35> 2014-03-20 20:08:44.496120 7fee411d4700  3 mds.0.35 ms_handle_reset closing connection for session client.6019 192.168.1.101:0/2113152127
  -34> 2014-03-20 20:08:44.496207 7fee411d4700  1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e2160 -- pipe dne
  -33> 2014-03-20 20:08:44.653628 7fee3d7c4700  1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept sd=18 192.168.1.101:52027/0
  -32> 2014-03-20 20:08:44.653677 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e22c0).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52027/0)
  -31> 2014-03-20 20:08:44.925618 7fee411d4700  1 -- 192.168.1.20:6801/17079 <== client.6019 192.168.1.101:0/2113152127 1 ==== client_reconnect(77349 caps) v2 ==== 0+0+11032578 (0 0 3293767716) 0x2e92780 con 0x1f0e22c0
  -30> 2014-03-20 20:08:44.925682 7fee411d4700  1 mds.0.server  no longer in reconnect state, ignoring reconnect, sending close
  -29> 2014-03-20 20:08:44.925735 7fee411d4700  0 log [INF] : denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 2014-03-20 20:08:44.925679 (allowed interval 45)
  -28> 2014-03-20 20:08:44.925748 7fee411d4700  1 -- 192.168.1.20:6801/17079 --> 192.168.1.101:0/2113152127 -- client_session(close) v1 -- ?+0 0x3ea6540 con 0x1f0e22c0
  -27> 2014-03-20 20:08:44.927727 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).reader couldn't read tag, Success
  -26> 2014-03-20 20:08:44.927797 7fee3d7c4700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).fault 0: Success
  -25> 2014-03-20 20:08:44.927849 7fee3d7c4700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3d8e000 sd=18 :6801 s=2 pgs=135 cs=1 l=0 c=0x1f0e22c0).fault, server, going to standby
  -24> 2014-03-20 20:08:46.372279 7fee401d2700 10 monclient: tick
  -23> 2014-03-20 20:08:46.372339 7fee401d2700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2014-03-20 20:08:16.372333)
  -22> 2014-03-20 20:08:46.372373 7fee401d2700 10 monclient: renew subs? (now: 2014-03-20 20:08:46.372372; renew after: 2014-03-20 20:09:56.370811) -- no
  -21> 2014-03-20 20:08:46.372403 7fee401d2700 10  log_queue is 1 last_log 2 sent 1 num 1 unsent 1 sending 1
  -20> 2014-03-20 20:08:46.372421 7fee401d2700 10  will send 2014-03-20 20:08:44.925741 mds.0 192.168.1.20:6801/17079 2 : [INF] denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 2014-03-20 20:08:44.925679 (allowed interval 45)
  -19> 2014-03-20 20:08:46.372466 7fee401d2700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
  -18> 2014-03-20 20:08:46.372483 7fee401d2700  1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- log(1 entries) v1 -- ?+0 0x71d8d80 con 0x2e56580
  -17> 2014-03-20 20:08:46.463496 7fee3f0cf700 10 monclient: _send_mon_message to mon.MDS1 at 192.168.1.20:6789/0
  -16> 2014-03-20 20:08:46.463524 7fee3f0cf700  1 -- 192.168.1.20:6801/17079 --> 192.168.1.20:6789/0 -- mdsbeacon(6798/MDS1.2 up:active seq 21121 v6970) v2 -- ?+0 0x1ee91340 con 0x2e56580
  -15> 2014-03-20 20:08:46.499688 7fee411d4700  1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21765 ==== log(last 2) v1 ==== 24+0+0 (1174756693 0 0) 0x3ea7a40 con 0x2e56580
  -14> 2014-03-20 20:08:46.499751 7fee411d4700 10 handle_log_ack log(last 2) v1
  -13> 2014-03-20 20:08:46.499757 7fee411d4700 10  logged 2014-03-20 20:08:44.925741 mds.0 192.168.1.20:6801/17079 2 : [INF] denied reconnect attempt (mds is up:active) from client.6019 192.168.1.101:0/2113152127 after 2014-03-20 20:08:44.925679 (allowed interval 45)
  -12> 2014-03-20 20:08:46.499825 7fee411d4700  1 -- 192.168.1.20:6801/17079 <== mon.0 192.168.1.20:6789/0 21766 ==== mdsbeacon(6798/MDS1.2 up:active seq 21121 v6970) v2 ==== 108+0+0 (51773011 0 0) 0x1ee25600 con 0x2e56580
  -11> 2014-03-20 20:08:47.303323 7fee3ddca700  1 -- 192.168.1.20:6801/17079 >> :/0 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept sd=21 192.168.1.101:52029/0
  -10> 2014-03-20 20:08:47.308830 7fee3ddca700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept peer addr is really 192.168.1.101:0/2113152127 (socket is 192.168.1.101:52029/0)
    -9> 2014-03-20 20:08:47.309221 7fee3ddca700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept connect_seq 0 vs existing 1 state standby
    -8> 2014-03-20 20:08:47.309251 7fee3ddca700  0 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=0 pgs=0 cs=0 l=0 c=0x1f0e3080).accept peer reset, then tried to connect to us, replacing
    -7> 2014-03-20 20:08:47.309337 7fee411d4700  5 mds.0.35 ms_handle_remote_reset on 192.168.1.101:0/2113152127
    -6> 2014-03-20 20:08:47.309357 7fee411d4700  3 mds.0.35 ms_handle_remote_reset closing connection for session client.6019 192.168.1.101:0/2113152127
    -5> 2014-03-20 20:08:47.309385 7fee411d4700  1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e22c0 -- 0x3e92280
    -4> 2014-03-20 20:08:47.309478 7fee411d4700  5 mds.0.35 ms_handle_reset on 192.168.1.101:0/2113152127
    -3> 2014-03-20 20:08:47.309490 7fee411d4700  3 mds.0.35 ms_handle_reset closing connection for session client.6019 192.168.1.101:0/2113152127
    -2> 2014-03-20 20:08:47.309501 7fee411d4700  1 -- 192.168.1.20:6801/17079 mark_down 0x1f0e3080 -- 0x3e92280
    -1> 2014-03-20 20:08:47.309485 7fee3ddca700  2 -- 192.168.1.20:6801/17079 >> 192.168.1.101:0/2113152127 pipe(0x3e92280 sd=21 :6801 s=4 pgs=136 cs=1 l=0 c=0x1f0e22c0).accept read error on newly_acked_seq
    0> 2014-03-20 20:08:47.310724 7fee411d4700 -1 *** Caught signal (Segmentation fault) **
in thread 7fee411d4700

ceph version 0.77 (1bca9c5c412b3af722d5250f07fd562a23cf35ff)
1: /usr/bin/ceph-mds() [0x8baf22]
2: (()+0xf880) [0x7fee458ef880]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
  0/ 5 none
  0/ 1 lockdep
  0/ 1 context
  1/ 1 crush
  1/ 5 mds
  1/ 5 mds_balancer
  1/ 5 mds_locker
  1/ 5 mds_log
  1/ 5 mds_log_expire
  1/ 5 mds_migrator
  0/ 1 buffer
  0/ 1 timer
  0/ 1 filer
  0/ 1 striper
  0/ 1 objecter
  0/ 5 rados
  0/ 5 rbd
  0/ 5 journaler
  0/ 5 objectcacher
  0/ 5 client
  0/ 5 osd
  0/ 5 optracker
  0/ 5 objclass
  1/ 3 filestore
  1/ 3 keyvaluestore
  1/ 3 journal
  0/ 5 ms
  1/ 5 mon
  0/10 monc
  1/ 5 paxos
  0/ 5 tp
  1/ 5 auth
  1/ 5 crypto
  1/ 1 finisher
  1/ 5 heartbeatmap
  1/ 5 perfcounter
  1/ 5 rgw
  1/ 5 javaclient
  1/ 5 asok
  1/ 1 throttle
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent    10000
  max_new        1000
  log_file /var/log/ceph/ceph-mds.MDS1.2.log
--- end dump of recent events ---



________________________________________
DISCLAIMER:

This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any dealing, review, distribution, printing, copying or use of this e-mail is strictly prohibited. If you have received this email in error, please notify the sender immediately and delete the original message (including any attachments).

MIMOS Berhad is a research and development institution under the purview of the Malaysian Ministry of Science, Technology and Innovation. Opinions, conclusions and other information in this e-mail that do not relate to the official business of MIMOS Berhad and/or its subsidiaries shall be understood as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries and neither MIMOS Berhad nor its subsidiaries accepts responsibility for the same. All liability arising from or in connection with computer viruses and/or corrupted e-mails is excluded to the fullest extent permitted by law.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
