MDS restarts when creating millions of files with the smallfile tool

I have seen the MDS restart several times, failing over back and forth between two MDS daemons (one active, one standby), while running smallfile to create a large number of files (3 clients, each with 8 threads creating 10000 files). Has anyone encountered the same problem? Is there any configuration I can set? Thank you for any reply.
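For reference, the workload on each client is generated with an invocation roughly along these lines (the mount point and file size below are placeholders for my setup, not necessarily the exact values I used):

python smallfile_cli.py --operation create --threads 8 --files 10000 --file-size 4 --top /mnt/cephfs/smallfile_test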

Here is one of the MDS logs.
2016-08-16 19:53:43.246001 7f90e4864180  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 21852
2016-08-16 19:53:43.246494 7f90e4864180 -1 deprecation warning: MDS id 'mds.1' is invalid and will be forbidden in a future version.  MDS names may not start with a numeric digit.
2016-08-16 19:53:43.248084 7f90e4864180  0 pidfile_write: ignore empty --pid-file
2016-08-16 19:53:44.369886 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:45.719945 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:46.812074 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:48.412859 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:51.967246 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:53.163012 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:53:56.930083 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:05.376155 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:09.801776 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:13.442563 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:17.019500 7f90dea57700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:17.220698 7f90dea57700  1 mds.0.137 handle_mds_map i am now mds.0.137
2016-08-16 19:54:17.220704 7f90dea57700  1 mds.0.137 handle_mds_map state change up:boot --> up:replay
2016-08-16 19:54:17.220718 7f90dea57700  1 mds.0.137 replay_start
2016-08-16 19:54:17.220728 7f90dea57700  1 mds.0.137  recovery set is
2016-08-16 19:54:17.220734 7f90dea57700  1 mds.0.137  waiting for osdmap 51053 (which blacklists prior instance)
2016-08-16 19:54:17.291291 7f90d974a700  0 mds.0.cache creating system inode with ino:100
2016-08-16 19:54:17.291548 7f90d974a700  0 mds.0.cache creating system inode with ino:1
2016-08-16 19:54:18.871153 7f90d7b3c700  1 mds.0.137 replay_done
2016-08-16 19:54:18.871166 7f90d7b3c700  1 mds.0.137 making mds journal writeable
2016-08-16 19:54:19.710851 7f90dea57700  1 mds.0.137 handle_mds_map i am now mds.0.137
2016-08-16 19:54:19.710860 7f90dea57700  1 mds.0.137 handle_mds_map state change up:replay --> up:reconnect
2016-08-16 19:54:19.710874 7f90dea57700  1 mds.0.137 reconnect_start
2016-08-16 19:54:19.710877 7f90dea57700  1 mds.0.137 reopen_log
2016-08-16 19:54:19.710912 7f90dea57700  1 mds.0.server reconnect_clients -- 5 sessions
2016-08-16 19:54:19.711646 7f90d6931700  0 -- 192.168.5.12:6817/21852 >> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc6780).accept peer addr is really 192.168.5.9:0/2954821946 (socket is 192.168.5.9:51609/0)
2016-08-16 19:54:19.712664 7f90d652d700  0 -- 192.168.5.12:6817/21852 >> 192.168.5.13:0/3688491801 pipe(0x7f90f02ac800 sd=63 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc6a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket is 192.168.5.13:57657/0)
2016-08-16 19:54:19.713002 7f90dea57700  0 log_channel(cluster) log [DBG] : reconnect by client.25434663 192.168.5.13:0/643433156 after 0.002023
2016-08-16 19:54:19.725704 7f90dea57700  0 log_channel(cluster) log [DBG] : reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.014790
2016-08-16 19:54:19.728322 7f90dea57700  0 log_channel(cluster) log [DBG] : reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.017410
2016-08-16 19:54:19.734812 7f90dea57700  0 log_channel(cluster) log [DBG] : reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.023899
2016-08-16 19:54:19.740344 7f90d6129700  0 -- 192.168.5.12:6817/21852 >> 192.168.5.8:0/1814981959 pipe(0x7f90f03a3400 sd=65 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc7c80).accept peer addr is really 192.168.5.8:0/1814981959 (socket is 192.168.5.8:46034/0)
2016-08-16 19:54:19.746170 7f90dea57700  0 log_channel(cluster) log [DBG] : reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.035255
2016-08-16 19:54:19.746722 7f90dea57700  1 mds.0.137 reconnect_done
2016-08-16 19:54:20.860114 7f90dea57700  1 mds.0.137 handle_mds_map i am now mds.0.137
2016-08-16 19:54:20.860123 7f90dea57700  1 mds.0.137 handle_mds_map state change up:reconnect --> up:rejoin
2016-08-16 19:54:20.860138 7f90dea57700  1 mds.0.137 rejoin_start
2016-08-16 19:54:20.870836 7f90dea57700  1 mds.0.137 rejoin_joint_start
2016-08-16 19:54:21.115345 7f90da14d700  1 mds.0.137 rejoin_done
2016-08-16 19:54:21.995720 7f90dea57700  1 mds.0.137 handle_mds_map i am now mds.0.137
2016-08-16 19:54:21.995727 7f90dea57700  1 mds.0.137 handle_mds_map state change up:rejoin --> up:clientreplay
2016-08-16 19:54:21.995739 7f90dea57700  1 mds.0.137 recovery_done -- successful recovery!
2016-08-16 19:54:21.997568 7f90dea57700  1 mds.0.137 clientreplay_start
2016-08-16 19:54:23.479318 7f90da14d700  1 mds.0.137 clientreplay_done
2016-08-16 19:54:24.897786 7f90dea57700  1 mds.0.137 handle_mds_map i am now mds.0.137
2016-08-16 19:54:24.897791 7f90dea57700  1 mds.0.137 handle_mds_map state change up:clientreplay --> up:active
2016-08-16 19:54:24.897800 7f90dea57700  1 mds.0.137 active_start
2016-08-16 19:54:24.958018 7f90dea57700  1 mds.0.137 cluster recovered.
2016-08-16 19:54:56.277730 7f90d6931700  0 -- 192.168.5.12:6817/21852 >> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=2 pgs=886 cs=1 l=0 c=0x7f90efbc6780).fault with nothing to send, going to standby
2016-08-16 19:54:56.277851 7f90dea57700  1 mds.1 handle_mds_map i (192.168.5.12:6817/21852) dne in the mdsmap, respawning myself
2016-08-16 19:54:56.277855 7f90dea57700  1 mds.1 respawn
2016-08-16 19:54:56.277857 7f90dea57700  1 mds.1  e: 'ceph-mds'
2016-08-16 19:54:56.277859 7f90dea57700  1 mds.1  0: 'ceph-mds'
2016-08-16 19:54:56.277860 7f90dea57700  1 mds.1  1: '-i'
2016-08-16 19:54:56.277861 7f90dea57700  1 mds.1  2: '1'
2016-08-16 19:54:56.277889 7f90dea57700  1 mds.1  exe_path /usr/bin/ceph-mds
2016-08-16 19:54:56.321739 7f338352d180  0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 6850
2016-08-16 19:54:56.322227 7f338352d180 -1 deprecation warning: MDS id 'mds.1' is invalid and will be forbidden in a future version.  MDS names may not start with a numeric digit.
2016-08-16 19:54:56.322975 7f338352d180  0 pidfile_write: ignore empty --pid-file
2016-08-16 19:54:57.518515 7f337d720700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:58.626781 7f337d720700  1 mds.1 handle_mds_map standby
2016-08-16 19:54:59.751755 7f337d720700  1 mds.1 handle_mds_map standby
2016-08-16 19:55:07.036848 7f337d720700  1 mds.1 handle_mds_map standby
2016-08-16 19:55:11.520076 7f337d720700  1 mds.1 handle_mds_map standby
2016-08-16 19:55:15.079355 7f337d720700  1 mds.1 handle_mds_map standby
2016-08-16 19:55:18.429603 7f337d720700  1 mds.0.139 handle_mds_map i am now mds.0.139
2016-08-16 19:55:18.429609 7f337d720700  1 mds.0.139 handle_mds_map state change up:boot --> up:replay
2016-08-16 19:55:18.429630 7f337d720700  1 mds.0.139 replay_start
2016-08-16 19:55:18.429638 7f337d720700  1 mds.0.139  recovery set is
2016-08-16 19:55:18.429657 7f337d720700  1 mds.0.139  waiting for osdmap 51055 (which blacklists prior instance)
2016-08-16 19:55:18.435218 7f3378e16700  0 mds.0.cache creating system inode with ino:100
2016-08-16 19:55:18.435712 7f3378e16700  0 mds.0.cache creating system inode with ino:1
2016-08-16 19:55:18.813391 7f3376a07700  1 mds.0.139 replay_done
2016-08-16 19:55:18.813425 7f3376a07700  1 mds.0.139 making mds journal writeable
2016-08-16 19:55:19.726668 7f33759fe700  0 -- 192.168.5.12:6818/6850 >> 192.168.5.9:0/2954821946 pipe(0x7f3391d34800 sd=91 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338df37900).accept peer addr is really 192.168.5.9:0/2954821946 (socket is 192.168.5.9:56365/0)
2016-08-16 19:55:19.727051 7f33758fd700  0 -- 192.168.5.12:6818/6850 >> 192.168.5.13:0/3688491801 pipe(0x7f3391d32000 sd=92 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338df37a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket is 192.168.5.13:59989/0)
2016-08-16 19:55:19.733015 7f33756fb700  0 -- 192.168.5.12:6818/6850 >> 192.168.5.8:0/1814981959 pipe(0x7f338f5c0000 sd=93 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338f5bc180).accept peer addr is really 192.168.5.8:0/1814981959 (socket is 192.168.5.8:39504/0)
2016-08-16 19:55:19.745040 7f337d720700  1 mds.0.139 handle_mds_map i am now mds.0.139
2016-08-16 19:55:19.745045 7f337d720700  1 mds.0.139 handle_mds_map state change up:replay --> up:reconnect
2016-08-16 19:55:19.745056 7f337d720700  1 mds.0.139 reconnect_start
2016-08-16 19:55:19.745058 7f337d720700  1 mds.0.139 reopen_log
2016-08-16 19:55:19.745069 7f337d720700  1 mds.0.server reconnect_clients -- 5 sessions
2016-08-16 19:55:19.745137 7f337d720700  0 log_channel(cluster) log [DBG] : reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.000008
2016-08-16 19:55:19.745279 7f337d720700  0 log_channel(cluster) log [DBG] : reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.000175
2016-08-16 19:55:19.745338 7f337d720700  0 log_channel(cluster) log [DBG] : reconnect by client.25434663 192.168.5.13:0/643433156 after 0.000228
2016-08-16 19:55:19.745390 7f337d720700  0 log_channel(cluster) log [DBG] : reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.000292
2016-08-16 19:55:19.753707 7f337d720700  0 log_channel(cluster) log [DBG] : reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.008613
2016-08-16 19:55:19.753720 7f337d720700  1 mds.0.139 reconnect_done
2016-08-16 19:55:20.936637 7f337d720700  1 mds.0.139 handle_mds_map i am now mds.0.139
2016-08-16 19:55:20.936644 7f337d720700  1 mds.0.139 handle_mds_map state change up:reconnect --> up:rejoin
2016-08-16 19:55:20.936655 7f337d720700  1 mds.0.139 rejoin_start
2016-08-16 19:55:20.939400 7f337d720700  1 mds.0.139 rejoin_joint_start
2016-08-16 19:55:20.967137 7f337d720700  1 mds.0.139 rejoin_done
2016-08-16 19:55:22.095034 7f337d720700  1 mds.0.139 handle_mds_map i am now mds.0.139
2016-08-16 19:55:22.095044 7f337d720700  1 mds.0.139 handle_mds_map state change up:rejoin --> up:active
2016-08-16 19:55:22.095062 7f337d720700  1 mds.0.139 recovery_done -- successful recovery!
2016-08-16 19:55:22.095266 7f337d720700  1 mds.0.139 active_start
2016-08-16 19:55:22.099007 7f337d720700  1 mds.0.139 cluster recovered.
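The line that stands out to me is "i (192.168.5.12:6817/21852) dne in the mdsmap, respawning myself", which I understand to mean the monitors dropped this daemon from the MDS map (presumably after it missed beacons under the load) and the standby took over. The kind of setting I was wondering about is the beacon grace, something like the snippet below, but I have not confirmed this is the right knob or a sensible value:

[mds]
    mds beacon interval = 4
    mds beacon grace = 60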


 

