It seems you have multiple active MDS daemons. Multiple active MDS is not yet stable; please use a single active MDS.
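
For example, to drop back to a single active MDS, something like the following should work. This is only a sketch: it assumes your filesystem is named "cephfs" and uses the Jewel-era commands ("ceph mds set" / "ceph mds deactivate"); on later releases the equivalent is "ceph fs set <fs_name> max_mds 1".

  # Check how many ranks are active and which daemons hold them.
  ceph mds stat
  ceph fs get cephfs | grep max_mds

  # Allow only one active MDS from now on.
  ceph mds set max_mds 1

  # If a second rank (rank 1) is still active, stop it so its daemon
  # returns to standby.
  ceph mds deactivate 1

After this, one daemon should stay up:active and the other should remain a standby, taking over only on failover.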

On Tue, Aug 16, 2016 at 8:10 PM, yu2xiangyang <yu2xiangyang@xxxxxxx> wrote:
> I have found the MDS restarting several times, failing over between two MDS processes in ACTIVE and BACKUP mode, when I run smallfile to create lots of files (3 clients, each with 8 threads creating 10000 files). Has anyone encountered the same problem? Is there any configuration I can set? Thank you for any reply.
>
> Here is one of the MDS logs.
> 2016-08-16 19:53:43.246001 7f90e4864180 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 21852
> 2016-08-16 19:53:43.246494 7f90e4864180 -1 deprecation warning: MDS id 'mds.1' is invalid and will be forbidden in a future version. MDS names may not start with a numeric digit.
> 2016-08-16 19:53:43.248084 7f90e4864180 0 pidfile_write: ignore empty --pid-file
> 2016-08-16 19:53:44.369886 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:45.719945 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:46.812074 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:48.412859 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:51.967246 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:53.163012 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:56.930083 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:05.376155 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:09.801776 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:13.442563 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:17.019500 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:17.220698 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:17.220704 7f90dea57700 1 mds.0.137 handle_mds_map state change up:boot --> up:replay
> 2016-08-16 19:54:17.220718 7f90dea57700 1 mds.0.137 replay_start
> 2016-08-16 19:54:17.220728 7f90dea57700 1 mds.0.137 recovery set is
> 2016-08-16 19:54:17.220734 7f90dea57700 1 mds.0.137 waiting for osdmap 51053 (which blacklists prior instance)
> 2016-08-16 19:54:17.291291 7f90d974a700 0 mds.0.cache creating system inode with ino:100
> 2016-08-16 19:54:17.291548 7f90d974a700 0 mds.0.cache creating system inode with ino:1
> 2016-08-16 19:54:18.871153 7f90d7b3c700 1 mds.0.137 replay_done
> 2016-08-16 19:54:18.871166 7f90d7b3c700 1 mds.0.137 making mds journal writeable
> 2016-08-16 19:54:19.710851 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:19.710860 7f90dea57700 1 mds.0.137 handle_mds_map state change up:replay --> up:reconnect
> 2016-08-16 19:54:19.710874 7f90dea57700 1 mds.0.137 reconnect_start
> 2016-08-16 19:54:19.710877 7f90dea57700 1 mds.0.137 reopen_log
> 2016-08-16 19:54:19.710912 7f90dea57700 1 mds.0.server reconnect_clients -- 5 sessions
> 2016-08-16 19:54:19.711646 7f90d6931700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc6780).accept peer addr is really 192.168.5.9:0/2954821946 (socket is 192.168.5.9:51609/0)
> 2016-08-16 19:54:19.712664 7f90d652d700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.13:0/3688491801 pipe(0x7f90f02ac800 sd=63 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc6a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket is 192.168.5.13:57657/0)
> 2016-08-16 19:54:19.713002 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25434663 192.168.5.13:0/643433156 after 0.002023
> 2016-08-16 19:54:19.725704 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.014790
> 2016-08-16 19:54:19.728322 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.017410
> 2016-08-16 19:54:19.734812 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.023899
> 2016-08-16 19:54:19.740344 7f90d6129700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.8:0/1814981959 pipe(0x7f90f03a3400 sd=65 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc7c80).accept peer addr is really 192.168.5.8:0/1814981959 (socket is 192.168.5.8:46034/0)
> 2016-08-16 19:54:19.746170 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.035255
> 2016-08-16 19:54:19.746722 7f90dea57700 1 mds.0.137 reconnect_done
> 2016-08-16 19:54:20.860114 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:20.860123 7f90dea57700 1 mds.0.137 handle_mds_map state change up:reconnect --> up:rejoin
> 2016-08-16 19:54:20.860138 7f90dea57700 1 mds.0.137 rejoin_start
> 2016-08-16 19:54:20.870836 7f90dea57700 1 mds.0.137 rejoin_joint_start
> 2016-08-16 19:54:21.115345 7f90da14d700 1 mds.0.137 rejoin_done
> 2016-08-16 19:54:21.995720 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:21.995727 7f90dea57700 1 mds.0.137 handle_mds_map state change up:rejoin --> up:clientreplay
> 2016-08-16 19:54:21.995739 7f90dea57700 1 mds.0.137 recovery_done -- successful recovery!
> 2016-08-16 19:54:21.997568 7f90dea57700 1 mds.0.137 clientreplay_start
> 2016-08-16 19:54:23.479318 7f90da14d700 1 mds.0.137 clientreplay_done
> 2016-08-16 19:54:24.897786 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:24.897791 7f90dea57700 1 mds.0.137 handle_mds_map state change up:clientreplay --> up:active
> 2016-08-16 19:54:24.897800 7f90dea57700 1 mds.0.137 active_start
> 2016-08-16 19:54:24.958018 7f90dea57700 1 mds.0.137 cluster recovered.
> 2016-08-16 19:54:56.277730 7f90d6931700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=2 pgs=886 cs=1 l=0 c=0x7f90efbc6780).fault with nothing to send, going to standby
> 2016-08-16 19:54:56.277851 7f90dea57700 1 mds.1 handle_mds_map i (192.168.5.12:6817/21852) dne in the mdsmap, respawning myself
> 2016-08-16 19:54:56.277855 7f90dea57700 1 mds.1 respawn
> 2016-08-16 19:54:56.277857 7f90dea57700 1 mds.1 e: 'ceph-mds'
> 2016-08-16 19:54:56.277859 7f90dea57700 1 mds.1 0: 'ceph-mds'
> 2016-08-16 19:54:56.277860 7f90dea57700 1 mds.1 1: '-i'
> 2016-08-16 19:54:56.277861 7f90dea57700 1 mds.1 2: '1'
> 2016-08-16 19:54:56.277889 7f90dea57700 1 mds.1 exe_path /usr/bin/ceph-mds
> 2016-08-16 19:54:56.321739 7f338352d180 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 6850
> 2016-08-16 19:54:56.322227 7f338352d180 -1 deprecation warning: MDS id 'mds.1' is invalid and will be forbidden in a future version. MDS names may not start with a numeric digit.
> 2016-08-16 19:54:56.322975 7f338352d180 0 pidfile_write: ignore empty --pid-file
> 2016-08-16 19:54:57.518515 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:58.626781 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:59.751755 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:07.036848 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:11.520076 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:15.079355 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:18.429603 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:18.429609 7f337d720700 1 mds.0.139 handle_mds_map state change up:boot --> up:replay
> 2016-08-16 19:55:18.429630 7f337d720700 1 mds.0.139 replay_start
> 2016-08-16 19:55:18.429638 7f337d720700 1 mds.0.139 recovery set is
> 2016-08-16 19:55:18.429657 7f337d720700 1 mds.0.139 waiting for osdmap 51055 (which blacklists prior instance)
> 2016-08-16 19:55:18.435218 7f3378e16700 0 mds.0.cache creating system inode with ino:100
> 2016-08-16 19:55:18.435712 7f3378e16700 0 mds.0.cache creating system inode with ino:1
> 2016-08-16 19:55:18.813391 7f3376a07700 1 mds.0.139 replay_done
> 2016-08-16 19:55:18.813425 7f3376a07700 1 mds.0.139 making mds journal writeable
> 2016-08-16 19:55:19.726668 7f33759fe700 0 -- 192.168.5.12:6818/6850 >> 192.168.5.9:0/2954821946 pipe(0x7f3391d34800 sd=91 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338df37900).accept peer addr is really 192.168.5.9:0/2954821946 (socket is 192.168.5.9:56365/0)
> 2016-08-16 19:55:19.727051 7f33758fd700 0 -- 192.168.5.12:6818/6850 >> 192.168.5.13:0/3688491801 pipe(0x7f3391d32000 sd=92 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338df37a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket is 192.168.5.13:59989/0)
> 2016-08-16 19:55:19.733015 7f33756fb700 0 -- 192.168.5.12:6818/6850 >> 192.168.5.8:0/1814981959 pipe(0x7f338f5c0000 sd=93 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338f5bc180).accept peer addr is really 192.168.5.8:0/1814981959 (socket is 192.168.5.8:39504/0)
> 2016-08-16 19:55:19.745040 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:19.745045 7f337d720700 1 mds.0.139 handle_mds_map state change up:replay --> up:reconnect
> 2016-08-16 19:55:19.745056 7f337d720700 1 mds.0.139 reconnect_start
> 2016-08-16 19:55:19.745058 7f337d720700 1 mds.0.139 reopen_log
> 2016-08-16 19:55:19.745069 7f337d720700 1 mds.0.server reconnect_clients -- 5 sessions
> 2016-08-16 19:55:19.745137 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.000008
> 2016-08-16 19:55:19.745279 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.000175
> 2016-08-16 19:55:19.745338 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25434663 192.168.5.13:0/643433156 after 0.000228
> 2016-08-16 19:55:19.745390 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.000292
> 2016-08-16 19:55:19.753707 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.008613
> 2016-08-16 19:55:19.753720 7f337d720700 1 mds.0.139 reconnect_done
> 2016-08-16 19:55:20.936637 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:20.936644 7f337d720700 1 mds.0.139 handle_mds_map state change up:reconnect --> up:rejoin
> 2016-08-16 19:55:20.936655 7f337d720700 1 mds.0.139 rejoin_start
> 2016-08-16 19:55:20.939400 7f337d720700 1 mds.0.139 rejoin_joint_start
> 2016-08-16 19:55:20.967137 7f337d720700 1 mds.0.139 rejoin_done
> 2016-08-16 19:55:22.095034 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:22.095044 7f337d720700 1 mds.0.139 handle_mds_map state change up:rejoin --> up:active
> 2016-08-16 19:55:22.095062 7f337d720700 1 mds.0.139 recovery_done -- successful recovery!
> 2016-08-16 19:55:22.095266 7f337d720700 1 mds.0.139 active_start
> 2016-08-16 19:55:22.099007 7f337d720700 1 mds.0.139 cluster recovered.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com