It seems you have multiple active MDS daemons. Multiple active MDS is not yet stable; please use a single active MDS.
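
For example, to drop back to a single active MDS, something like the following should work. This is only a sketch: it assumes your filesystem is named "cephfs" and uses the Jewel-era commands ("ceph mds set" / "ceph mds deactivate"); on later releases the equivalent is "ceph fs set <fs_name> max_mds 1".

  # Check how many ranks are active and which daemons hold them.
  ceph mds stat
  ceph fs get cephfs | grep max_mds

  # Allow only one active MDS from now on.
  ceph mds set max_mds 1

  # If a second rank (rank 1) is still active, stop it so its daemon
  # returns to standby.
  ceph mds deactivate 1

After this, one daemon should stay up:active and the other should remain a standby, taking over only on failover.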

On Tue, Aug 16, 2016 at 8:10 PM, yu2xiangyang <yu2xiangyang@xxxxxxx> wrote:
> I have found the MDS restarting several times, failing over between two MDS processes in ACTIVE and BACKUP mode, when I run smallfile to create lots of files (3 clients, each with 8 threads creating 10000 files). Has anyone encountered the same problem? Is there any configuration I can set? Thank you for any reply.
>
> Here is one of the MDS logs.
> 2016-08-16 19:53:43.246001 7f90e4864180 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 21852
> 2016-08-16 19:53:43.246494 7f90e4864180 -1 deprecation warning: MDS id 'mds.1' is invalid and will be forbidden in a future version. MDS names may not start with a numeric digit.
> 2016-08-16 19:53:43.248084 7f90e4864180 0 pidfile_write: ignore empty --pid-file
> 2016-08-16 19:53:44.369886 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:45.719945 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:46.812074 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:48.412859 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:51.967246 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:53.163012 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:53:56.930083 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:05.376155 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:09.801776 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:13.442563 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:17.019500 7f90dea57700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:17.220698 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:17.220704 7f90dea57700 1 mds.0.137 handle_mds_map state change up:boot --> up:replay
> 2016-08-16 19:54:17.220718 7f90dea57700 1 mds.0.137 replay_start
> 2016-08-16 19:54:17.220728 7f90dea57700 1 mds.0.137 recovery set is
> 2016-08-16 19:54:17.220734 7f90dea57700 1 mds.0.137 waiting for osdmap 51053 (which blacklists prior instance)
> 2016-08-16 19:54:17.291291 7f90d974a700 0 mds.0.cache creating system inode with ino:100
> 2016-08-16 19:54:17.291548 7f90d974a700 0 mds.0.cache creating system inode with ino:1
> 2016-08-16 19:54:18.871153 7f90d7b3c700 1 mds.0.137 replay_done
> 2016-08-16 19:54:18.871166 7f90d7b3c700 1 mds.0.137 making mds journal writeable
> 2016-08-16 19:54:19.710851 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:19.710860 7f90dea57700 1 mds.0.137 handle_mds_map state change up:replay --> up:reconnect
> 2016-08-16 19:54:19.710874 7f90dea57700 1 mds.0.137 reconnect_start
> 2016-08-16 19:54:19.710877 7f90dea57700 1 mds.0.137 reopen_log
> 2016-08-16 19:54:19.710912 7f90dea57700 1 mds.0.server reconnect_clients -- 5 sessions
> 2016-08-16 19:54:19.711646 7f90d6931700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc6780).accept peer addr is really 192.168.5.9:0/2954821946 (socket is 192.168.5.9:51609/0)
> 2016-08-16 19:54:19.712664 7f90d652d700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.13:0/3688491801 pipe(0x7f90f02ac800 sd=63 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc6a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket is 192.168.5.13:57657/0)
> 2016-08-16 19:54:19.713002 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25434663 192.168.5.13:0/643433156 after 0.002023
> 2016-08-16 19:54:19.725704 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.014790
> 2016-08-16 19:54:19.728322 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.017410
> 2016-08-16 19:54:19.734812 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.023899
> 2016-08-16 19:54:19.740344 7f90d6129700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.8:0/1814981959 pipe(0x7f90f03a3400 sd=65 :6817 s=0 pgs=0 cs=0 l=0 c=0x7f90efbc7c80).accept peer addr is really 192.168.5.8:0/1814981959 (socket is 192.168.5.8:46034/0)
> 2016-08-16 19:54:19.746170 7f90dea57700 0 log_channel(cluster) log [DBG] : reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.035255
> 2016-08-16 19:54:19.746722 7f90dea57700 1 mds.0.137 reconnect_done
> 2016-08-16 19:54:20.860114 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:20.860123 7f90dea57700 1 mds.0.137 handle_mds_map state change up:reconnect --> up:rejoin
> 2016-08-16 19:54:20.860138 7f90dea57700 1 mds.0.137 rejoin_start
> 2016-08-16 19:54:20.870836 7f90dea57700 1 mds.0.137 rejoin_joint_start
> 2016-08-16 19:54:21.115345 7f90da14d700 1 mds.0.137 rejoin_done
> 2016-08-16 19:54:21.995720 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:21.995727 7f90dea57700 1 mds.0.137 handle_mds_map state change up:rejoin --> up:clientreplay
> 2016-08-16 19:54:21.995739 7f90dea57700 1 mds.0.137 recovery_done -- successful recovery!
> 2016-08-16 19:54:21.997568 7f90dea57700 1 mds.0.137 clientreplay_start
> 2016-08-16 19:54:23.479318 7f90da14d700 1 mds.0.137 clientreplay_done
> 2016-08-16 19:54:24.897786 7f90dea57700 1 mds.0.137 handle_mds_map i am now mds.0.137
> 2016-08-16 19:54:24.897791 7f90dea57700 1 mds.0.137 handle_mds_map state change up:clientreplay --> up:active
> 2016-08-16 19:54:24.897800 7f90dea57700 1 mds.0.137 active_start
> 2016-08-16 19:54:24.958018 7f90dea57700 1 mds.0.137 cluster recovered.
> 2016-08-16 19:54:56.277730 7f90d6931700 0 -- 192.168.5.12:6817/21852 >> 192.168.5.9:0/2954821946 pipe(0x7f90f02aa000 sd=61 :6817 s=2 pgs=886 cs=1 l=0 c=0x7f90efbc6780).fault with nothing to send, going to standby
> 2016-08-16 19:54:56.277851 7f90dea57700 1 mds.1 handle_mds_map i (192.168.5.12:6817/21852) dne in the mdsmap, respawning myself
> 2016-08-16 19:54:56.277855 7f90dea57700 1 mds.1 respawn
> 2016-08-16 19:54:56.277857 7f90dea57700 1 mds.1 e: 'ceph-mds'
> 2016-08-16 19:54:56.277859 7f90dea57700 1 mds.1 0: 'ceph-mds'
> 2016-08-16 19:54:56.277860 7f90dea57700 1 mds.1 1: '-i'
> 2016-08-16 19:54:56.277861 7f90dea57700 1 mds.1 2: '1'
> 2016-08-16 19:54:56.277889 7f90dea57700 1 mds.1 exe_path /usr/bin/ceph-mds
> 2016-08-16 19:54:56.321739 7f338352d180 0 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374), process ceph-mds, pid 6850
> 2016-08-16 19:54:56.322227 7f338352d180 -1 deprecation warning: MDS id 'mds.1' is invalid and will be forbidden in a future version. MDS names may not start with a numeric digit.
> 2016-08-16 19:54:56.322975 7f338352d180 0 pidfile_write: ignore empty --pid-file
> 2016-08-16 19:54:57.518515 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:58.626781 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:54:59.751755 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:07.036848 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:11.520076 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:15.079355 7f337d720700 1 mds.1 handle_mds_map standby
> 2016-08-16 19:55:18.429603 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:18.429609 7f337d720700 1 mds.0.139 handle_mds_map state change up:boot --> up:replay
> 2016-08-16 19:55:18.429630 7f337d720700 1 mds.0.139 replay_start
> 2016-08-16 19:55:18.429638 7f337d720700 1 mds.0.139 recovery set is
> 2016-08-16 19:55:18.429657 7f337d720700 1 mds.0.139 waiting for osdmap 51055 (which blacklists prior instance)
> 2016-08-16 19:55:18.435218 7f3378e16700 0 mds.0.cache creating system inode with ino:100
> 2016-08-16 19:55:18.435712 7f3378e16700 0 mds.0.cache creating system inode with ino:1
> 2016-08-16 19:55:18.813391 7f3376a07700 1 mds.0.139 replay_done
> 2016-08-16 19:55:18.813425 7f3376a07700 1 mds.0.139 making mds journal writeable
> 2016-08-16 19:55:19.726668 7f33759fe700 0 -- 192.168.5.12:6818/6850 >> 192.168.5.9:0/2954821946 pipe(0x7f3391d34800 sd=91 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338df37900).accept peer addr is really 192.168.5.9:0/2954821946 (socket is 192.168.5.9:56365/0)
> 2016-08-16 19:55:19.727051 7f33758fd700 0 -- 192.168.5.12:6818/6850 >> 192.168.5.13:0/3688491801 pipe(0x7f3391d32000 sd=92 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338df37a80).accept peer addr is really 192.168.5.13:0/3688491801 (socket is 192.168.5.13:59989/0)
> 2016-08-16 19:55:19.733015 7f33756fb700 0 -- 192.168.5.12:6818/6850 >> 192.168.5.8:0/1814981959 pipe(0x7f338f5c0000 sd=93 :6818 s=0 pgs=0 cs=0 l=0 c=0x7f338f5bc180).accept peer addr is really 192.168.5.8:0/1814981959 (socket is 192.168.5.8:39504/0)
> 2016-08-16 19:55:19.745040 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:19.745045 7f337d720700 1 mds.0.139 handle_mds_map state change up:replay --> up:reconnect
> 2016-08-16 19:55:19.745056 7f337d720700 1 mds.0.139 reconnect_start
> 2016-08-16 19:55:19.745058 7f337d720700 1 mds.0.139 reopen_log
> 2016-08-16 19:55:19.745069 7f337d720700 1 mds.0.server reconnect_clients -- 5 sessions
> 2016-08-16 19:55:19.745137 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25421481 192.168.5.9:0/2954821946 after 0.000008
> 2016-08-16 19:55:19.745279 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25434981 192.168.5.13:0/3688491801 after 0.000175
> 2016-08-16 19:55:19.745338 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25434663 192.168.5.13:0/643433156 after 0.000228
> 2016-08-16 19:55:19.745390 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.25434930 192.168.5.8:0/1814981959 after 0.000292
> 2016-08-16 19:55:19.753707 7f337d720700 0 log_channel(cluster) log [DBG] : reconnect by client.23765175 192.168.5.9:0/2024125279 after 0.008613
> 2016-08-16 19:55:19.753720 7f337d720700 1 mds.0.139 reconnect_done
> 2016-08-16 19:55:20.936637 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:20.936644 7f337d720700 1 mds.0.139 handle_mds_map state change up:reconnect --> up:rejoin
> 2016-08-16 19:55:20.936655 7f337d720700 1 mds.0.139 rejoin_start
> 2016-08-16 19:55:20.939400 7f337d720700 1 mds.0.139 rejoin_joint_start
> 2016-08-16 19:55:20.967137 7f337d720700 1 mds.0.139 rejoin_done
> 2016-08-16 19:55:22.095034 7f337d720700 1 mds.0.139 handle_mds_map i am now mds.0.139
> 2016-08-16 19:55:22.095044 7f337d720700 1 mds.0.139 handle_mds_map state change up:rejoin --> up:active
> 2016-08-16 19:55:22.095062 7f337d720700 1 mds.0.139 recovery_done -- successful recovery!
> 2016-08-16 19:55:22.095266 7f337d720700 1 mds.0.139 active_start
> 2016-08-16 19:55:22.099007 7f337d720700 1 mds.0.139 cluster recovered.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com