FWIW, below is our MDS log from an MDS going from standby-replay to active; it takes a really long time, especially in rejoin. The FS has 100+ clients and a few million files, so it seems that an MDS restart/replace is not as **lightweight** as we had been assuming.

2017-09-29 08:07:27.738118 7fe34d085700 1 mds.0.0 replay_done (as standby)
2017-09-29 08:07:28.835714 7fe34d085700 1 mds.0.0 replay_done (as standby)
2017-09-29 08:07:29.932846 7fe34d085700 1 mds.0.0 replay_done (as standby)
2017-09-29 08:07:31.034661 7fe353091700 1 mds.0.5419 handle_mds_map i am now mds.0.5419
2017-09-29 08:07:31.034663 7fe353091700 1 mds.0.5419 handle_mds_map state change up:standby-replay --> up:replay
2017-09-29 08:07:31.063181 7fe34d085700 1 mds.0.5419 replay_done (as standby)
2017-09-29 08:07:31.063201 7fe34d085700 1 mds.0.5419 standby_replay_restart (final takeover pass)
2017-09-29 08:07:31.168992 7fe34d085700 1 mds.0.5419 replay_done
2017-09-29 08:07:31.169005 7fe34d085700 1 mds.0.5419 making mds journal writeable
2017-09-29 08:07:32.046255 7fe353091700 1 mds.0.5419 handle_mds_map i am now mds.0.5419
2017-09-29 08:07:32.046257 7fe353091700 1 mds.0.5419 handle_mds_map state change up:replay --> up:resolve
2017-09-29 08:07:32.046265 7fe353091700 1 mds.0.5419 resolve_start
2017-09-29 08:07:32.046267 7fe353091700 1 mds.0.5419 reopen_log
2017-09-29 08:07:32.046274 7fe353091700 1 mds.0.5419 recovery set is 1
2017-09-29 08:08:14.856587 7fe353091700 1 mds.0.cache handle_mds_failure mds.1 : recovery peers are 1
2017-09-29 08:08:15.863834 7fe353091700 1 mds.0.5419 recovery set is 1
2017-09-29 08:08:15.868900 7fe356034700 0 -- 10.148.245.147:6800/3329170275 >> 10.148.245.145:6804/3466203682 conn(0x55b64a7c6000 :6800 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 0 vs existing csq=0 existing_state=STATE_CONNECTING
2017-09-29 08:08:19.957946 7fe353091700 1 mds.0.5419 resolve_done
2017-09-29 08:08:24.950118 7fe353091700 1 mds.0.5419 handle_mds_map i am now mds.0.5419
2017-09-29 08:08:24.950171 7fe353091700 1 mds.0.5419 handle_mds_map state change up:resolve --> up:reconnect
2017-09-29 08:08:24.950194 7fe353091700 1 mds.0.5419 reconnect_start
2017-09-29 08:08:24.950634 7fe353091700 1 mds.0.server reconnect_clients -- 186 sessions
2017-09-29 08:08:24.950839 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34337 10.148.245.251:0/2769043642 after 0.000084
2017-09-29 08:08:24.960074 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34330 10.148.185.72:0/2433467420 after 0.009367
2017-09-29 08:08:24.964139 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34322 10.148.190.129:0/647912525 after 0.013405
2017-09-29 08:08:24.964331 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.104245 10.148.184.144:0/2533535048 after 0.013667
2017-09-29 08:08:24.964843 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34319 10.148.190.128:0/99750648 after 0.014185
2017-09-29 08:08:24.966976 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.104212 10.148.184.140:0/1486286861 after 0.016276
2017-09-29 08:08:24.967093 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.104225 10.148.184.131:0/3388986983 after 0.016435
2017-09-29 08:08:24.967263 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.104255 10.148.184.130:0/4145162106 after 0.016546
2017-09-29 08:08:24.967344 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34313 10.148.190.155:0/2880915725 after 0.016690
2017-09-29 08:08:42.996768 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34328 10.148.178.59:0/3803632372 after 18.046088
2017-09-29 08:08:43.886804 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34315 10.148.190.180:0/3130069617 after 18.936122
2017-09-29 08:08:44.696920 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34280 10.148.190.176:0/3500883319 after 19.746237
2017-09-29 08:08:45.531987 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34225 10.148.178.35:0/1295837455 after 20.581320
2017-09-29 08:08:46.524033 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34160 10.148.178.28:0/3696130157 after 21.573369
2017-09-29 08:08:47.285174 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34324 10.148.190.167:0/2637778599 after 22.334416
2017-09-29 08:08:48.173818 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34140 10.148.178.37:0/474083565 after 23.223125
2017-09-29 08:08:48.976772 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34128 10.148.178.40:0/3470525153 after 24.026070
2017-09-29 08:08:49.680177 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34283 10.148.190.172:0/561253376 after 24.729396
2017-09-29 08:08:50.457263 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34231 10.148.178.21:0/686475218 after 25.506604
2017-09-29 08:08:51.216427 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34204 10.148.178.64:0/2775534546 after 26.265759
2017-09-29 08:08:52.063327 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34207 10.148.178.66:0/2705035982 after 27.112667
2017-09-29 08:08:52.726760 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34178 10.148.178.26:0/168300491 after 27.775987
2017-09-29 08:08:53.346579 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34169 10.148.178.31:0/561092381 after 28.395917
2017-09-29 08:08:53.998588 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34237 10.148.178.33:0/718178560 after 29.047928
2017-09-29 08:08:54.810011 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34219 10.148.178.41:0/3334906872 after 29.859354
2017-09-29 08:08:55.692896 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34187 10.148.178.63:0/2527423535 after 30.742233
2017-09-29 08:08:56.663520 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34287 10.148.190.175:0/1683041168 after 31.712852
2017-09-29 08:08:57.501470 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34272 10.148.190.178:0/2415576439 after 32.550770
2017-09-29 08:08:58.324454 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34175 10.148.178.43:0/2792043288 after 33.373769
2017-09-29 08:08:59.197349 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34143 10.148.178.49:0/119222329 after 34.246687
2017-09-29 08:08:59.981611 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34149 10.148.178.20:0/2783244186 after 35.030950
2017-09-29 08:09:00.884353 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34216 10.148.178.32:0/2085010403 after 35.933695
2017-09-29 08:09:01.687472 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34303 10.148.190.183:0/2441938068 after 36.736810
2017-09-29 08:09:03.390089 7fe353091700 0 log_channel(cluster) log [DBG] : reconnect by client.34112 10.148.245.113:0/556076324 after 38.439312
2017-09-29 08:09:03.390209 7fe353091700 1 mds.0.5419 reconnect_done
2017-09-29 08:09:03.451448 7fe353091700 1 mds.0.5419 handle_mds_map i am now mds.0.5419
2017-09-29 08:09:03.451468 7fe353091700 1 mds.0.5419 handle_mds_map state change up:reconnect --> up:rejoin
2017-09-29 08:09:03.451487 7fe353091700 1 mds.0.5419 rejoin_start
2017-09-29 08:09:08.306597 7fe353091700 1 mds.0.5419 rejoin_joint_start
2017-09-29 08:12:46.172231 7fe353091700 1 mds.0.5419 rejoin_done
2017-09-29 08:12:49.354507 7fe353091700 1 mds.0.5419 handle_mds_map i am now mds.0.5419
2017-09-29 08:12:49.354518 7fe353091700 1 mds.0.5419 handle_mds_map state change up:rejoin --> up:clientreplay
2017-09-29 08:12:49.354529 7fe353091700 1 mds.0.5419 recovery_done -- successful recovery!
2017-09-29 08:12:49.354944 7fe353091700 1 mds.0.5419 clientreplay_start
2017-09-29 08:12:50.325607 7fe34e888700 1 mds.0.5419 clientreplay_done
2017-09-29 08:12:50.983391 7fe353091700 1 mds.0.5419 handle_mds_map i am now mds.0.5419
2017-09-29 08:12:50.983393 7fe353091700 1 mds.0.5419 handle_mds_map state change up:clientreplay --> up:active
2017-09-29 08:12:50.983410 7fe353091700 1 mds.0.5419 active_start
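
For what it's worth, the per-phase numbers can be pulled straight out of the handle_mds_map "state change" lines above. A rough sketch of one way to do it (the file name mds.0.log is just a placeholder, and it assumes the log does not cross midnight):

  grep 'handle_mds_map state change' mds.0.log | awk '
  {
      # $2 is the time of day; the last field is the state being entered
      split($2, t, ":")
      secs = t[1]*3600 + t[2]*60 + t[3]
      # print how long we spent in the previous state
      if (prev != "") printf "%-16s %8.2f s\n", state, secs - prev
      prev = secs
      state = $NF
  }'

On this log that works out to roughly 1 s in up:replay, 53 s in up:resolve, 38 s in up:reconnect, about 226 s in up:rejoin and under 2 s in up:clientreplay, i.e. rejoin alone is most of the roughly five-minute takeover.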

2017-09-28 1:01 GMT+08:00 Travis Nielsen <Travis.Nielsen@xxxxxxxxxxx>:
> Thanks for the clarification, and Rook does use Kubernetes facilities to
> handle the log collection so it sounds like we're good to go.
>
> On 9/27/17, 9:45 AM, "John Spray" <jspray@xxxxxxxxxx> wrote:
>
>> On Wed, Sep 27, 2017 at 5:36 PM, Travis Nielsen
>> <Travis.Nielsen@xxxxxxxxxxx> wrote:
>>> To expand on the scenario, I'm working in a Kubernetes environment where
>>> the MDS instances are somewhat ephemeral. If an instance (pod) dies or
>>> the machine is restarted, Kubernetes will start a new one in its place.
>>> To handle the failed pod scenario, I'd appreciate if you could help me
>>> understand MDS better.
>>>
>>> 1) MDS instances are stateless, correct? If so, I'm assuming when an MDS
>>> instance dies, a new MDS instance (with a new ID) can be brought up and
>>> assigned its rank without any side effects other than disruption during
>>> the failover. Or is there a reason to treat them more like mons that
>>> need to survive reboots and maintain state?
>>
>> Yep, completely stateless. Don't forget logs though -- for ephemeral
>> instances, it would be a good idea to have them sending their logs
>> somewhere central, so that we don't lose all the history whenever a
>> container restarts (you may very well have already covered this in
>> general in the context of Rook).
>>
>>> 2) Will there be any side effects from MDS instances being somewhat
>>> ephemeral? For example, if a new instance came up every hour or every
>>> day, what challenges would I run into besides cleaning up the old
>>> cephx keys?
>>
>> While switching daemons around is an online operation, it is not
>> without some impact to client IOs, and the freshly started MDS daemon
>> will generally have a less well populated cache than the one it is
>> replacing.
>>
>> John
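
(On the "cleaning up the old cephx keys" point quoted above: a minimal sketch of that housekeeping, assuming one key per daemon; the daemon name mds.mymds-old is made up for illustration.)

  # see which MDS keys the cluster currently knows about
  ceph auth list | grep '^mds\.'

  # once a replaced daemon is definitely gone, drop its key so it can
  # no longer authenticate to the cluster
  ceph auth del mds.mymds-old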

>>> Thanks!
>>> Travis
>>>
>>> On 9/27/17, 3:01 AM, "John Spray" <jspray@xxxxxxxxxx> wrote:
>>>
>>>> On Wed, Sep 27, 2017 at 12:09 AM, Travis Nielsen
>>>> <Travis.Nielsen@xxxxxxxxxxx> wrote:
>>>>> Is it possible to use the same cephx key for all instances of MDS or do
>>>>> they each require their own? Mons require the same keyring so I tried
>>>>> following the same pattern by creating a keyring with "mds.", but the
>>>>> MDS is complaining about not being authorized when it tries to start.
>>>>> Am I missing something or is this not possible for MDS keys? If I
>>>>> create a unique key for each MDS instance it works fine, but it would
>>>>> simplify my scenario if I could use the same key. I'm running on
>>>>> Luminous.
>>>>
>>>> I've never heard of anyone trying to do this.
>>>>
>>>> It's probably not a great idea, because if all MDS daemons are using
>>>> the same key then you lose the ability to simply remove an MDS's key
>>>> to ensure that it can't talk to the system any more. This is useful
>>>> when tearing something down, because it means you're not taking it on
>>>> faith that the daemon is really physically stopped.
>>>>
>>>> John
>>>>
>>>>> The key was generated with this:
>>>>> ceph auth get-or-create-key mds. osd allow * mds allow mon allow profile mds
>>>>>
>>>>> The keyring contents are:
>>>>> [mds.]
>>>>> key = AQD62spZw3zRGhAAkHHVokP3BDf8PEy4+vXGMg==
>>>>>
>>>>> I run the following with that keyring:
>>>>> ceph-mds --foreground --name=mds.mymds -i mymds
>>>>>
>>>>> And I see the error:
>>>>> 2017-09-26 22:55:55.973047 7fb004459200 -1 mds.mds81c2n ERROR: failed to authenticate: (22) Invalid argument
>>>>>
>>>>> Thanks,
>>>>> Travis
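
For reference, the per-daemon keys John recommends (and which Travis notes do work) can be created along the lines of what the Ceph docs suggest for adding an MDS; a sketch, where the daemon name mymds and the keyring path are illustrative and the caps may need tailoring for your setup:

  # one cephx key per MDS daemon, named after the daemon
  ceph auth get-or-create mds.mymds \
      mon 'allow profile mds' osd 'allow rwx' mds 'allow' \
      > /var/lib/ceph/mds/ceph-mymds/keyring

  # then start the daemon against its own keyring, as in the message above
  ceph-mds --foreground --name=mds.mymds -i mymds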