On Mon, Nov 14, 2016 at 11:35 PM, Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> wrote:
> Hi John...
>
> Thanks for replying.
>
> Some of the requested input is inline.
>
> Cheers
>
> Goncalo
>
>>> We are currently undergoing an infrastructure migration. One of the first
>>> machines to go through this migration process is our standby-replay mds.
>>> We are running 10.2.2. My plan is to:
>>
>> Is the 10.2.2 here a typo? What's the current version that you're
>> upgrading to 10.2.2 from?
>
> There is no typo. We are not planning to upgrade (for now) but simply
> redeploy the standby-replay mds server with the same version we currently
> have everywhere (10.2.2). So, it is not an upgrade but a simple redeployment
> in a different infrastructure.
>
>>> - Shut down the standby-replay mds
>>> - Reinstall it with 10.2.2 on a different host, reusing the same IP, keys
>>> and configurations.
>>
>> Any particular reason for keeping the same IP? In general you don't
>> need to worry about that at all: I'd usually just delete the old MDS
>> entirely and create a new one, only keeping the ceph.conf section that
>> configures your standby replay options.
>
> It is just easier for us to reuse the same hostname and IP.
>
>>> - Start the mds service
>>>
>>> I wasn't thinking this was problematic until I read:
>>> http://tracker.ceph.com/issues/17466
>>>
>>> The issue mentioned above started when the site admin added a new mds.
>>> He also did an (unintended) upgrade of the mds(es) from 10.2.1 to 10.2.3
>>> but I am not sure if this is the reason for the problem. His mons started
>>> to fail because they got an invalid fscid, and the reason is some
>>> incoherent ordering of rank and fscid between the constructor and the
>>> struct.
>>
>> The actual issue (we think) was that the message decode was getting a
>> junk value for fscid when the beacon was sent by an older MDS due to a
>> missing default initialisation, and then that the MDSMonitor was
>> failing to validate that.
>>
>> This code path was only hit in cases where standby_for_rank was set,
>> so for that particular symptom you should be okay if you just don't
>> set standby_for_rank at all (if you have one MDS, your standby replay
>> daemon will always pick up that rank).
>
> This is our configuration for mds(es):
>
> [mds.rccephmds]
> host = rccephmds
> mds standby replay = true
>
> [mds.rccephmds2]
> host = rccephmds2
> mds standby_for_rank = rccephmds
> mds standby replay = true
>
> At the time we deployed these servers, I set up 'standby_for_rank' because
> my understanding was that we had to specify the mds rank we would like the
> standby-replay mds to follow (replay its journal and keep a warm cache).
>
> From your comment, I understand that:
> - My current config has the potential to trigger the issue mentioned above;
> - However, since I only have one active mds, this config is unnecessary: the
> standby-replay mds will start replaying the journal of that (single) active
> mds rank. So, if I simply comment out the 'standby_for_rank' config, I would
> be safe and out of the problematic code.
>
> Can you just give a last confirmation word if my conclusions are correct?

Actually your current standby_for_rank is probably being ignored, because
that setting has to be an integer. To pass the name of another mds you'd
use "standby_for_name".

If you're not upgrading anything and just moving an MDS daemon then I
don't think you have anything to worry about. I'd remove that
standby_for_rank line anyway though.

John

> Thanks for the help (as always)
>
> Cheers
> Goncalo
>
>> John
>>
>>> I just want to be sure that I won't hit a similar issue:
>>> - In what exact circumstances is this problem triggered?
>>> - Is it triggered when you add a brand new standby-replay mds (new IP,
>>> new key)? I am hoping that in my case, I shouldn't be affected.
>>>
>>> TIA
>>> Goncalo
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW 2006
> T: +61 2 93511937
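
As a rough sketch of the advice in John's reply above (not a tested configuration), the two MDS sections could look like the fragment below once the standby_for_rank line is dropped. The commented-out standby_for_name line only illustrates the name-based option John mentions. Its exact spelling here mirrors the style of the original config and is an assumption, as is whether it is needed at all with a single active MDS.

```ini
[mds.rccephmds]
host = rccephmds
mds standby replay = true

[mds.rccephmds2]
host = rccephmds2
mds standby replay = true
# standby_for_rank removed: per the reply above it must be an integer rank,
# so a daemon name there is probably ignored.
# Assumption: to follow a specific daemon by name, use the name-based option:
# mds standby_for_name = rccephmds
```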