Re: Standby-replay mds: 10.2.2

Goncalo Borges <goncalo.borges@xxxxxxxxxxxxx> · Tue, 15 Nov 2016 10:35:35 +1100

Hi John...

Thanks for replying.

Some of the requested input is inline.

Cheers

Goncalo

We are currently undergoing an infrastructure migration. One of the first
machines to go through this migration process is our standby-replay mds. We
are running 10.2.2. My plan is to:
Is the 10.2.2 here a typo?  What's the current version that you're
upgrading to 10.2.2 from?

There is no typo. We are not planning to upgrade (for now), just to
redeploy the standby-replay mds server with the same version we
currently run everywhere (10.2.2). So it is not an upgrade but a simple
redeployment on a different infrastructure.

- Shutdown the standby-replay mds
- Reinstall it with 10.2.2 on a different host, reusing the same IP, keys and
configurations.
Any particular reason for keeping the same IP?  In general you don't
need to worry about that at all: I'd usually just delete the old MDS
entirely and create a new one, only keeping the ceph.conf section that
configures your standby replay options.

It is just easier for us to reuse the same hostname and IP.

- Start the mds service

I wasn't thinking this was problematic until I read:
http://tracker.ceph.com/issues/17466

The issue mentioned above started when the site admin added a new mds.
He also did an (unintended) upgrade of the mds(es) from 10.2.1 to 10.2.3, but
I am not sure if this is the cause of the problem. His mons started to fail
because they got an invalid fscid, and the reason is some incoherent
ordering of rank and fscid between the constructor and the struct.
The actual issue (we think) was that the message decode was getting
junk value for fscid when the beacon was sent by an older MDS due to a
missing default initialisation, and then that the MDSMonitor was
failing to validate that.

This code path was only hit in cases where standby_for_rank was set,
so for that particular symptom you should be okay if you just don't
set standby_for_rank at all (if you have one MDS, your standby replay
daemon will always pick up that rank).

This is our configuration for mds(es):

[mds.rccephmds]
host = rccephmds
mds standby replay = true

[mds.rccephmds2]
host = rccephmds2
mds standby_for_rank = rccephmds
mds standby replay = true

At the time we deployed these servers, I set 'standby_for_rank'
because my understanding was that we had to specify the mds rank we
wanted the standby-replay mds to follow (replay its journal and keep a
warm cache).

From your comment, I understand that:
- My current config has the potential to trigger the issue mentioned above;
- However, since I only have one active mds, this config is unnecessary:
the standby-replay mds will start replaying the journal of that (single)
active mds rank. So, if I simply comment out the 'standby_for_rank' config,
I would be safe and out of the problematic code path.
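
To make this concrete, the standby section would then look roughly like
this. It is only a sketch based on the config quoted above, not something I
have tested yet; the rccephmds section would stay as it is:

```ini
[mds.rccephmds2]
host = rccephmds2
# commented out to stay clear of the code path discussed in issue 17466
# mds standby_for_rank = rccephmds
mds standby replay = true
```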

Can you please confirm whether my conclusions are correct?

Thanks for the help (as always)

Cheers
Goncalo

John

I just want to be sure that I won't hit a similar issue:
- In what exact circumstances is this problem triggered?
- Is it triggered when you add a brand new standby-replay mds (new IP, new
key)? I am hoping that in my case I shouldn't be affected.

TIA
Goncalo

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937
