Active vs. standby MDSes (Was: Re: Degraded PGs blocking open()?)

Hey Greg,

On 2011. June 7. 04:15:31 Gregory Farnum wrote:
> 2011/6/6 Székelyi Szabolcs <szekelyi@xxxxxxx>:
> > I have a three node ceph setup, two nodes playing all three roles (OSD,
> > MDS, MON), and one being just a monitor (which happens to be the client
> > I'm using the filesystem from).
> > 
> > I want to achieve high availability by mirroring all data between the OSDs
> > and being able to still access everything even if one of them goes down.
> > The mirroring works fine, I see the space being consumed on both nodes
> > as I copy data on the file system. According to `ceph -s`, all PGs are
> > in active+clean state. If I start reading a big file and shut down one
> > of the (OSD+MDS+MON) nodes, the file can still be read until the end,
> > that's fine. Moreover, the contents read back seem correct when compared
> > to the original file. Very nice. But if I start reading the file while
> > one of the nodes is down, it blocks until the node comes up again. I
> > can't even kill the reading process with KILL, TERM, or INT.
> > 
> > Am I doing something wrong, or was not careful enough reading the docs,
> > or may this be a bug? My ceph.conf is attached.
> 
> The problem isn't in the OSD, it's the MDS. :)
> 
> The MDS system is *slightly* less resilient than the OSD system is.
> You can set up "standby" MDSes that will take over if the system
> detects that an MDS has died; you can even set up "standby-replay"
> MDSes that follow a specific MDS and keep all its data cached in
> memory so they can take over right when a failure is detected. But if
> you lose one MDS its data won't automatically be imported into the
> remaining MDSes. (Because the MDS keeps all its data on the OSDs,
> there's no danger of losing data -- it's a matter of how the data is
> segregated that requires a new daemon. And generally the process is
> dominated by the timeout, not the time it takes the new MDS to take
> over.)

Thanks for the clarification. I still have a few questions.
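
Just to check my reading of the standby-replay part first: I guess a hot
spare that follows mds.alpha would look something like this in ceph.conf
(mds.gamma, mon0, and alpha are only placeholders for my own names):

    [mds.gamma]
            host = mon0
            ; shadow mds.alpha and keep its journal replayed in memory
            mds standby replay = true
            mds standby for name = alpha

Assuming that's right, on to the questions.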

If I understand things correctly, Ceph tries to keep max_mds MDSes active at 
all times. I can have more MDSes than that, but the excess ones will be 
standbys, right?
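
For reference, I've been changing max_mds with what I believe is the right
command; please correct me if that's not how it's meant to be adjusted:

    $ ceph mds set_max_mds 2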

I don't really understand the difference between a standby and an active MDS. 
Right now I have two active and no standby MDSes, and the filesystem stops 
working if I kill either of them. Does this mean the system stops working 
whenever it can't bring the number of active MDSes back up to max_mds from 
the standby pool?

What is the reason for running standby MDSes instead of simply setting 
max_mds to the total number of MDSes?

> So in your case, you're trying to open a file that is controlled by
> the MDS that you killed, and the client can't get the "capability"
> bits that it needs in order to look at the file. So you've got a few
> options:
> 1) Kill the OSD, but not the MDS.

Well, if a machine crashes, then both fall victim. :(

> 2) Create an extra MDS daemon, perhaps on your monitor node. When the
> system detects that one of your MDSes has died (a configurable
> timeout, IIRC in the neighborhood of 30-60 seconds), this extra daemon
> will take over.

I will do this. Should this be a standby or an active MDS? I.e., should I 
increase max_mds from 2 to 3 after creating the new MDS?
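
In case it matters, this is roughly what I plan to add to ceph.conf on the
monitor node (mds.gamma and mon0 are, again, just my names for the new daemon
and its host):

    [mds.gamma]
            host = mon0

and then, if I'm reading the init script right, start it with:

    $ /etc/init.d/ceph start mds.gamma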

> (Or you can just start up the new daemon after you kill the old one,
> doesn't matter.)
> 3) Create a new system with only one MDS and don't kill that one.
> (Eventually you will be able to shrink the number of MDSes, but this
> isn't well-tested or documented so I'm not sure what state it's in
> right now.)

This is not an option, since it would create a single point of failure, which 
is exactly what I'm trying to avoid by using Ceph.

Thanks,
-- 
cc
