Hi everybody,

In fact, searching the documentation, I found information in the "Adding/Removing a Monitor" section about the Paxos system used for quorum establishment. Following the documentation, in a catastrophe scenario I need to remove the monitors configured in the other buildings. For better efficiency, I think I'll keep one monitor per building, and, if the two other buildings fail, I will delete those two monitors from the configuration in order to access the data again. I'll simulate that and see if it goes well.

Thanks for your help and advice.

Regards,
--
Gomes do Vale Victor
System, Network and Security Engineer

2013/1/20 Gregory Farnum <greg@xxxxxxxxxxx>:
> (Sorry for the blank email just now, my client got a little eager!)
>
> Apart from the things that Wido has mentioned, you say you've set up 4 nodes
> and each one has a monitor on it. That's why you can't do anything when you
> bring down two nodes: the monitor cluster requires a strict majority in
> order to continue operating, which is why we recommend odd numbers. If you
> set up a fifth node as a monitor (simulating one in a different data
> center) and then bring down two nodes, things should keep working.
> -Greg
>
>
> On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:
>
>> Hi,
>>
>> On 01/17/2013 10:55 AM, Ulysse 31 wrote:
>> > Hi all,
>> >
>> > I'm not sure if this is the right mailing list; if not, sorry for that,
>> > just tell me the appropriate one and I'll go for it.
>> > Here is my current project:
>> > The company I work for has several buildings, each of them linked by
>> > gigabit trunk links, allowing us to have multiple machines on the same
>> > LAN across different buildings.
>> > We need to archive some data (5 to 10 TB), and we want that data
>> > present in each building so that, if we lose a building (catastrophe
>> > scenario), we still have the data.
>> > Rather than using simple storage machines synced by rsync, we thought
>> > of reusing older desktop machines we have in stock and building a
>> > clustered FS on them.
>> > In fact, speed is clearly not the goal of this data storage; we would
>> > just store old projects on it occasionally and access it in rare
>> > cases. The most important thing is to keep that data archived somewhere.
>>
>> Ok, keep that in mind. All writes to RADOS are synchronous, so if you
>> experience high latency or some congestion on your network, Ceph will
>> become slow.
>>
>> > I was interested in Ceph because the CRUSH map lets you declare a
>> > hierarchical way to place replicated data.
>> > So for a test, I built a sample cluster composed of 4 nodes, installed
>> > under Debian Squeeze and the current Bobtail stable version of Ceph.
>> > In my sample I wanted to simulate 2 "per building" nodes; each node
>> > has a 2 TB disk and runs mon/osd/mds (I know this is not optimal, but
>> > it's just a sample). The OSDs use XFS on /dev/sda3, and I made a
>> > CRUSH map like:
>> > ---
>> > # begin crush map
>> >
>> > # devices
>> > device 0 osd.0
>> > device 1 osd.1
>> > device 2 osd.2
>> > device 3 osd.3
>> >
>> > # types
>> > type 0 osd
>> > type 1 host
>> > type 2 rack
>> > type 3 row
>> > type 4 room
>> > type 5 datacenter
>> > type 6 root
>> >
>> > # buckets
>> > host server-0 {
>> >     id -2    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.0 weight 1.000
>> > }
>> > host server-1 {
>> >     id -5    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.1 weight 1.000
>> > }
>> > host server-2 {
>> >     id -6    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.2 weight 1.000
>> > }
>> > host server-3 {
>> >     id -7    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.3 weight 1.000
>> > }
>> > rack bat0 {
>> >     id -3    # do not change unnecessarily
>> >     # weight 3.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item server-0 weight 1.000
>> >     item server-1 weight 1.000
>> > }
>> > rack bat1 {
>> >     id -4    # do not change unnecessarily
>> >     # weight 3.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item server-2 weight 1.000
>> >     item server-3 weight 1.000
>> > }
>> > root root {
>> >     id -1    # do not change unnecessarily
>> >     # weight 3.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item bat0 weight 3.000
>> >     item bat1 weight 3.000
>> > }
>> >
>> > # rules
>> > rule data {
>> >     ruleset 0
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step chooseleaf firstn 0 type rack
>> >     step emit
>> > }
>> > rule metadata {
>> >     ruleset 1
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step chooseleaf firstn 0 type rack
>> >     step emit
>> > }
>> > rule rbd {
>> >     ruleset 2
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step chooseleaf firstn 0 type rack
>> >     step emit
>> > }
>> > # end crush map
>> > ---
>> >
>> > Using this CRUSH map, coupled with a default pool data size of 2
>> > (replication 2), I was sure to have a copy of all data in both
>> > "sample buildings" bat0 and bat1.
>> > Then I mounted the filesystem on a client with ceph-fuse, using:
>> > ceph-fuse -m server-2:6789 /mnt/mycephfs (server-2 is located in
>> > bat1). Everything worked as expected: I could write and read data
>> > from one or more clients, no problems there.
>>
>> Just to repeat: CephFS is still in development and can be buggy sometimes.
>>
>> Also, if you do this, make sure you have an active/standby MDS setup
>> where each building has an MDS.
>>
>> > Then I began stress tests. I simulated the loss of one node: no
>> > problem, I could still access the cluster data.
>> > Finally, I simulated the loss of a building (bat0) by bringing down
>> > server-0 and server-1.
The result was a hang of the cluster: no more
>> > access to any data. ceph -s on the surviving nodes hung with:
>> >
>> > 2013-01-17 09:14:18.327911 7f4e5ca70700 0 -- xxx.xxx.xxx.52:0/16543
>> > > > xxx.xxx.xxx.51:6789/0 pipe(0x2c9d490 sd=3 :0 pgs=0 cs=0 l=1).fault
>> >
>> > I searched the net and may have found the answer: the problem comes
>> > from the fact that my rules use "step chooseleaf firstn 0 type rack",
>> > which does give me data replicated in both buildings, but seems to
>> > hang if a building is missing.
>> > I know that geo-replication is currently under development, but is
>> > there a way to do what I'm trying to do without it?
>> > Thanks for your help and answers.
>>
>> Pools nowadays have a "min_size"; if the number of available replicas
>> drops below that, they become incomplete and stop serving I/O.
>>
>> You have to set this to 1 for your 'data' and 'metadata' pools:
>>
>> ceph osd pool set data min_size 1
>> ceph osd pool set metadata min_size 1
>>
>> You might want to test this with plain RADOS instead of the filesystem,
>> just to be sure.
>>
>> Try creating a new pool and use the 'rados' tool to write some data and
>> see if it works when you bring a building down.
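Wido's RADOS test could be sketched roughly as below. This is a hedged sketch, not a verified procedure: the pool name `failtest` and the PG count are arbitrary, and it assumes the Bobtail-era `ceph osd pool set` syntax and that the default data rule (ruleset 0) spreads replicas across both racks.

```shell
# Create a small test pool, replicated once per rack by the data rule,
# and allow I/O to continue with a single surviving replica.
ceph osd pool create failtest 64
ceph osd pool set failtest size 2
ceph osd pool set failtest min_size 1

# Write a test object, then read it back to confirm the pool works.
echo "archive test" > /tmp/obj.txt
rados -p failtest put testobj /tmp/obj.txt
rados -p failtest get testobj /tmp/obj.out

# Now power off server-0 and server-1 (all of bat0) and retry the read;
# with min_size 1 the object should still be readable from bat1.
rados -p failtest get testobj /tmp/obj.out2
```

If the second `get` still hangs, the problem is with the monitors (quorum) rather than with the OSDs and the CRUSH rule.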
>>
>> Wido
>>
>> > Best Regards,
>> >
>> > --
>> > Gomes do Vale Victor
>> > System, Network and Security Engineer
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
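The monitor-removal plan described at the top of this thread (regaining quorum after losing two of three buildings) could be sketched as below. This follows the documented procedure for removing monitors from a cluster that has lost quorum; the monitor IDs (`a`, `b` lost, `c` surviving), the monmap path, and the init-script invocation are assumptions and may differ per setup and Ceph release.

```shell
# Assumption: mon.a and mon.b were in the two lost buildings;
# mon.c survives. Run this on the surviving node.

# Stop the surviving monitor and extract its current monitor map.
service ceph stop mon.c
ceph-mon -i c --extract-monmap /tmp/monmap

# Remove the two unreachable monitors from the map.
monmaptool --rm a --rm b /tmp/monmap

# Inject the trimmed map and restart; mon.c can now form quorum alone.
ceph-mon -i c --inject-monmap /tmp/monmap
service ceph start mon.c
```

Note that the usual `ceph mon remove <name>` command will not work here, since it requires an already-established quorum; that is why the map has to be edited offline with monmaptool.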