Re: ceph replication and data redundancy

(Sorry for the blank email just now, my client got a little eager!)  

Apart from the things that Wido has mentioned, you say you've set up 4 nodes and each one has a monitor on it. That's why you can't do anything when you bring down two nodes — the monitor cluster requires a strict majority in order to continue operating, which is why we recommend odd numbers. If you set up a different node as a monitor (simulating one in a different data center) and then bring down two nodes, things should keep working.
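
For reference, a quick way to keep an eye on the monitor quorum while testing (these are standard ceph CLI commands; the comments just spell out the arithmetic for this setup):

# list the monitors in the map and show which ones currently form the quorum
ceph mon stat
ceph quorum_status

# With 4 monitors a majority is 3, so taking down 2 nodes leaves only 2 mons
# and the cluster freezes. 3 or 5 monitors, placed so that no single building
# holds half of them, avoids this.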
-Greg


On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:

> Hi,
>  
> On 01/17/2013 10:55 AM, Ulysse 31 wrote:
> > Hi all,
> >  
> > I'm not sure if this is the right mailing list; if not, sorry about
> > that, just tell me the appropriate one and I'll post there.
> > Here is my current project:
> > The company I work for has several buildings, each linked to the others
> > with gigabit trunk links, which lets us put machines in different
> > buildings on the same LAN.
> > We need to archive some data (around 5 to 10 TB), and we want that data
> > present in each building so that, if we lose a building (catastrophe
> > scenario), we still have the data.
> > Rather than using simple storage machines synced with rsync, we thought
> > about reusing older desktop machines we have in stock and building a
> > clustered FS on them.
> > Speed is clearly not the goal of this storage; we would just put old
> > projects on it from time to time and access them in rare cases. The
> > most important thing is to keep that data archived somewhere.
>  
>  
>  
> OK, keep that in mind. All writes to RADOS are synchronous, so if you
> experience high latency or congestion on your network, Ceph will
> become slow.
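>  
> If you want a rough idea of what latency/throughput your inter-building
> links give you before committing data, something like this should do (a
> quick sketch; 'benchpool' is just a throw-away pool name, delete it when
> you're done):
>  
> # create a small throw-away pool and benchmark 30 seconds of writes
> ceph osd pool create benchpool 64
> rados -p benchpool bench 30 write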
>  
> > I was interested in Ceph because the CRUSH map lets us declare, in a
> > hierarchical manner, where replicated data should be placed.
> > So as a test, I built a sample cluster of 4 nodes running Debian
> > Squeeze and the current Bobtail stable release of Ceph.
> > In this sample I wanted to simulate 2 nodes "per building"; each node
> > has a 2 TB disk and runs mon/osd/mds (I know that's not optimal, but
> > it's just a test). The OSDs use XFS on /dev/sda3, and I made a CRUSH
> > map like this:
> > ---
> > # begin crush map
> >  
> > # devices
> > device 0 osd.0
> > device 1 osd.1
> > device 2 osd.2
> > device 3 osd.3
> >  
> > # types
> > type 0 osd
> > type 1 host
> > type 2 rack
> > type 3 row
> > type 4 room
> > type 5 datacenter
> > type 6 root
> >  
> > # buckets
> > host server-0 {
> >         id -2           # do not change unnecessarily
> >         # weight 1.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item osd.0 weight 1.000
> > }
> > host server-1 {
> >         id -5           # do not change unnecessarily
> >         # weight 1.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item osd.1 weight 1.000
> > }
> > host server-2 {
> >         id -6           # do not change unnecessarily
> >         # weight 1.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item osd.2 weight 1.000
> > }
> > host server-3 {
> >         id -7           # do not change unnecessarily
> >         # weight 1.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item osd.3 weight 1.000
> > }
> > rack bat0 {
> >         id -3           # do not change unnecessarily
> >         # weight 3.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item server-0 weight 1.000
> >         item server-1 weight 1.000
> > }
> > rack bat1 {
> >         id -4           # do not change unnecessarily
> >         # weight 3.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item server-2 weight 1.000
> >         item server-3 weight 1.000
> > }
> > root root {
> >         id -1           # do not change unnecessarily
> >         # weight 3.000
> >         alg straw
> >         hash 0          # rjenkins1
> >         item bat0 weight 3.000
> >         item bat1 weight 3.000
> > }
> >  
> > # rules
> > rule data {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take root
> >         step chooseleaf firstn 0 type rack
> >         step emit
> > }
> > rule metadata {
> >         ruleset 1
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take root
> >         step chooseleaf firstn 0 type rack
> >         step emit
> > }
> > rule rbd {
> >         ruleset 2
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take root
> >         step chooseleaf firstn 0 type rack
> >         step emit
> > }
> > # end crush map
> > ---
> >  
> > Using this CRUSH map, together with a default data pool size of 2
> > (replication 2), I could be sure to have a copy of all data in both
> > "sample buildings" bat0 and bat1.
> > Then I mounted it on a client with ceph-fuse: ceph-fuse -m
> > server-2:6789 /mnt/mycephfs (server-2 is located in bat1). Everything
> > works fine as expected; I can read and write data from one or more
> > clients, no problems there.
>  
>  
>  
> Just to repeat: CephFS is still in development and can be buggy sometimes.
>  
> Also, if you do this, make sure you have an Active/Standby MDS setup  
> where each building has an MDS.
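>  
> A minimal sketch of what that could look like in ceph.conf, using the
> host names from your test setup (with two MDS daemons defined, one
> becomes active and the other goes into standby by default):
>  
> [mds.a]
>         host = server-0        ; MDS in bat0
>  
> [mds.b]
>         host = server-2        ; MDS in bat1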
>  
> > Then I began stress tests. I simulated the loss of one node; no
> > problem there, I could still access the cluster data.
> > Finally I simulated the loss of a building (bat0) by bringing down
> > server-0 and server-1. The result was a hang of the cluster, with no
> > more access to any data. ceph -s on the remaining nodes hangs with:
> >  
> > 2013-01-17 09:14:18.327911 7f4e5ca70700 0 -- xxx.xxx.xxx.52:0/16543 >> xxx.xxx.xxx.51:6789/0 pipe(0x2c9d490 sd=3 :0 pgs=0 cs=0 l=1).fault
> >  
> > I started searching the net and may have found the answer: the problem
> > comes from the fact that my rules use "step chooseleaf firstn 0 type
> > rack", which does give me data replicated across both buildings, but
> > seems to hang when a building is missing...
> > I know that geo-replication is currently under development, but is
> > there a way to do what I'm trying to do without it?
> > Thanks for your help and answers.
>  
>  
>  
> Pools nowadays have a "min_size"; if the number of available replicas
> drops below that, the PGs become incomplete and don't work.
>  
> You have to set this to 1 for your 'data' and 'metadata' pools:
>  
> ceph osd pool set data min_size 1
> ceph osd pool set metadata min_size 1
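>  
> You can check afterwards that the change took effect; the pool lines in
> the OSD map show the current min_size:
>  
> ceph osd dump | grep ^pool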
>  
> You might want to test this with plain RADOS instead of the filesystem,  
> just to be sure.
>  
> Try creating a new pool, write some data into it with the 'rados' tool,
> and see if it still works when you bring a building down.
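>  
> Roughly like this (pool, object and file names are just examples;
> ruleset 0 is the 'data' rule from your CRUSH map, and 64 PGs is plenty
> for a small test):
>  
> # a test pool with 2 copies, min_size 1, using the rack-level rule
> ceph osd pool create testpool 64
> ceph osd pool set testpool size 2
> ceph osd pool set testpool min_size 1
> ceph osd pool set testpool crush_ruleset 0
>  
> # write an object, check which OSDs hold it, then read it back
> rados -p testpool put testobj /etc/hosts
> ceph osd map testpool testobj
> rados -p testpool get testobj /tmp/testobj.out
>  
> If you then bring down server-0 and server-1, the get should still
> succeed with min_size 1.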
>  
> Wido
>  
> > Best Regards,
> >  
> >  
> >  
> >  
> >  
> > --
> > Gomes do Vale Victor
> > System, Network and Security Engineer



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

