Hi everybody,

In fact, searching the documentation, I found information in the "Adding/Removing a Monitor" section about the Paxos system used for quorum establishment. Following the documentation, in a catastrophe scenario I need to remove the monitors configured in the other buildings. For better efficiency, I think I'll keep one monitor per building, and, if the two other buildings fail, I will delete those two monitors from the configuration in order to access the data again. I'll simulate that and see if it goes well.

Thanks for your help and advice.

Regards,
--
Gomes do Vale Victor
System, Network and Security Engineer

2013/1/20 Gregory Farnum <greg@xxxxxxxxxxx>:
> (Sorry for the blank email just now, my client got a little eager!)
>
> Apart from the things that Wido has mentioned, you say you've set up 4 nodes
> and each one has a monitor on it. That's why you can't do anything when you
> bring down two nodes: the monitor cluster requires a strict majority in
> order to continue operating, which is why we recommend odd numbers. If you
> set up a fifth node as a monitor (simulating one in a different data
> center) and then bring down two nodes, things should keep working.
> -Greg
>
>
> On Sunday, January 20, 2013 at 9:29 AM, Wido den Hollander wrote:
>
>> Hi,
>>
>> On 01/17/2013 10:55 AM, Ulysse 31 wrote:
>> > Hi all,
>> >
>> > I'm not sure if this is the right mailing list; if not, sorry for that,
>> > just tell me the appropriate one and I'll go for it.
>> > Here is my current project:
>> > The company I work for has several buildings, each of them linked by
>> > gigabit trunk links, allowing us to have multiple machines on the same
>> > LAN across different buildings.
>> > We need to archive some data (5 to 10 TB), and we want that data
>> > present in each building so that, if we lose a building (catastrophe
>> > scenario), we still have the data.
>> > Rather than using simple storage machines synced by rsync, we thought
>> > of reusing older desktop machines we have in stock and building a
>> > clustered FS on them.
>> > In fact, speed is clearly not the goal of this data storage; we would
>> > just store old projects on it occasionally and access it in rare
>> > cases. The most important thing is to keep that data archived somewhere.
>>
>> Ok, keep that in mind. All writes to RADOS are synchronous, so if you
>> experience high latency or some congestion on your network, Ceph will
>> become slow.
>>
>> > I was interested in Ceph because the CRUSH map lets you declare a
>> > hierarchical way to place replicated data.
>> > So for a test, I built a sample cluster composed of 4 nodes, installed
>> > under Debian Squeeze and the current Bobtail stable version of Ceph.
>> > In my sample I wanted to simulate 2 "per building" nodes; each node
>> > has a 2 TB disk and runs mon/osd/mds (I know this is not optimal, but
>> > it's just a sample). The OSDs use XFS on /dev/sda3, and I made a
>> > CRUSH map like:
>> > ---
>> > # begin crush map
>> >
>> > # devices
>> > device 0 osd.0
>> > device 1 osd.1
>> > device 2 osd.2
>> > device 3 osd.3
>> >
>> > # types
>> > type 0 osd
>> > type 1 host
>> > type 2 rack
>> > type 3 row
>> > type 4 room
>> > type 5 datacenter
>> > type 6 root
>> >
>> > # buckets
>> > host server-0 {
>> >     id -2    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.0 weight 1.000
>> > }
>> > host server-1 {
>> >     id -5    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.1 weight 1.000
>> > }
>> > host server-2 {
>> >     id -6    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.2 weight 1.000
>> > }
>> > host server-3 {
>> >     id -7    # do not change unnecessarily
>> >     # weight 1.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item osd.3 weight 1.000
>> > }
>> > rack bat0 {
>> >     id -3    # do not change unnecessarily
>> >     # weight 3.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item server-0 weight 1.000
>> >     item server-1 weight 1.000
>> > }
>> > rack bat1 {
>> >     id -4    # do not change unnecessarily
>> >     # weight 3.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item server-2 weight 1.000
>> >     item server-3 weight 1.000
>> > }
>> > root root {
>> >     id -1    # do not change unnecessarily
>> >     # weight 3.000
>> >     alg straw
>> >     hash 0   # rjenkins1
>> >     item bat0 weight 3.000
>> >     item bat1 weight 3.000
>> > }
>> >
>> > # rules
>> > rule data {
>> >     ruleset 0
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step chooseleaf firstn 0 type rack
>> >     step emit
>> > }
>> > rule metadata {
>> >     ruleset 1
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step chooseleaf firstn 0 type rack
>> >     step emit
>> > }
>> > rule rbd {
>> >     ruleset 2
>> >     type replicated
>> >     min_size 1
>> >     max_size 10
>> >     step take root
>> >     step chooseleaf firstn 0 type rack
>> >     step emit
>> > }
>> > # end crush map
>> > ---
>> >
>> > Using this CRUSH map, coupled with a default pool data size of 2
>> > (replication 2), I was sure to have a copy of all data in both
>> > "sample buildings" bat0 and bat1.
>> > Then I mounted the filesystem on a client with ceph-fuse, using:
>> > ceph-fuse -m server-2:6789 /mnt/mycephfs (server-2 is located in
>> > bat1). Everything worked as expected: I could write and read data
>> > from one or more clients, no problems there.
>>
>> Just to repeat: CephFS is still in development and can be buggy sometimes.
>>
>> Also, if you do this, make sure you have an active/standby MDS setup
>> where each building has an MDS.
>>
>> > Then I began stress tests. I simulated the loss of one node: no
>> > problem, I could still access the cluster data.
>> > Finally, I simulated the loss of a building (bat0) by bringing down
>> > server-0 and server-1.
The result was a hang of the cluster: no more
>> > access to any data. ceph -s on the surviving nodes hung with:
>> >
>> > 2013-01-17 09:14:18.327911 7f4e5ca70700 0 -- xxx.xxx.xxx.52:0/16543
>> > > > xxx.xxx.xxx.51:6789/0 pipe(0x2c9d490 sd=3 :0 pgs=0 cs=0 l=1).fault
>> >
>> > I searched the net and may have found the answer: the problem comes
>> > from the fact that my rules use "step chooseleaf firstn 0 type rack",
>> > which does give me data replicated in both buildings, but seems to
>> > hang if a building is missing.
>> > I know that geo-replication is currently under development, but is
>> > there a way to do what I'm trying to do without it?
>> > Thanks for your help and answers.
>>
>> Pools nowadays have a "min_size"; if the number of available replicas
>> drops below that, they become incomplete and stop serving I/O.
>>
>> You have to set this to 1 for your 'data' and 'metadata' pools:
>>
>> ceph osd pool set data min_size 1
>> ceph osd pool set metadata min_size 1
>>
>> You might want to test this with plain RADOS instead of the filesystem,
>> just to be sure.
>>
>> Try creating a new pool and use the 'rados' tool to write some data and
>> see if it works when you bring a building down.
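Wido's RADOS test could be sketched roughly as below. This is a hedged sketch, not a verified procedure: the pool name `failtest` and the PG count are arbitrary, and it assumes the Bobtail-era `ceph osd pool set` syntax and that the default data rule (ruleset 0) spreads replicas across both racks.

```shell
# Create a small test pool, replicated once per rack by the data rule,
# and allow I/O to continue with a single surviving replica.
ceph osd pool create failtest 64
ceph osd pool set failtest size 2
ceph osd pool set failtest min_size 1

# Write a test object, then read it back to confirm the pool works.
echo "archive test" > /tmp/obj.txt
rados -p failtest put testobj /tmp/obj.txt
rados -p failtest get testobj /tmp/obj.out

# Now power off server-0 and server-1 (all of bat0) and retry the read;
# with min_size 1 the object should still be readable from bat1.
rados -p failtest get testobj /tmp/obj.out2
```

If the second `get` still hangs, the problem is with the monitors (quorum) rather than with the OSDs and the CRUSH rule.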
>>
>> Wido
>>
>> > Best Regards,
>> >
>> > --
>> > Gomes do Vale Victor
>> > System, Network and Security Engineer
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
>> > More majordomo info at http://vger.kernel.org/majordomo-info.html
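The monitor-removal plan described at the top of this thread (regaining quorum after losing two of three buildings) could be sketched as below. This follows the documented procedure for removing monitors from a cluster that has lost quorum; the monitor IDs (`a`, `b` lost, `c` surviving), the monmap path, and the init-script invocation are assumptions and may differ per setup and Ceph release.

```shell
# Assumption: mon.a and mon.b were in the two lost buildings;
# mon.c survives. Run this on the surviving node.

# Stop the surviving monitor and extract its current monitor map.
service ceph stop mon.c
ceph-mon -i c --extract-monmap /tmp/monmap

# Remove the two unreachable monitors from the map.
monmaptool --rm a --rm b /tmp/monmap

# Inject the trimmed map and restart; mon.c can now form quorum alone.
ceph-mon -i c --inject-monmap /tmp/monmap
service ceph start mon.c
```

Note that the usual `ceph mon remove <name>` command will not work here, since it requires an already-established quorum; that is why the map has to be edited offline with monmaptool.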