Hi,

While hanging around on the mailing list I noticed that there are a lot of questions about Ceph and possible hardware setups. After reading http://ceph.newdream.net/wiki/Designing_a_cluster I still have a lot of questions, which is why I'm making this post.

In my situation I would like to run Ceph on the cheapest (best bang for buck) hardware available. Think of simple servers with 4 to 6 hard disks (desktop mainboards, CPUs and disks) and building Ceph on top of that. We want to skip the expensive RAID controllers, since they become obsolete once Ceph handles replication at the desired level.

Now we get to the OSD topic:

* One cosd per disk?
* A btrfs stripe across these disks?
* What about journaling?

With a custom CRUSH map ( http://ceph.newdream.net/wiki/Custom_data_placement_with_CRUSH ) you can place data in strategic locations. In my situation I would create 5 pools with 4 OSDs each, where these 5 pools are all located in separate 19" racks. In each rack I would hang:

* 1 MON
* 1 MDS
* 4 OSDs

(See the CRUSH map sketch at the end of this mail for what I have in mind.)

Why 5 pools? Because I would need an odd number of monitors. Yes, I could choose to place only 3 monitors, but I would like to create a pool where all 6 machines are connected to the same switch. Is this reasonable? Or is that many monitors really overdone?

Now, the OSD machines all have 4 to 6 hard disks (but let's stick to 4). I have the option to run one OSD per hard disk, which would give me shorter recovery times when a disk fails, but also extra configuration / administration. I could also choose to make one btrfs stripe over these 4 disks and run a single OSD. That would give me a longer recovery time when a disk fails (since the whole stripe fails), but would keep my config smaller.

In the first setup I would only benefit if I could hot-swap the failed disk. If not, I would have to bring the whole system down, which would take the other 3 OSDs with it, leaving my cluster with 4 fewer OSDs. I could buy more expensive hardware with hot-swap capabilities, but IMHO that is not really what I would like to do with Ceph.

I'd prefer the situation where I stripe over all 4 disks, which gives me an extra advantage: I could configure my node to panic whenever a disk starts giving errors, so my cluster can take over immediately. Am I right? Is this "the way to go"? (Both variants are sketched in the ceph.conf example at the end of this mail.)

Then there is the journaling topic. When creating a filesystem you get a big warning if the drive cache is enabled on the journaling partition. IMHO you don't want a drive cache on your journal, but you do want one on your data partition. This forces you to use a separate disk for your journal. Assuming I have 4 disks in a btrfs stripe, would a fifth disk for journaling only be sufficient? I assume so, since it only has to hold data for a few seconds. But how important is the journal? If I choose not to use one, how big will my penalty be in, let's say, a situation where most of the files are small (webhosting / mailhosting usage)?

I hope someone can answer these questions; it would clarify things for a lot of people. (And it would add an interesting message to the ml ;-) )

Note: I've read http://marc.info/?l=ceph-devel&m=126990365515892&w=2 before, my post is based on that thread.
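To make the rack layout a bit more concrete, below is a rough sketch of the kind of CRUSH map I have in mind, in the decompiled text format from the wiki. Only rack1 with one of its nodes is written out, the devices section is omitted, and the names, ids and weights are placeholders I made up; the exact type names and rule steps may also differ per Ceph version, so please read this as an illustration and not as a tested map.

    # types
    type 0 osd
    type 1 host
    type 2 rack
    type 3 root

    # buckets (node2..node4 and rack2..rack5 would follow the same pattern)
    host node1 {
            id -1
            alg straw
            hash 0
            item osd.0 weight 1.000
    }

    rack rack1 {
            id -5
            alg straw
            hash 0
            item node1 weight 1.000
            item node2 weight 1.000
            item node3 weight 1.000
            item node4 weight 1.000
    }

    root default {
            id -10
            alg straw
            hash 0
            item rack1 weight 4.000
            # item rack2 .. rack5 weight 4.000 go here
    }

    # rule: put each replica in a different rack
    rule data {
            ruleset 0
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type rack
            step emit
    }

The idea is simply that the rack is the failure domain, so losing a whole rack (switch, power) never takes out more than one replica.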
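And for the one-cosd-per-disk versus one-big-stripe question, this is roughly how I picture the two variants in ceph.conf. The hostname, device names and paths are made up, and I took the option names (btrfs devs, osd data, osd journal) from the wiki examples, so correct me if those are outdated.

    ; variant 1: one cosd per disk, four daemons on this node,
    ; journal partitions on a separate fifth disk (/dev/sdf)
    [osd]
            osd data = /data/osd$id

    [osd0]
            host = node1
            btrfs devs = /dev/sdb
            osd journal = /dev/sdf1

    [osd1]
            host = node1
            btrfs devs = /dev/sdc
            osd journal = /dev/sdf2

    [osd2]
            host = node1
            btrfs devs = /dev/sdd
            osd journal = /dev/sdf3

    [osd3]
            host = node1
            btrfs devs = /dev/sde
            osd journal = /dev/sdf4

    ; variant 2: one cosd over a btrfs stripe of all four disks,
    ; with the fifth disk as its single journal
    [osd0]
            host = node1
            btrfs devs = /dev/sdb /dev/sdc /dev/sdd /dev/sde
            osd journal = /dev/sdf1

In variant 1 the journal disk is split into four partitions, one per cosd; in variant 2 the fifth disk holds the single journal, which is where my drive-cache question above comes in.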
--
Kind regards,

Wido den Hollander
Head of System Administration / CSO
PCextreme B.V.
Website: http://www.pcextreme.nl
Knowledge base: http://support.pcextreme.nl/
Network status: http://nmc.pcextreme.nl