On Sat, Jul 2, 2011 at 6:30 AM, Wilfrid Allembrand <wilfrid.allembrand@xxxxxxxxx> wrote:
> Hi everyone,
>
> I'm trying to figure out what is the best OSD solution with an
> infrastructure made up of servers with a lot of disks in each. Say, for
> example, you have 4+ nodes like the Sun Fire X4500 (code-named Thumper).
> Each node has 48 disks.
>
> What are the pros and cons of building a Ceph cluster with btrfs on that
> kind of high-density hardware, considering the different scenarios for
> each server:
> - 1 OSD daemon per disk, so 48 osd daemons per server
> - make 3 btrfs pools of 16 disks, so 3 osd daemons per server
> - make 3 RAID 5 or 6 volumes, so 3 osd daemons per server

As Wido said, you certainly can run in groups. I think what we're seeing
so far is that you want something on the order of a half-core per cosd
(it will usually take much less than that, but CPU usage can get a lot
higher during recovery situations) rather than a tenth of a core. :)
More generally, though, there just aren't enough large long-lived
clusters for us to have the relevant experience to know what's best. :(

> But as the data will be replicated to another node, we can run the
> recovery in the background, can't we? Does the recovery occur when we
> replace the failed node with a new valid one, or does it occur on the
> "surviving nodes of the cluster" immediately after the failure?
> Perhaps we could set the recovery/smartfail process with a priority
> (low/normal/high) and thus control the CPU+IO impact?

Yes, recovery does run in the background, but he's talking about the
sheer amount of data to transfer. If you have 48 OSDs/node (which I
think is probably too many) and you lose a disk, that's 1 disk's worth
of data that needs to be transferred across the network. If you lose 2,
that's 2 disks. If you're running a btrfs pool of 12 disks and you lose
a disk, that's 12 disks that need to be transferred. If you're running a
12-disk RAID and lose 2 disks (or 3 for RAID6, I guess), that's also 12
disks that need to be transferred.

How you judge these risks against the cost of running more daemons is up
to you. I think right now I'd create 2 OSDs per core you have and split
the disks up evenly between them, on either a RAID array or a btrfs
pool, depending on how cutting-edge you want to be. :) But that's my
judgement and isn't based on any risk calculations or anything.
-Greg
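
To make the recovery-traffic comparison above concrete, here is a rough
back-of-the-envelope sketch (not from the original thread; the 1 TB disk
size, 70% fill level, and 1 Gbit/s recovery bandwidth are assumptions)
showing how much data has to cross the network when a single OSD is lost
under each of the three layouts being discussed.

```python
# Back-of-the-envelope recovery traffic when one OSD is lost.
# Assumptions (not from the thread): 1 TB disks, ~70% full,
# and roughly 1 Gbit/s of network bandwidth usable for re-replication.
# Parity overhead of RAID5/6 is ignored to keep the numbers simple.

DISK_TB = 1.0    # assumed raw size of each disk, in TB
FILL = 0.7       # assumed fraction of each disk holding data
NET_GBIT = 1.0   # assumed bandwidth available for recovery, Gbit/s

def recovery_estimate(disks_per_osd, label):
    data_tb = disks_per_osd * DISK_TB * FILL       # data that must re-replicate
    hours = data_tb * 8000 / NET_GBIT / 3600       # TB -> Gbit, seconds -> hours
    print(f"{label:28s} {data_tb:5.1f} TB to move, ~{hours:5.1f} h at {NET_GBIT} Gbit/s")

recovery_estimate(1,  "1 disk per OSD (48/node)")    # lose a disk: 1 disk of data
recovery_estimate(16, "16-disk btrfs pool per OSD")  # lose a disk: whole pool's data
recovery_estimate(12, "12-disk RAID5/6 per OSD")     # lose 2 (or 3) disks: whole array
```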
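
And a similarly rough sketch of the "2 OSDs per core, split the disks
evenly" suggestion. The 4-core figure (two dual-core Opterons in an
X4500) is an assumption about the hardware, not something stated in the
thread.

```python
# Rough sizing for "2 OSDs per core, disks split evenly between them".
# Assumption (not from the thread): the node has 4 cores and all 48
# disks are given to Ceph.

cores_per_node = 4    # assumed core count of the node
disks_per_node = 48   # X4500/Thumper disk count from the thread

osds_per_node = 2 * cores_per_node             # 2 cosd daemons per core
disks_per_osd = disks_per_node // osds_per_node

print(f"{osds_per_node} cosd daemons per node, each backed by a "
      f"{disks_per_osd}-disk RAID array or btrfs pool")
# -> 8 cosd daemons per node, each backed by a 6-disk RAID array or btrfs pool

# Sanity check against the half-core-per-cosd rule of thumb:
cores_needed = osds_per_node * 0.5
assert cores_needed <= cores_per_node, "not enough CPU headroom for recovery"
```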