Hi, thanks Wido for your answers.

2011/7/4 Wido den Hollander <wido@xxxxxxxxx>:
> Hi,
>
> On Sat, 2011-07-02 at 15:30 +0200, Wilfrid Allembrand wrote:
>> Hi everyone,
>>
>> I'm trying to figure out what the best OSD layout is for an
>> infrastructure made up of servers with a lot of disks in each. Say, for
>> example, you have 4+ nodes like the Sun Fire X4500 (code-named Thumper).
>> Each node has 48 disks.
>>
>> What are the pros and cons of building a Ceph cluster with btrfs on that
>> kind of high-density hardware, considering the different scenarios for
>> each server:
>> - 1 OSD daemon per disk, so 48 OSD daemons per server
>
> That would be the best option in terms of available storage: you would
> get the maximum usable storage, but you would need a LOT of RAM and
> CPU power.
>
> I'm running 10 nodes with 4 OSDs each on Atoms with 4GB of RAM, and that
> is pretty heavy for those machines, especially when you start to have a
> lot of PGs (Placement Groups) and objects. Recovery then starts to take
> a lot of time and memory.

Yes, I think I'll give it a try with between 24 and 48GB of RAM per server.
But how about CPU? I guess a motherboard with 2 sockets should be enough
(let's say with 2 or more cores per socket).

>> - make 3 btrfs pools of 16 disks, so 3 OSD daemons per server
>> - make 3 RAID 5 or 6 volumes, so 3 OSD daemons per server
>
> You could try making btrfs pools of 12 or 16 disks, whatever you like,
> but you would then add a SPoF: if for some reason btrfs fails (bugs or
> so) you could lose a lot of data, and recovering that could saturate the
> rest of your cluster.
>
> Using software RAID is a second option, but still, why add yet another
> layer?
>
> Running fewer OSDs would mean less memory overhead, but does it really
> matter? I'm not sure. The more data and PGs you start to add, the more
> it will start to stress your OSDs.
>
> The number of PGs is influenced by the number of OSDs, so running fewer
> OSDs means fewer PGs, but how much of a difference does it make?
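To put a rough number on the PG question, here is a hedged sketch (not an official Ceph tool; the function name is mine) of the commonly cited guideline of roughly 100 PGs per OSD, divided by the replica count and rounded up to a power of two:

```python
# Hypothetical sizing sketch, assuming the rule-of-thumb guideline of
# ~100 PGs per OSD. Exact tuning depends on workload and Ceph version.

def estimate_pg_count(num_osds, replicas=2, pgs_per_osd=100):
    """Estimate a total PG count: (OSDs * pgs_per_osd) / replicas,
    rounded up to the next power of two (a common convention)."""
    raw = num_osds * pgs_per_osd // replicas
    power = 1
    while power < raw:
        power *= 2
    return power

# 4 "thumper" nodes with one OSD per disk (48 disks each) = 192 OSDs:
print(estimate_pg_count(4 * 48))   # -> 16384
```

So consolidating 48 disks into 3 btrfs pools per node would shrink this estimate by a factor of 16, which gives a sense of how much the per-OSD memory pressure differs between the two layouts.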
> Not sure.

>> From a performance and management point of view, would you recommend a
>> lot of small servers or a few Thumper-like servers?

> From what I know, get a lot of small machines with, let's say, 4 to 8
> disks. If one fails, the impact on the cluster will be much smaller and
> recovery will take less time.
>
> Think about it: you have 3 "thumpers" with 48TB of storage each and one
> fails. That is going to be a heavy recovery.

But as the data will be replicated to another node, can't we run the
recovery in the background? Does the recovery occur when we replace the
failed node with a new valid one, or does it occur on the surviving nodes
of the cluster immediately after the failure? Perhaps we could set the
recovery/smartfail process to a priority (low/normal/high) and thus control
the CPU+IO impact?

> Wido
>
>> All the best,
>> Wilfrid
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
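On the recovery-priority question: later Ceph releases do expose knobs for throttling recovery relative to client I/O. This is only a hedged sketch; the option names below come from later Ceph versions and should be checked against whatever release you actually run:

```ini
[osd]
    # Assumed option names from later Ceph releases -- verify before use.
    osd max backfills = 1          # limit concurrent backfill operations per OSD
    osd recovery max active = 1    # limit concurrent recovery operations per OSD
    osd recovery op priority = 1   # weight recovery ops lower than client I/O
```

Lower values slow recovery down but keep more CPU and disk bandwidth available for client traffic, which is effectively the low/normal/high priority control asked about above.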