Thanks Greg and Wido for those highlights. Very helpful!

Cheers,
Wilfrid

2011/7/5 Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx>:
> On Sat, Jul 2, 2011 at 6:30 AM, Wilfrid Allembrand
> <wilfrid.allembrand@xxxxxxxxx> wrote:
>> Hi everyone,
>>
>> I'm trying to figure out what the best OSD layout is for an
>> infrastructure made up of servers with a lot of disks in each. Say, for
>> example, you have 4+ nodes like the Sun Fire X4500 (code-named Thumper).
>> Each node has 48 disks.
>>
>> What are the pros and cons of building a Ceph cluster with btrfs on
>> that kind of high-density hardware, considering the different
>> scenarios for each server:
>> - 1 OSD daemon per disk, so 48 OSD daemons per server
>> - 3 btrfs pools of 16 disks, so 3 OSD daemons per server
>> - 3 RAID 5 or 6 volumes, so 3 OSD daemons per server
>
> As Wido said, you certainly can run them in groups. I think what we're
> seeing so far is that you want more on the order of a half-core per
> cosd than a tenth of a core (it will usually use much less than that,
> but CPU usage can get much higher during group recovery situations). :)
> More generally, though, there just aren't enough large, long-lived
> clusters for us to have the relevant experience to know what's best.
> :(
>
>> But as the data will be replicated to another node, we can run the
>> recovery in the background, can't we? Does the recovery occur when we
>> replace the failed node with a new valid one, or does it occur on the
>> "surviving nodes of the cluster" immediately after the failure?
>> Perhaps we could set the recovery/smartfail process to a priority
>> (low/normal/high) and thus control the CPU+IO impact?
>
> Yes, recovery does run in the background, but he's talking about the
> sheer amount of data to transfer. If you have 48 OSDs/node (which I
> think is probably too many) and you lose a disk, that's 1 disk that
> needs to be transferred across the network. If you lose 2, that's 2
> disks. If you're running a btrfs pool of 12 disks and you lose a disk,
> that's 12 disks that need to be transferred. If you're running a
> 12-disk RAID and lose 2 disks (or 3 for RAID6, I guess), that's 12
> disks that need to be transferred.
>
> How you judge these risks versus the cost of running more daemons is
> up to you. I think right now I'd create 2 OSDs per core you have and
> split the disks up evenly among them, each on either a RAID array or a
> btrfs pool, depending on how cutting-edge you want to be. :) But that's
> my judgement and isn't based on any risk calculations or whatever.
> -Greg
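
For anyone else weighing the same trade-off: below is a minimal ceph.conf
sketch of the layout Greg suggests (a few cosd daemons per host, each backed
by a group of disks). The hostname, device paths, core count and journal
settings are just placeholders made up for illustration, and I'm going from
memory on the "btrfs devs" option that mkcephfs uses to build a btrfs pool
from the listed devices, so please double-check it against the current docs.

; Sketch only: assumes a 4-core, 48-disk node ("thumper1"), so with Greg's
; rule of thumb of ~2 OSDs per core we get 8 cosd daemons of 6 disks each.
[osd]
        osd data = /data/osd.$id
        osd journal = /data/osd.$id/journal
        osd journal size = 1000        ; MB

[osd.0]
        host = thumper1
        ; mkcephfs should turn these six devices into one btrfs pool
        btrfs devs = /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg

[osd.1]
        host = thumper1
        btrfs devs = /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm

; ... osd.2 through osd.7 continue the pattern for the remaining disks,
; and the next node starts at osd.8.

With that layout, Greg's point about recovery traffic becomes concrete:
losing any one of those 6-disk pools means re-replicating six disks' worth
of data across the network, whereas 48 single-disk OSDs would only
re-replicate one disk's worth per failure, at the cost of many more daemons
per node.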