Re: pros/cons of multiple OSDs per host

On Mon, Aug 21, 2017 at 1:58 PM, Christian Balzer <chibi@xxxxxxx> wrote:
On Mon, 21 Aug 2017 13:40:29 +0800 Nick Tan wrote:

> Hi all,
>
> I'm in the process of building a Ceph cluster, primarily to use CephFS.  At
> this stage I'm in the planning phase and doing a lot of reading on best
> practices for building the cluster; however, there's one question that I
> haven't been able to find an answer to.
>
> Is it better to use many hosts with single OSDs, or fewer hosts with
> multiple OSDs?  I'm looking at using 8 or 10TB HDDs as OSDs and hosts
> with up to 12 HDDs.  If a host dies, that means up to 120TB of data will
> need to be recovered if the host has 12 x 10TB HDDs.  But if smaller hosts
> with single HDDs are used, then a single host failure will result in only a
> maximum of 10TB to be recovered, so in this case it looks better to use
> smaller hosts with single OSDs if the failure domain is the host.
>
> Are there other benefits or drawbacks of using many small servers with
> single OSDs vs. fewer large servers with lots of OSDs?
>

Ideally you'll have smallish hosts with smallish disks (not 10TB monsters),
both to reduce the impact an OSD or host loss would have and to improve
IOPS (more spindles).

With larger hosts you'll also want to make sure that a single host failure
is not going to create a "full" (and thus unusable) cluster, quite apart from
the I/O strain that recovery will cause.
5 to 10 hosts is a typical starting point.
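To put a rough number on that, some back-of-the-envelope arithmetic (the
5-host, 12 x 10TB layout and the default 0.95 full ratio are assumptions
for the sake of the example, not a recommendation):

    5 hosts x 12 x 10TB            = 600TB raw
    raw remaining after losing 1   = 480TB
    480TB x 0.95 full ratio        = ~456TB raw usable after recovery

    456TB / 600TB = ~76%, so raw utilisation has to stay below roughly
    three quarters, or recovering from a single host loss will push OSDs
    toward "full" and the cluster will stop accepting writes.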

Also important to remember is the configuration parameter
"mon_osd_down_out_subtree_limit = host",
since repairing a large host is likely to be faster than replicating all
the data it held.
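For reference, a minimal sketch of what that looks like in ceph.conf (treat
the exact section placement and injection syntax as illustrative for your
release):

    [mon]
        # Don't automatically mark OSDs "out" when a whole host (or anything
        # larger) goes down; the assumption is that the host gets repaired
        # rather than having all of its data re-replicated.
        mon_osd_down_out_subtree_limit = host

You should also be able to push it to running monitors with something along
the lines of
"ceph tell mon.* injectargs '--mon-osd-down-out-subtree-limit host'".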

Of course "ideally" tends to mean "expensive" in most cases and this is
no exception.

Smaller hosts are more expensive in terms of space and parts (a NIC for
each OSD instead of one per 12, etc).
And before you mention really small hosts with 1GbE NICs: the latency
penalty there is significant, and the limitation to roughly 100MB/s is more
of an issue with reads than writes.

Ultimately you need to balance your budget, rack space and needs.


Thanks Christian.  The tip about the "mon_osd_down_out_subtree_limit = host" setting is very useful.  If we go down the path of large servers (12+ disks), my intention is to keep a spare empty chassis on hand, so in the case of a server failure I could move the disks into the spare chassis and bring them back online, which would be much faster than re-replicating 12 OSDs.  That was my main concern with the large servers, and this helps alleviate it.  Thanks!
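For what it's worth, the rough plan for such a swap would be something like
the following (exact commands illustrative; it assumes the OSD IDs and data,
including any journal/DB devices, move with the disks):

    # stop the cluster from marking the OSDs out / rebalancing while we work
    ceph osd set noout

    # ...move the disks into the spare chassis, boot it, start the OSDs...
    systemctl start ceph-osd.target

    # verify everything is back up/in, then re-enable normal behaviour
    ceph osd tree
    ceph osd unset noout

(With mon_osd_down_out_subtree_limit = host already set, the noout is
arguably belt and braces, but it shouldn't hurt during planned maintenance.)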


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
