Re: What would a good OSD node hardware configuration look like?

On 07-11-12 02:35, Dennis Jacobfeuerborn wrote:
On 11/06/2012 08:30 PM, Josh Durgin wrote:
On 11/05/2012 06:49 PM, Dennis Jacobfeuerborn wrote:
On 11/06/2012 01:14 AM, Josh Durgin wrote:
On 11/05/2012 09:13 AM, Dennis Jacobfeuerborn wrote:
Hi,
I'm thinking about building a ceph cluster and I'm wondering what a good
configuration would look like for 4-8 (and maybe more) 2HU 8-disk or 3HU
16-disk systems.
Would it make sense to make each disk an individual OSD, or should I
perhaps create several RAID-0 arrays and create OSDs from those?

This mainly depends on your ratio of disks to CPU/RAM. Generally we
recommend 1 GB of RAM and 1 GHz of CPU per OSD. If you've got enough
CPU/RAM, running 1 OSD per disk is pretty common. It makes recovering
from a single disk failure faster.

So basically a 2 GHz quad-core CPU and 8 GB of RAM would be sufficient
for 8 OSDs?

Yes, although more RAM will be better (providing more page cache).
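
As a quick check of the rule of thumb above (1 GB of RAM and 1 GHz per OSD), here is a minimal Python sketch. The function name and its parameters are made up for illustration, and it treats the total clock across cores as the CPU budget, which is how the 2 GHz quad-core example in this thread is read.

```python
def max_osds(cores, ghz_per_core, ram_gb,
             ghz_per_osd=1.0, ram_gb_per_osd=1.0):
    """OSDs a node can host under the 1 GHz + 1 GB per OSD rule of thumb.

    Hypothetical helper: the CPU budget is cores * clock, and the
    tighter of the CPU and RAM limits wins.
    """
    cpu_limit = int(cores * ghz_per_core / ghz_per_osd)
    ram_limit = int(ram_gb / ram_gb_per_osd)
    return min(cpu_limit, ram_limit)


# The node discussed above: 2 GHz quad-core, 8 GB RAM.
print(max_osds(cores=4, ghz_per_core=2.0, ram_gb=8))  # 8
```

Any RAM beyond this minimum is not wasted, since it ends up as page cache.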

Also what is the best setup for the journal? If I understand it correctly
then each OSD needs its own journal and that should be a separate disk but
that would be quite wasteful it seems. Would it make sense to put in two
small SSD disks in a raid-1 configuration and create a filesystem for each
OSD journal on it?

This is certainly possible. It's a bit less overhead if you give each
osd its own partition of the ssd(s) instead of going through another
filesystem.

I suspect it would be better to not use raid-1, since these ssds will be
receiving all the data the osds write as well. In raid-1 every ssd
receives every journal write, so their lifetimes might be much shorter
than if they were used independently.
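
To put rough numbers on the wear argument, here is a small Python sketch comparing the write load each SSD sees when mirrored versus used independently. The helper name and the 10 MB/s per-OSD write rate are assumptions picked for illustration, not measurements from this thread.

```python
def journal_writes_per_ssd(num_osds, write_mb_s_per_osd, num_ssds, mirrored):
    """Journal write load (MB/s) landing on each SSD.

    Hypothetical model: every OSD writes at the same rate, and journals
    are split evenly when the SSDs are used independently.
    """
    total = num_osds * write_mb_s_per_osd
    if mirrored:
        # RAID-1: each mirror member receives every write.
        return total
    # Independent SSDs: each one only journals its share of the OSDs.
    return total / num_ssds


raid1 = journal_writes_per_ssd(8, 10, num_ssds=2, mirrored=True)
split = journal_writes_per_ssd(8, 10, num_ssds=2, mirrored=False)
print(raid1, split)  # 80 40.0
```

Under these assumptions each SSD in the mirror absorbs twice the writes of an independently used one.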

My primary concern here is fault tolerance. What happens when the journal
disk dies? Can ceph cope with that and write directly to the OSDs or would
that mean that with a single shared disk for all OSDs a failure would mean
the entire system is effectively offline for ceph?

I'm going to point to some messages in the archives to avoid repetition:

http://permalink.gmane.org/gmane.comp.file-systems.ceph.devel/6377

How does the number of OSDs/Nodes affect the performance of say a single
dd operation? Will blocks be distributed over the cluster and written/read
in parallel or does the number only improve concurrency rather than benefit
single threaded workloads?

In cephfs and rbd, objects are distributed over the cluster, but the
OSDs/node ratio doesn't really affect the performance. It's more
dependent on the workload and striping policy. For example, with
a small stripe size, small sequential writes will benefit from more
osds, but the number per node isn't particularly important.
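
The effect of stripe size can be sketched in Python with a simple round-robin striping model. This is an illustration only, not Ceph's actual code, and the parameter names stripe_unit, stripe_count and object_size are assumptions introduced here.

```python
def object_for_offset(offset, stripe_unit, stripe_count, object_size):
    """Map a byte offset to an object index under round-robin striping."""
    stripes_per_object = object_size // stripe_unit
    block = offset // stripe_unit        # which stripe unit the offset is in
    stripe_no = block // stripe_count    # which row of stripe units
    stripe_pos = block % stripe_count    # which object within the row
    object_set = stripe_no // stripes_per_object
    return object_set * stripe_count + stripe_pos


def objects_touched(length, stripe_unit, stripe_count, object_size):
    """Objects hit by a sequential write of `length` bytes from offset 0."""
    return sorted({
        object_for_offset(off, stripe_unit, stripe_count, object_size)
        for off in range(0, length, stripe_unit)
    })


KiB, MiB = 1 << 10, 1 << 20
# 1 MiB sequential write, small stripe unit spread over 4 objects:
print(objects_touched(1 * MiB, 64 * KiB, 4, 4 * MiB))  # [0, 1, 2, 3]
# Same write with one 4 MiB stripe unit per object:
print(objects_touched(1 * MiB, 4 * MiB, 1, 4 * MiB))   # [0]
```

In this model the small-stripe layout spreads a single sequential write over four objects (and so potentially four OSDs), while the large-stripe layout keeps it on one.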

By OSDs/Nodes I really meant "OSDs or nodes" and not the ratio. What I'm
trying to understand is if a) the number of nodes plays a significant role
when it comes to performance (e.g. a 4 node cluster with large disks vs. a
16 node cluster with smaller disks) and b) how much of an impact the number
of OSDs has on the cluster e.g. an 8 node cluster with each node being a
single OSD (with all disks as raid-0) vs. an 8 node cluster with say 64
OSDs (each node with 8 disks as individual OSDs).

Generally, many smaller nodes will recover faster from a node or disk
failure than a few larger nodes, since the remaining OSDs recover in
parallel. There are some other advantages of many small nodes. Wido and
Stefan covered this well in this thread:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/10212
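
A back-of-the-envelope Python sketch of the recovery argument, using the 4-node versus 16-node comparison from the question. The model is a deliberate simplification and entirely an assumption: data is spread evenly, every surviving node recovers at the same fixed rate, and the 64 TB total and 1 TB/hour rate are made-up numbers.

```python
def recovery_hours(total_tb, num_nodes, tb_per_hour_per_node):
    """Rough time to re-replicate the data of one failed node.

    Hypothetical model: the failed node held total_tb / num_nodes, and
    the remaining nodes share the recovery work in parallel.
    """
    lost_tb = total_tb / num_nodes
    survivors = num_nodes - 1
    return lost_tb / (survivors * tb_per_hour_per_node)


few_large = recovery_hours(64, num_nodes=4, tb_per_hour_per_node=1.0)
many_small = recovery_hours(64, num_nodes=16, tb_per_hour_per_node=1.0)
print(f"4 nodes: {few_large:.2f} h, 16 nodes: {many_small:.2f} h")
```

With more nodes there is both less data to recover per failure and more survivors to do the work, so the two effects compound.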


So that sounds like a raid-1 (or potentially a raid-10) is pretty much a
must when using a shared ssd disk for the journals of more than one OSD.
Without redundancy the failure of a single disk (the journal one) would
take down all OSDs on that node, making a multi-OSD-per-node setup pointless.


Except that SSDs will mainly fail due to the number of write cycles they have had to endure.

So in RAID-1, where both SSDs receive the same writes, they will fail at almost the same time.

With for example 8 OSDs in a server you'd better spread their journals 50/50 over two SSDs.
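
The 50/50 split can be sketched in Python as a round-robin assignment of OSD journals to independent SSDs. The function and device names here are hypothetical. The point is that losing one SSD then takes out only the OSDs journaling on it, not the whole node.

```python
def assign_journals(osd_ids, ssds):
    """Round-robin OSD journals over independent SSDs (hypothetical helper)."""
    layout = {ssd: [] for ssd in ssds}
    for i, osd in enumerate(osd_ids):
        layout[ssds[i % len(ssds)]].append(osd)
    return layout


layout = assign_journals(list(range(8)), ["ssd-a", "ssd-b"])
print(layout)
# {'ssd-a': [0, 2, 4, 6], 'ssd-b': [1, 3, 5, 7]}

# OSDs affected if ssd-a dies: half of the node instead of all of it.
print(layout["ssd-a"])  # [0, 2, 4, 6]
```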

Wido

Regards,
   Dennis


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
