Re: reconfiguring existing hardware for ceph use

On Thu, Oct 25, 2012 at 1:40 PM, Jonathan Proulx <jon@xxxxxxxxxxxxx> wrote:
> Hi All,
>
> I have 8 servers available to test ceph on which are a bit
> over-powered/under-disked, and I'm trying to develop a plan for how
> to lay out services and how to populate the available disk slots.
>
> The hardware is dual-socket Intel E5640 chips (8 cores total per
> node) with 48G RAM and dual 10G ethernet, but only four 3.5" SAS
> slots (with a Fusion-MPT controller).
>
> Target application is primarily RBD as the volume storage backend
> for OpenStack (Folsom Cinder), and possibly as the object store for
> Glance.  I'd also like to test CephFS, but I don't have a particular
> use case in mind for it.
>
> The OpenStack cloud this would back is used for research computing
> by a variety of internal research groups and has wildly
> unpredictable workloads.  Volume storage use has not been
> particularly intensive to date, so I don't have a particular
> performance point to hit.
>
> For comparison, the current back end is a single cinder-volume
> server placing volumes on two software RAID6 volumes, each backed by
> 12 2T nearline SAS drives.  Another option we're evaluating is a
> Dell EqualLogic SAN with a mirrored pair of 16x1T-drive RAID6 units.
>
> My first thought is to populate the test systems with a single solid
> state drive (not sure of size or type) to hold the operating system
> and journals, and three 3T SAS drives for the OSD data filesystems,
> running 3 OSDs on each node (one per data disk) with mon and mds
> only on the first 3.

That should be fairly balanced: most modern SSDs can handle (more
than) three write streams at 300-500 MB/s, which is roughly what the
three SAS data drives can manage combined in streaming writes.
Presumably your OS won't actually be doing much disk access once
booted.
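
For what it's worth, a rough sketch of that layout in ceph.conf might
look like the following (device paths, hostnames, and the journal
size are made up, just to illustrate pointing each OSD's journal at
its own partition on the shared SSD):

    [osd]
        osd data = /var/lib/ceph/osd/ceph-$id
        ; one journal partition per OSD, carved out of the shared SSD
        osd journal = /dev/disk/by-partlabel/journal-$id
        ; size in MB; only consulted when the journal is a file
        osd journal size = 10240

    [osd.0]
        host = node1
    [osd.1]
        host = node1
    [osd.2]
        host = node1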

> My second thought is to use 3T drives in all slots, take the OS cut
> off the top of each (probably 16G each, assembled as software RAID10
> for 32G of mirrored space), and run 4 OSDs per node on the remaining
> disk space using internal journals.

This of course provides more space. I'm not so sure you'd want to
take a cut out of each OSD, though; taking the OS space out of just
one OSD and weighting that one lower than the others would probably
make more sense, and then each journal can live as a file or
partition on its own OSD's disk. That should localize the cost of
seeks a bit more, which I intuitively suspect will produce better
results, but somebody with a more data-driven intuition than mine
might disagree.
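
If you do take the OS slice out of just one disk, the lower weight is
easy to express after the fact; a sketch with made-up OSD ids and
weights (using the usual "weight roughly equals usable TB"
convention):

    # the full 3T OSDs keep their nominal weight
    ceph osd crush reweight osd.1 3.0
    # the OSD that gave up space to the OS mirror sits a bit lower
    ceph osd crush reweight osd.0 2.95

You could also just bake the same weights into the CRUSH map when you
first build the cluster.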

> Is either more sane than the other?  Are both so crazy I should just
> use an OS disk and three OSD disks with internal journals?  Any
> better suggestions?

Basically you want to consider whether you need more storage or
better bandwidth and burst IOPS. Since OSDs journal all writes,
(*hands waving wildly*) your burst random IOPS can often be two or
three times what you'll actually get out of the backing disks alone,
which can be quite useful for some applications.
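
If you want numbers instead of my hand-waving, running rados bench
against a scratch pool on each candidate layout will show the effect;
the pool name, run lengths, and object size below are arbitrary:

    # short run: bursts land mostly in the journals
    rados bench -p testpool 10 write -t 16 -b 4096
    # longer run: the backing disks have to keep up
    rados bench -p testpool 120 write -t 16 -b 4096

It's not a perfect stand-in for RBD's random I/O, but the gap between
the short and long runs gives you a feel for how much the journals
are buying you.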