SSD journal deployment experiences

martin@xxxxxxxxxxx (Martin B Nielsen) · Thu, 4 Sep 2014 22:07:01 +0200

Hi Dan,

We took a different approach (and our cluster is tiny compared to many
others): we have two pools, normal and ssd.

We use 14 disks in each osd-server: 8 platter and 4 ssd for ceph, and 2 ssd
for OS/journals. The two OS ssds are partitioned as raid1 for the OS, which
takes about half the space; the rest of each OS ssd holds 2x platter
journals and is otherwise left unprovisioned. On top of that, each of our
ssd pool disks also holds 2x journals: its own plus one from a platter
disk. We have 8 osd-nodes.

So whenever an ssd fails we lose 2 osds (but never more).
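
The "2 osds per ssd failure" claim can be checked with a small model of one osd-server. This is only a sketch of the layout described above: the device names are made up for illustration, and which platter journals sit on which ssd is an assumption, since only the counts per ssd are given.

```python
from collections import defaultdict


def build_node_layout():
    """Return {osd_name: journal_device} for one osd-server.

    Names are hypothetical; the counts follow the description above:
    8 platter osds, 4 ssd-pool osds, 2 OS ssds.
    """
    platters = [f"platter{i}" for i in range(8)]
    pool_ssds = [f"ssd{i}" for i in range(4)]
    os_ssds = ["os-ssd0", "os-ssd1"]

    journal_of = {}

    # Each OS ssd carries the journals of two platter osds (assumed pairing).
    for i, platter in enumerate(platters[:4]):
        journal_of[platter] = os_ssds[i // 2]

    # Each pool ssd carries its own journal plus one platter journal.
    for ssd, platter in zip(pool_ssds, platters[4:]):
        journal_of[ssd] = ssd
        journal_of[platter] = ssd

    return journal_of


def osds_lost_per_ssd(journal_of):
    """Group osds by journal device: losing that device takes them all down."""
    lost = defaultdict(set)
    for osd, device in journal_of.items():
        lost[device].add(osd)
    return dict(lost)


if __name__ == "__main__":
    for device, osds in sorted(osds_lost_per_ssd(build_node_layout()).items()):
        print(device, "->", sorted(osds))
```

The model only covers ssd failures; a failed platter disk takes down just its own osd.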

We've had this system in production for ~1? year now and so far we've had 1
ssd and 2 platter disks fail. We run a couple of hundred vm-guests on it and
use ~60TB.

On a daily basis we avg. 30MB/sec r/w and ~600 iops so not very high usage.
The times we lost disks we hardly noticed. All SSD (OS included) have a
general utilization of <5%, platter disks near 10%.

We did a lot of initial testing of putting journals on the OS-ssd as well
as extra ones on the ssd-osds, but we didn't find much difference, nor the
high latencies others have experienced. If we notice otherwise we'll
probably switch to ssds used purely as journal holders.

We originally deployed using saltstack and even though we have automated
replacing disks we still do it manually 'just to be sure'. It takes 5-10min
to replace an old disk and get it backfilling, so I don't expect us to
spend any time automating this.

Recovering 2 disks at once for us takes a long time but we've intentionally
set backfilling low and it is not noticeable on the cluster when it happens.
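
Which settings were lowered is not stated above. As an assumption, the sketch below uses the two usual Ceph knobs for this, `osd max backfills` and `osd recovery max active`, and builds the `ceph tell ... injectargs` command that applies them to all osds at runtime. The helper name and the value 1 are illustrative, not our actual configuration.

```python
import shlex


def throttle_backfill_command(max_backfills=1, recovery_max_active=1):
    """Build an argv list that lowers backfill/recovery concurrency on all osds.

    The default values are assumptions for illustration only.
    """
    if max_backfills < 1 or recovery_max_active < 1:
        raise ValueError("both limits must be at least 1")
    injected = (
        f"--osd-max-backfills {max_backfills} "
        f"--osd-recovery-max-active {recovery_max_active}"
    )
    return ["ceph", "tell", "osd.*", "injectargs", injected]


if __name__ == "__main__":
    # Print a shell-quoted command line for review before running it by hand.
    print(shlex.join(throttle_backfill_command()))
```

Settings injected this way do not survive an osd restart, so the same values would also need to go into ceph.conf to be permanent.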

Anyways, we have pretty low cluster usage but in our experience ssd seem to
handle the constant load very well.

Cheers,
Martin

On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster <daniel.vanderster at cern.ch>
wrote:

> Dear Cephalopods,
>
> In a few weeks we will receive a batch of 200GB Intel DC S3700's to
> augment our cluster, and I'd like to hear your practical experience and
> discuss options for how best to deploy these.
>
> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
> they will become 20 OSDs + 4 SSDs per server. Until recently I've been
> planning to use the traditional deployment: 5 journal partitions per SSD.
> But as SSD-day approaches, I'm growing less comfortable with the idea of 5
> OSDs going down every time an SSD fails, so perhaps there are better
> options out there.
>
> Before getting into options, I'm curious about the real reliability of
> these drives:
>
> 1) How often are DC S3700's failing in your deployments?
> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
> backfilling which results from an SSD failure? Have you considered tricks
> like increasing the down out interval so backfilling doesn't happen in this
> case (leaving time for the SSD to be replaced)?
>
> Beyond the usual 5-partition deployment, is anyone running a RAID1 or
> RAID10 for the journals? If so, are you using the raw block devices or
> formatting it and storing the journals as files on the SSD array(s)? Recent
> discussions seem to indicate that XFS is just as fast as the block dev,
> since these drives are so fast.
>
> Next, I wonder how people with puppet/chef/? are handling the
> creation/re-creation of the SSD devices. Are you just wiping and rebuilding
> all the dependent OSDs completely when the journal dev fails? I'm not keen
> on puppetizing the re-creation of journals for OSDs...
>
> We also have this crazy idea of failing over to a local journal file in
> case an SSD fails. In this model, when an SSD fails we'd quickly create a
> new journal either on another SSD or on the local OSD filesystem, then
> restart the OSDs before backfilling started. Thoughts?
>
> Lastly, I would also consider using 2 of the SSDs in a data pool (with the
> other 2 SSDs to hold 20 journals - probably in a RAID1 to avoid backfilling
> 10 OSDs when an SSD fails). If the 10-1 ratio of SSDs would perform
> adequately, that'd give us quite a few SSDs to build a dedicated high-IOPS
> pool.
>
> I'd also appreciate any other suggestions/experiences which might be
> relevant.
>
> Thanks!
> Dan
>
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --