SSD journal deployment experiences

We are still pretty early in our testing of how best to use SSDs as well.
What we are trying right now, for some of the reasons you already mentioned,
is to use bcache as a cache for both the journal and the data. Our boxes have
10 spindles and 2 SSDs. We created two bcache cache sets (one per SSD) and put
five spindles behind each, with the journal kept as a plain file on the
bcache-backed spindle (since the journal is hot, it should stay resident in
the SSD cache). This should have the advantage that if an SSD fails, bcache
can automatically fall back to write-through mode (although I don't think
that helps if the SSD dies suddenly). However, it seems that if any part of
the journal is lost, the OSD is toast and needs to be rebuilt. Bcache was
appealing to us because one SSD can front multiple backend disks and so makes
the most efficient use of the SSD; it also has a write-around policy for
large sequential writes, so the cache is not evicted by the kind of streaming
writes that spindles already handle well. Since we get a high read-cache hit
rate from KVM and the other layers, this is primarily intended to accelerate
writes rather than reads (our environment is also fairly write-heavy).
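
To make the layout concrete, the provisioning looks roughly like the sketch
below. The device names are invented for the example, and the udev
registration, cache-set attach and writeback-mode steps are glossed over:

    # Rough sketch of the bcache layout described above: one cache set per SSD,
    # five spindles behind each. Device names here are examples only.
    import subprocess

    SSDS = ["/dev/sdk", "/dev/sdl"]                     # the 2 SSDs (example names)
    SPINDLES = ["/dev/sd%s" % c for c in "abcdefghij"]  # the 10 spindles (example names)

    def run(cmd):
        print("+ " + " ".join(cmd))
        subprocess.check_call(cmd)

    for i, ssd in enumerate(SSDS):
        run(["make-bcache", "-C", ssd])                 # SSD becomes a cache device
        for disk in SPINDLES[i * 5:(i + 1) * 5]:
            run(["make-bcache", "-B", disk])            # spindle becomes a backing device
    # Each backing disk then shows up as /dev/bcacheN; attaching it to its SSD's
    # cache set (echo the cset UUID into /sys/block/bcacheN/bcache/attach) and
    # switching cache_mode to writeback is what puts journal writes on the SSD.

The journal then just lives at its default path on each OSD's filesystem, so
it gets cached the same way as any other hot file.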

So far it seems to help, but we are going to start more in-depth testing
soon. One drawback is that bcache devices don't seem to like partitions, so
we have created the OSDs manually instead of using ceph-deploy.
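
The manual bring-up is roughly the standard "add an OSD the long way"
procedure; something like the sketch below, where the device name, weight and
hostname are placeholders:

    # Rough sketch of manually creating one OSD on a whole bcache device,
    # with the journal left as a plain file on the OSD filesystem.
    import subprocess

    def run(cmd):
        print("+ " + " ".join(cmd))
        return subprocess.check_output(cmd).decode().strip()

    dev = "/dev/bcache0"                        # example device
    osd_id = run(["ceph", "osd", "create"])     # allocates the next OSD id
    osd_dir = "/var/lib/ceph/osd/ceph-" + osd_id
    run(["mkfs.xfs", "-f", dev])
    run(["mkdir", "-p", osd_dir])
    run(["mount", dev, osd_dir])
    run(["ceph-osd", "-i", osd_id, "--mkfs", "--mkjournal", "--mkkey"])
    run(["ceph", "auth", "add", "osd." + osd_id,
         "osd", "allow *", "mon", "allow rwx",
         "-i", osd_dir + "/keyring"])
    run(["ceph", "osd", "crush", "add", "osd." + osd_id, "1.0", "host=examplehost"])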

I too am interested in others' experience with SSDs and with trying to
cache/accelerate Ceph. I think the cache pool will be the best option in the
long run, but it still needs some performance tweaking for small reads before
it will be really viable for us.
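
For anyone who hasn't looked at cache tiering yet, wiring a cache pool in
front of an existing pool is roughly the following; pool names, PG counts and
settings are just placeholders:

    # Rough sketch: put an SSD-backed pool in front of an existing pool ("rbd"
    # here) as a writeback cache tier. Names and values are examples only.
    import subprocess

    def ceph(*args):
        print("+ ceph " + " ".join(args))
        subprocess.check_call(("ceph",) + args)

    ceph("osd", "pool", "create", "ssd-cache", "128", "128")  # on an SSD CRUSH rule
    ceph("osd", "tier", "add", "rbd", "ssd-cache")            # attach to the base pool
    ceph("osd", "tier", "cache-mode", "ssd-cache", "writeback")
    ceph("osd", "tier", "set-overlay", "rbd", "ssd-cache")    # redirect client I/O
    ceph("osd", "pool", "set", "ssd-cache", "hit_set_type", "bloom")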

Robert LeBlanc


On Thu, Sep 4, 2014 at 10:21 AM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:

> Dear Cephalopods,
>
> In a few weeks we will receive a batch of 200GB Intel DC S3700s to
> augment our cluster, and I'd like to hear your practical experience and
> discuss options for how best to deploy them.
>
> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
> they will become 20 OSDs + 4 SSDs per server. Until recently I've been
> planning to use the traditional deployment: 5 journal partitions per SSD.
> But as SSD-day approaches, I'm growing less comfortable with the idea of 5
> OSDs going down every time an SSD fails, so perhaps there are better
> options out there.
>
> Before getting into options, I'm curious about the real-world reliability
> of these drives:
>
> 1) How often are DC S3700s failing in your deployments?
> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the
> backfilling which results from an SSD failure? Have you considered tricks
> like increasing the down out interval so backfilling doesn't happen in this
> case (leaving time for the SSD to be replaced)?
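> For reference, what I have in mind is something like the one-liner below;
> the interval value is purely illustrative, and we'd revert it once the SSD
> has been replaced:
>
>     # Example only: raise mon_osd_down_out_interval (seconds) so down OSDs are
>     # not marked out, and therefore not backfilled, before the SSD is swapped.
>     import subprocess
>     subprocess.check_call(["ceph", "tell", "mon.*", "injectargs",
>                            "--mon_osd_down_out_interval=14400"])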
>
> Beyond the usual 5-partition deployment, is anyone running a RAID1 or
> RAID10 for the journals? If so, are you using the raw block devices or
> formatting them and storing the journals as files on the SSD array(s)?
> Recent discussions seem to indicate that XFS is just as fast as the raw
> block dev, since these drives are so fast.
>
> Next, I wonder how people with puppet/chef/… are handling the
> creation/re-creation of the SSD devices. Are you just wiping and rebuilding
> all the dependent OSDs completely when the journal dev fails? I'm not keen
> on puppetizing the re-creation of journals for OSDs...
>
> We also have this crazy idea of failing over to a local journal file in
> case an SSD fails. In this model, when an SSD fails we'd quickly create a
> new journal either on another SSD or on the local OSD filesystem, then
> restart the OSDs before backfilling starts. Thoughts?
>
> Lastly, I would also consider using 2 of the SSDs in a data pool (with the
> other 2 SSDs holding the 20 journals, probably in a RAID1 to avoid
> backfilling 10 OSDs when an SSD fails). If that 10-to-1 OSD-to-journal-SSD
> ratio would perform adequately, it'd give us quite a few SSDs to build a
> dedicated high-IOPS pool.
>
> I'd also appreciate any other suggestions/experiences which might be
> relevant.
>
> Thanks!
> Dan
>
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
>
>

