SSD journal deployment experiences

On Thu, Sep 4, 2014 at 10:23 PM, Dan van der Ster <daniel.vanderster at cern.ch> wrote:

> Hi Martin,
>
> September 4 2014 10:07 PM, "Martin B Nielsen" <martin at unity3d.com> wrote:
> > Hi Dan,
> >
> > We took a different approach (and our cluster is tiny compared to many
> > others) - we have two pools; normal and ssd.
> >
> > We use 14 disks in each osd-server; 8 platter and 4 ssd for ceph, and 2
> > ssd for OS/journals. We partitioned the two OS ssds as raid1, using
> > about half the space for the OS and leaving the rest on each for 2x
> > journals and unprovisioned space. We've partitioned the OS ssds to each
> > hold 2x platter journals. On top of that, our ssd-pooled disks also hold
> > 2x journals; their own + an additional one from a platter disk. We have
> > 8 osd-nodes.
> >
> > So whenever an ssd fails we lose 2 osds (but never more).
>
> Interesting ... you have quite a few SSDs there per box. I suppose my
> closest config would be 5 platter journals per SSD, plus a
> FileStore+journal to squeeze out every last IOP from our SSDs. That would
> take out 6 OSDs with a failure.
>
> > We've had this system in production for ~1½ years now and so far we've
> > had 1 ssd and 2 platter disks fail. We run a couple of hundred
> > vm-guests on it and use ~60TB.
>
> Which SSD was that?
>

It was an ssd-osd disk, so we lost a disk in each pool. The platter osd
using it as a journal was gone/corrupted and had to be re-instated in the
pool with a full backfill once we replaced the ssd.
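
To make the failure domain concrete, here is a small sketch (the device
names, OSD ids and the helper function are invented for illustration; only
the per-node counts come from the two layouts described in this thread) of
how many OSDs a single SSD failure takes down in each case:

```
# Rough sketch only: names and ids are made up, counts follow the thread.

def osds_lost_per_ssd(journal_map):
    """journal_map: ssd device -> OSD ids that die with it (journal or data)."""
    return {ssd: len(osds) for ssd, osds in journal_map.items()}

# Martin's layout (per node): 8 platter OSDs (0-7) and 4 ssd OSDs (8-11).
# Each of the 2 OS ssds carries 2 platter journals; each ssd OSD carries
# its own journal plus 1 platter journal.
martin = {
    "os-ssd-a": [0, 1],
    "os-ssd-b": [2, 3],
    "ssd-osd-8": [8, 4],
    "ssd-osd-9": [9, 5],
    "ssd-osd-10": [10, 6],
    "ssd-osd-11": [11, 7],
}

# Dan's candidate layout (per node): 4 journal ssds, 5 platter journals
# each, plus a FileStore OSD living on the ssd itself.
dan = {f"ssd-{i}": [f"ssd-osd-{i}"] + [f"platter-osd-{i * 5 + j}" for j in range(5)]
       for i in range(4)}

print(osds_lost_per_ssd(martin))  # every ssd failure costs exactly 2 OSDs
print(osds_lost_per_ssd(dan))     # every ssd failure costs 6 OSDs
```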


>
> > On a daily basis we avg. 30MB/sec r/w and ~600 iops, so not very high
> > usage. The times we lost disks we hardly noticed. All SSDs (OS included)
> > have a general utilization of <5%, platter disks near 10%.
>
> We have peaks up to 7000 iops, but mostly between 4-5000. When we have
> 7000 iops the small write latency inches up to around 70ms :(
>

I've been too used to looking at guest data only :) - if I look at munin we
average ~2000 iops in the cluster. I guess that makes sense with replica x3
and scrubbing. We peak at 5000, but not very often.
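
As a rough sanity check on that multiplier (the ~600 guest iops and ~2000
cluster iops are the figures above; treating the traffic as write-heavy is
an assumption on my part):

```
# Back-of-the-envelope only: reads are not amplified by replication, and
# scrubbing plus FileStore journal writes add more on top.
client_iops = 600   # average guest-visible iops
replicas = 3        # pool size

backend_write_iops = client_iops * replicas
print(backend_write_iops)  # 1800, roughly the ~2000 seen in munin
```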

I checked some of our more io-busy guests; the ones running on platter have
w_await ~15ms and r_await ~15ms. Busy ssd-pooled guests show ~5ms for both
r/w. I would have imagined the w_await to be similar, but it is difficult to
compare the two as they have different workloads. The platter one also uses
very high cpu, which could account for some of the extra wait.


>
> > We did a lot of initial testing about putting journals on the OS-ssds as
> > well as extra journals on the ssd-osds, but we didn't find much
> > difference or high latencies as others have experienced. When/if we
> > notice otherwise we'll prob. switch to pure ssd as journal holders.
> >
> > We originally deployed using saltstack, and even though we have
> > automated replacing disks we still do it manually 'just to be sure'. It
> > takes 5-10min to replace an old disk and get it backfilling, so I don't
> > expect us to spend any time automating this.
> >
> > Recovering 2 disks at once for us takes a long time, but we've
> > intentionally set backfilling low and it is not noticeable on the
> > cluster when it happens.
>
> Yeah, 2 wouldn't be noticeable in our cluster even now. 24 _was_
> noticeable, so maybe 5 is doable.
>

Uff, I'd expect so - I'd imagine it's quite a lot of work for the cluster
to deal with 24 lost osds, no matter how big it is.

Cheers,
Martin


>
> Thanks for the input,
>
> Dan
>
>
> > Anyways, we have pretty low cluster usage, but in our experience the
> > ssds seem to handle the constant load very well.
> >
> > Cheers,
> > Martin
> >
> > On Thu, Sep 4, 2014 at 6:21 PM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
> >
> >> Dear Cephalopods,
> >>
> >> In a few weeks we will receive a batch of 200GB Intel DC S3700's to
> >> augment our cluster, and I'd like to hear your practical experience and
> >> discuss options for how best to deploy these.
> >>
> >> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so
> >> they will become 20 OSDs + 4 SSDs per server. Until recently I've been
> >> planning to use the traditional deployment: 5 journal partitions per
> >> SSD. But as SSD-day approaches, I'm growing less comfortable with the
> >> idea of 5 OSDs going down every time an SSD fails, so perhaps there are
> >> better options out there.
> >>
> >> Before getting into options, I'm curious about the real reliability of
> >> these drives:
> >>
> >> 1) How often are DC S3700's failing in your deployments?
> >> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is
> >> the backfilling which results from an SSD failure? Have you considered
> >> tricks like increasing the down out interval so backfilling doesn't
> >> happen in this case (leaving time for the SSD to be replaced)?
> >>
> >> Beyond the usual 5-partition deployment, is anyone running a RAID1 or
> >> RAID10 for the journals? If so, are you using the raw block devices or
> >> formatting them and storing the journals as files on the SSD array(s)?
> >> Recent discussions seem to indicate that XFS is just as fast as the
> >> block dev, since these drives are so fast.
> >>
> >> Next, I wonder how people with puppet/chef/etc. are handling the
> >> creation/re-creation of the SSD devices. Are you just wiping and
> >> rebuilding all the dependent OSDs completely when the journal dev
> >> fails? I'm not keen on puppetizing the re-creation of journals for
> >> OSDs...
> >>
> >> We also have this crazy idea of failing over to a local journal file in
> >> case an SSD fails. In this model, when an SSD fails we'd quickly create
> >> a new journal either on another SSD or on the local OSD filesystem,
> >> then restart the OSDs before backfilling starts. Thoughts?
> >>
> >> Lastly, I would also consider using 2 of the SSDs in a data pool (with
> >> the other 2 SSDs to hold 20 journals, probably in a RAID1 to avoid
> >> backfilling 10 OSDs when an SSD fails). If the 10:1 journal-to-SSD
> >> ratio would perform adequately, that'd give us quite a few SSDs to
> >> build a dedicated high-IOPS pool.
> >>
> >> I'd also appreciate any other suggestions/experiences which might be
> >> relevant.
> >>
> >> Thanks!
> >> Dan
> >>
> >> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> >>
> >> _______________________________________________
> >> ceph-users mailing list
> >> ceph-users at lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>