Newbie Ceph Design Questions

I'm personally interested in running Ceph on some RAID-Z2 volumes with
ZILs.  XFS feels really dated after using ZFS.  I need to check the
progress, but I'm thinking of reformatting one node once Giant comes out.


On Thu, Sep 18, 2014 at 6:36 AM, Christian Balzer <chibi at gol.com> wrote:

>
> Hello,
>
> On Thu, 18 Sep 2014 13:07:35 +0200 Christoph Adomeit wrote:
>
>
> > Presently we use Solaris ZFS Boxes as NFS Storage for VMs.
> >
> That sounds slower than I would expect Ceph RBD to be in nearly all cases.
>
> Also, how do you replicate the filesystems to cover for node failures?
>

I have used zfs snapshots and zfs send/receive from cron.  It's not live
replication, but it's fast enough that I could run it every 5 minutes, and
maybe even every minute.
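
For anyone curious, a minimal sketch of that kind of cron job, assuming a
dataset named tank/vms, a standby host called "standby" and passwordless SSH
between the boxes (all of those names are made up):

#!/usr/bin/env python
# Rough sketch of cron-driven ZFS replication: take a snapshot, then send
# the delta since the previous snapshot to a standby box. Dataset name,
# remote host and snapshot prefix are hypothetical.
import subprocess
import time

DATASET = "tank/vms"   # hypothetical dataset holding the VM images
REMOTE = "standby"     # hypothetical receiving host
PREFIX = "repl-"

def repl_snapshots():
    """List this dataset's replication snapshots, oldest first."""
    out = subprocess.check_output(
        ["zfs", "list", "-H", "-t", "snapshot", "-o", "name",
         "-s", "creation", "-r", DATASET])
    return [s for s in out.decode().splitlines()
            if s.startswith(DATASET + "@" + PREFIX)]

def replicate():
    snaps = repl_snapshots()
    prev = snaps[-1] if snaps else None
    new = "%s@%s%d" % (DATASET, PREFIX, int(time.time()))
    subprocess.check_call(["zfs", "snapshot", new])

    # Incremental send if a previous snapshot exists, full send otherwise;
    # pipe it straight into zfs receive over SSH.
    send = ["zfs", "send"] + (["-i", prev] if prev else []) + [new]
    recv = ["ssh", REMOTE, "zfs", "receive", "-F", DATASET]
    sender = subprocess.Popen(send, stdout=subprocess.PIPE)
    receiver = subprocess.Popen(recv, stdin=sender.stdout)
    sender.stdout.close()
    if receiver.wait() != 0 or sender.wait() != 0:
        raise RuntimeError("replication of %s failed" % new)

if __name__ == "__main__":
    replicate()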


> > Next question: I read that in Ceph an OSD is marked invalid as
> > soon as its journaling disk is invalid. So what should I do? I don't
> > want to use one journal disk for each OSD. I also don't want to use
> > a journal disk per 4 OSDs, because then I will lose 4 OSDs if an SSD
> > fails. Using journals on the OSD disks, I am afraid, will be slow.
> > Again I am afraid of slow Ceph performance compared to ZFS, because
> > ZFS supports ZIL write cache disks.
> >
> I don't do ZFS, but it is my understanding that losing the ZIL cache
> (presumably on an SSD for speed reasons) will also potentially lose you
> the latest writes. So not really all that different from Ceph.
>

ZFS will lose only the data that was in the ZIL but not yet on the main
disks.  It requires admin intervention to tell ZFS to forget about the lost
data.  ZFS will still allow you to read/write any data that was already on
the disks.
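
For reference, that intervention is roughly the following (a sketch, with a
made-up pool name): if the box went down while the separate log device was
lost, you import the pool explicitly accepting that the unreplayed intent
log is gone.

#!/usr/bin/env python
# Sketch of the "admin intervention" for a pool whose separate ZIL/log
# device died while the pool was offline. The pool name is hypothetical;
# everything already on the main disks stays readable and writable.
import subprocess

POOL = "tank"  # hypothetical pool name

# 'zpool import -m' allows the import despite the missing log device;
# ZFS discards the intent-log records it can no longer replay.
subprocess.check_call(["zpool", "import", "-m", POOL])

# The dead log device can then be dropped from the pool config with
# something like: zpool remove tank <old-log-device>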



> In that scenario losing even 4 OSDs due to a journal SSD failure would
> not be the end of the world by a long shot. Never mind that if you're using
> the right SSDs (Intel DC S3700, for example) you're unlikely to ever
> experience such a failure.
> And even if so, there are again plenty of discussions in this ML on how to
> mitigate the effects of such a failure (in terms of replication traffic and
> its impact on cluster performance; data redundancy should really never
> be the issue).
>

The main issue is performance during recovery.  You really don't want
recovery to affect more than a few percent of your OSDs, otherwise you'll
start having latency problems.  If losing a single SSD takes out 20% of
your OSDs, recovery is going to hurt.  If losing a single SSD only takes out
1% of your OSDs, don't worry about it.
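
To put numbers on it (the cluster sizes below are invented; the ratios are
the point):

# Back-of-the-envelope: the exposure is simply journals-per-SSD divided by
# total OSDs, and roughly that same fraction of the stored data has to
# re-replicate after the failure.

def ssd_failure_impact(total_osds, osds_per_journal_ssd):
    """Fraction of OSDs lost when one journal SSD dies."""
    return float(osds_per_journal_ssd) / total_osds

# 20 OSDs, 4 journals per SSD: one dead SSD takes out 20% of the cluster.
print("%.0f%%" % (100 * ssd_failure_impact(20, 4)))   # -> 20%

# 400 OSDs, 4 journals per SSD: the same failure is only 1%.
print("%.0f%%" % (100 * ssd_failure_impact(400, 4)))  # -> 1%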

The Anti-Cephalopod discussion here was pretty lively, with a lot of good
info.


>
> > Last Question: Someone told me Ceph snapshots are slow. Is this true?
> > I always thought making a snapshot is just moving around some pointers
> > to data.
> >
> No idea, I don't use them.
> But from what I gather the DELETION of them (like that of RBD images) is a
> rather resource intensive process, not the creation.
>

The snapshots themselves aren't particularly slow, but the cleanup when
using XFS is pretty painful.

Ceph on XFS has to emulate snapshots, since XFS isn't copy-on-write.  Ceph
on BtrFS uses BtrFS's native snapshots, so they're much faster.

I believe Ceph on ZFS just got ZFS snapshot support.  That's what I'm
waiting for before I start testing.
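
For what it's worth, from the Python rbd bindings it looks roughly like
this; pool, image and snapshot names are made up, and the point is that
creating the snapshot is a cheap metadata operation while removing it is
what triggers the expensive cleanup on the OSDs:

# Sketch using the python-rados / python-rbd bindings. Names are hypothetical.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")         # hypothetical pool
image = rbd.Image(ioctx, "vm-disk-01")    # hypothetical RBD image

try:
    image.create_snap("before-upgrade")   # fast: essentially a metadata update
    for snap in image.list_snaps():
        print(snap["name"])
    image.remove_snap("before-upgrade")   # the slow part: the OSDs have to
                                          # clean up the snapshotted data
finally:
    image.close()
    ioctx.close()
    cluster.shutdown()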


>
> > And very last question: What about btrfs, still not recommended ?
> >
> Definitely not from where I'm standing.
> Between the inherent disadvantage of using BTRFS (CoW, thus fragmentation
> galore) for VM storage and the actual bugs people run into, I don't think
> it ever will be.
>

My personal opinion is that BtrFS is dead, but nobody is willing to say it
out loud.  Oracle was driving BtrFS development, and now they own ZFS.  Why
bother?

COW fragmentation is a problem.  ZFS gets around it by telling you never to
fill the disks more than 80%.  Writes slow down once you hit 80%, and they
get progressively slower the closer you get to 100%.  I can live with that;
I'll just set the nearfull ratio to 75%.
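
If anyone wants to do the same, this is roughly what I have in mind,
assuming the stock mon_osd_nearfull_ratio option (set persistently in
ceph.conf under [mon], pushed to the running monitors via injectargs;
double-check the spelling against your release's docs):

# Sketch of lowering the nearfull warning threshold to 75% on the running
# monitors. The persistent equivalent in ceph.conf would be:
#   [mon]
#   mon osd nearfull ratio = .75
# Note nearfull only warns; the separate full ratio is what blocks writes.
import subprocess

subprocess.check_call(
    ["ceph", "tell", "mon.*", "injectargs", "--mon_osd_nearfull_ratio=0.75"])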



> I venture that Key/Value store systems will be both faster and more
> reliable than BTRFS within a year or so.
>

Also very interesting.

