Re: Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

I would concur, having spent a lot of time with ZFS on Solaris.

A ZIL device will reduce the fragmentation problem a lot (intent logging no longer happens inside the filesystem itself, where it fragments the block allocations) and write latency will be much better.  I would use different devices for the L2ARC and the ZIL - the ZIL needs to be small and fast for writes, and mirrored (we have used some 16G HGST devices that are designed as ZILs - pricey, but highly recommended) - the L2ARC just needs to be faster for reads than your data disks, so most SSDs would be fine for this.
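
For anyone setting this up, adding the devices looks roughly like this (the pool name "tank" and the device paths are just placeholders - adjust for your own layout):

    # mirrored SLOG - the ZIL moves onto these devices
    zpool add tank log mirror /dev/disk/by-id/ssd-zil-a /dev/disk/by-id/ssd-zil-b
    # single L2ARC device - no redundancy needed, losing it only costs cache
    zpool add tank cache /dev/disk/by-id/ssd-l2arc

The log vdev only ever holds a few seconds of in-flight synchronous writes, so 16G is plenty; the cache device can be as large as you like.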

A 14-disk RAIDZ2 is also going to be very poor for writes, especially with SATA - you effectively get only one disk's worth of write IOPS, because each write has to touch every disk in the vdev.  Without a separate ZIL device you are also losing data-disk write IOPS to intent-log and metadata operations.
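
To put rough numbers on it (assuming ~100 random write IOPS per 7.2k SATA drive, which is only a ballpark figure):

    14-disk RAIDZ2 (one vdev):   ~1 x 100 = ~100 write IOPS for the pool
    7 x 2-way mirror vdevs:      ~7 x 100 = ~700 write IOPS (at the cost of capacity)

ZFS stripes across top-level vdevs, so write IOPS scale with the number of vdevs, not the number of disks - and since Ceph already replicates across nodes, narrower vdevs (or one OSD per disk) are usually a better fit than one wide RAIDZ2.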



> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of
> Patrick Donnelly
> Sent: Wednesday, 11 January 2017 5:24 PM
> To: Kevin Olbrich
> Cc: Ceph Users
> Subject: Re:  Review of Ceph on ZFS - or how not to deploy Ceph
> for RBD + OpenStack
>
> Hello Kevin,
>
> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich <ko@xxxxxxx> wrote:
> > 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700 NVMe
> > journal,
>
> Is the "journal" used as a ZIL?
>
> > We experienced a lot of io blocks (X requests blocked > 32 sec) when a
> > lot of data is changed in cloned RBDs (disk imported via OpenStack
> > Glance, cloned during instance creation by Cinder).
> > If the disk was cloned some months ago and large software updates are
> > applied (a lot of small files) combined with a lot of syncs, we often
> > had a node hit suicide timeout.
> > Most likely this is a problem with op thread count, as it is easy to
> > block threads with RAIDZ2 (RAID6) if many small operations are written
> > to disk (again, COW is not optimal here).
> > When recovery took place (0.020% degraded) the cluster performance was
> > very bad - remote service VMs (Windows) were unusable. Recovery itself
> > was using
> > 70 - 200 mb/s which was okay.
>
> I would think having an SSD ZIL here would make a very large difference.
> Probably a ZIL may have a much larger performance impact than an L2ARC
> device. [You may even partition it and have both but I'm not sure if that's
> normally recommended.]
>
> Thanks for your writeup!
>
> --
> Patrick Donnelly
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


