Re: Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack

On 11-1-2017 08:06, Adrian Saul wrote:
> 
> I would concur having spent a lot of time on ZFS on Solaris.
> 
> ZIL will reduce the fragmentation problem a lot (because it is not
> doing intent logging into the filesystem itself which fragments the
> block allocations) and write response will be a lot better.  I would
> use different devices for L2ARC and ZIL - ZIL needs to be small and
> fast for writes (and mirrored - we have used some HGST 16G devices
> which are designed as ZILs - pricy but highly recommend) - L2ARC just
> needs to be faster for reads than your data disks, most SSDs would be
> fine for this.

I have been using ZFS on FreeBSD ever since 2006, and I really like it,
other than that it does not scale horizontally.

Ceph does a lot of sync()-type calls.
If you do not have a separate ZIL device on SSD, then ZFS keeps the ZIL
on the HDDs in the pool for the sync() writes.
Most of the documentation then talks about using an SSD ZIL to reliably
speed up NFS, but it actually helps ANY sync() operation.

> A 14 disk RAIDZ2 is also going to be very poor for writes especially
> with SATA - you are effectively only getting one disk worth of IOPS
> for write as each write needs to hit all disks.  Without a ZIL you
> are also losing out on write IOPS for ZIL and metadata operations.

I would definitely not have used a RAIDZ2 if speed is of the utmost
importance. It has its advantages, but now you are using both ZFS's
redundancy AND the redundancy that is in Ceph.
So that is 2 extra HDDs in ZFS, and then on top of that the Ceph redundancy.

I haven't tried a large cluster yet, but if money allows it, my choice
would be 2 disk mirrors per OSD in a vdev pool, used with a ZIL
on SSD. This gives you 2x the write IOPS of the disks.
Using the RAID types does not give you much extra speed when there
are more spindles.
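
To put rough numbers on this, here is a minimal sketch in Python. It
assumes the common rule of thumb that each vdev (mirror or RAIDZ) delivers
about one disk's worth of random-write IOPS, so a pool scales with the
number of vdevs rather than the number of disks. The per-disk figure and
the example layouts are hypothetical, not measurements from this cluster.

```python
# Rule-of-thumb estimate only: random-write IOPS of a ZFS pool scale roughly
# with the number of vdevs, since each vdev behaves like about one disk.
# PER_DISK_IOPS is a hypothetical figure, not a measurement.
PER_DISK_IOPS = 100


def estimated_write_iops(vdev_count: int, per_disk_iops: int = PER_DISK_IOPS) -> int:
    """Return a rough random-write IOPS estimate for a pool of vdev_count vdevs."""
    if vdev_count < 1:
        raise ValueError("a pool needs at least one vdev")
    return vdev_count * per_disk_iops


# Hypothetical layouts: name -> number of vdevs in the pool.
layouts = {
    "1 x 14-disk RAIDZ2": 1,
    "2 x 2-disk mirror": 2,
    "7 x 2-disk mirror": 7,
}

for name, vdevs in layouts.items():
    print(f"{name}: ~{estimated_write_iops(vdevs)} write IOPS")
```

Real numbers depend heavily on record size, sync behaviour and whether
the ZIL is on SSD, so treat this only as a way to compare layouts.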

One tempting option is to have only 1 disk in a vdev and let Ceph do
the rest. The problem is that you will need to scrub ZFS more often and
repair manually, because errors will be detected but cannot be repaired
by ZFS itself.

We have not even discussed compression in ZFS, which is another big
way of getting more speed out of the system...

There are also some questions that I'm wondering about:
 - L2ARC uses (lots of) RAM, so do the OSDs, and then there is
   the buffer. All of these interact and compete for free RAM.
   What mix is sensible and gets the most out of the memory you have?
 - If you have a fast ZIL, would you still need a journal in Ceph?
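
As a starting point for the memory question, a small sketch in Python of
how much RAM the L2ARC index alone could take. Every block cached in
L2ARC needs a header kept in RAM; the header size depends on the ZFS
version, so it is a parameter here, and all values in the example are
hypothetical assumptions rather than recommendations.

```python
GIB = 1024 ** 3


def l2arc_header_ram_bytes(l2arc_bytes: int, avg_block_bytes: int, header_bytes: int) -> int:
    """Estimate RAM used by L2ARC headers: one header per cached block."""
    if avg_block_bytes <= 0:
        raise ValueError("average block size must be positive")
    if l2arc_bytes < 0 or header_bytes < 0:
        raise ValueError("sizes must not be negative")
    cached_blocks = l2arc_bytes // avg_block_bytes
    return cached_blocks * header_bytes


# Hypothetical example: 200 GiB L2ARC, 8 KiB average block size,
# 100 bytes per header (assumed; check the value for your ZFS version).
ram = l2arc_header_ram_bytes(200 * GIB, 8 * 1024, 100)
print(f"L2ARC headers: about {ram / GIB:.2f} GiB of RAM")
```

Small blocks make the header overhead grow quickly, which is one reason
the L2ARC size has to be weighed against what the OSDs need.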

Just my 2cts,
--WjW


>> -----Original Message-----
>> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Patrick Donnelly
>> Sent: Wednesday, 11 January 2017 5:24 PM
>> To: Kevin Olbrich
>> Cc: Ceph Users
>> Subject: Re: Review of Ceph on ZFS - or how not to deploy Ceph for RBD + OpenStack
>> 
>> Hello Kevin,
>> 
>> On Tue, Jan 10, 2017 at 4:21 PM, Kevin Olbrich <ko@xxxxxxx> wrote:
>>> 5x Ceph node equipped with 32GB RAM, Intel i5, Intel DC P3700
>>> NVMe journal,
>> 
>> Is the "journal" used as a ZIL?
>> 
>>> We experienced a lot of io blocks (X requests blocked > 32 sec)
>>> when a lot of data is changed in cloned RBDs (disk imported via
>>> OpenStack Glance, cloned during instance creation by Cinder). If
>>> the disk was cloned some months ago and large software updates
>>> are applied (a lot of small files) combined with a lot of syncs,
>>> we often had a node hit suicide timeout. Most likely this is a
>>> problem with op thread count, as it is easy to block threads with
>>> RAIDZ2 (RAID6) if many small operations are written to disk
>>> (again, COW is not optimal here). When recovery took place
>>> (0.020% degraded) the cluster performance was very bad - remote
>>> service VMs (Windows) were unusable. Recovery itself was using 70
>>> - 200 mb/s which was okay.
>> 
>> I would think having an SSD ZIL here would make a very large
>> difference. Probably a ZIL may have a much larger performance
>> impact than an L2ARC device. [You may even partition it and have
>> both but I'm not sure if that's normally recommended.]
>> 
>> Thanks for your writeup!
>> 
>> -- Patrick Donnelly 
>> _______________________________________________ ceph-users mailing
>> list ceph-users@xxxxxxxxxxxxxx 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com