Newbie Ceph Design Questions

Hello,

On Thu, 18 Sep 2014 13:07:35 +0200 Christoph Adomeit wrote:

> 
> Hello Ceph-Community,
> 
> we are considering to use a Ceph Cluster for serving VMs.
> We need good performance and absolute stability.
> 
I really don't want to sound snarky here, but you get what you pay for;
the old adage of "cheap, fast, reliable: pick two" still holds.

That said, Ceph can probably fulfill your needs if you're willing to invest
the time (learning curve, testing) and money (resources). 

> Regarding Ceph I have a few questions.
> 
> Presently we use Solaris ZFS Boxes as NFS Storage for VMs.
> 
That sounds slower than I would expect Ceph RBD to be in nearly all cases.

Also, how do you replicate the filesystems to cover for node failures? 

> The zfs boxes are totally fast, because they use all free ram
> for read caches. With arc stats we can see that 90% of all read 
> operations are served from memory. Also read cache in zfs is very 
> intelligent about what blocks to put in the read cache.
> 
> From reading about Ceph it seems that Ceph clusters don't have
> such an optimized read cache. Do you think we can still perform
> as well as the solaris boxes ?
> 
It's called the Linux page cache. If you're spending enough money to fill
your OSD nodes with similar amounts of RAM, the ratio will also be similar.
I have a Ceph storage cluster with just 2 storage nodes (don't ask, read
my older posts if you want to know how and why) with 32GB of RAM each, and
they serve nearly all reads for about 100 VMs out of that cache space,
too. 
More memory in OSD nodes is definitely one of the best ways to improve
performance with Ceph. 
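
If you want to see how much of a node's RAM is actually sitting in the
page cache, a quick look at /proc/meminfo is enough. A minimal sketch of
that check (plain Python, nothing Ceph-specific, Linux only):

# Rough check of how much RAM a node is currently using as page cache.
def meminfo_kib(field):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])  # values are in kB
    raise KeyError(field)

total = meminfo_kib("MemTotal")
cached = meminfo_kib("Cached")
print("page cache: %d MiB of %d MiB (%.0f%%)"
      % (cached // 1024, total // 1024, 100.0 * cached / total))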

In the future (not right now, the feature is much too new) Ceph cache
pools (SSD-based) are likely to be very helpful with working sets that
exceed the OSD RAM size.
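
For when you do get around to experimenting with it, the setup is only a
handful of ceph CLI calls. A minimal sketch, assuming a backing pool
called "rbd" and an already-created SSD-backed pool called "rbd-cache"
(both names are placeholders, and the CRUSH rule placing the cache pool
on SSDs is up to you):

import subprocess

def ceph(*args):
    # Thin wrapper around the ceph CLI; raises if a command fails.
    subprocess.check_call(["ceph"] + list(args))

# Attach the SSD pool as a writeback cache tier in front of the base pool.
ceph("osd", "tier", "add", "rbd", "rbd-cache")
ceph("osd", "tier", "cache-mode", "rbd-cache", "writeback")
ceph("osd", "tier", "set-overlay", "rbd", "rbd-cache")

# The tiering agent needs hit set tracking and a size target before it
# knows when to start flushing and evicting objects.
ceph("osd", "pool", "set", "rbd-cache", "hit_set_type", "bloom")
ceph("osd", "pool", "set", "rbd-cache", "target_max_bytes", str(200 * 1024**3))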

> Next question: I read that in Ceph an OSD is marked invalid, as 
> soon as its journaling disk is invalid. So what should I do ? I don't
> want to use 1 journal disk for each OSD. I also don't want to use 
> a journal disk per 4 OSDs, because then I will lose 4 OSDs if an SSD
> fails. Using journals on OSD disks, I am afraid, will be slow.
> Again I am afraid of slow Ceph performance compared to ZFS, because
> ZFS supports ZIL write cache disks.
> 
I don't do ZFS, but it is my understanding that losing the ZIL cache
(presumably on an SSD for speed reasons) will also potentially lose you
the latest writes. So not really all that different from Ceph.

With Ceph you will (if you strive for that "absolute stability" you
mention, which I also interpret as reliability) have at least a replica
size of 3, so 3 copies of your data, on different nodes of course.
So losing an OSD or 4 isn't great, but it's not the end of the world
either. There are many discussions here that cover these subjects; the
options range from maximum speed and low cost while all is working (the
classic Ceph setup with one SSD journal per 2-5 HDDs) to setups with a
RAID1 journal in front of RAID6 OSD storage. This all depends on your
goals/needs and wallet size. 
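
To make the replica count concrete, here is a minimal sketch, assuming
your VM images live in a pool called "rbd" (the name is a placeholder):
size=3 keeps three copies on different hosts with the default CRUSH rules,
and min_size=2 keeps the pool writable while one copy is missing, e.g.
after a journal SSD takes a few OSDs with it.

import subprocess

# Keep 3 copies of every object, but allow I/O to continue as long as
# at least 2 of them are available.
for var, val in (("size", "3"), ("min_size", "2")):
    subprocess.check_call(["ceph", "osd", "pool", "set", "rbd", var, val])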

If you have no control over or insight into what your VMs are doing and
thus want a generic, high-performance cluster, look at:

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

I find this document a bit dated and optimistic at points given recent
developments, but it is a very good basis to start from.

You can juggle the numbers, but unless you're willing to run the tests for
your specific use case yourself and optimize for it, I would recommend
something like 9 storage nodes (with n OSDs each, depending on your
requirements) and at least 3 monitors for the initial deployment. 
If your storage nodes have SSDs for the OS and plenty of CPU/RAM reserves,
I see no reason not to put monitors on them if you're tight on space or
money. 

In that scenario losing even 4 OSDs due to a journal SSD failure would
not be the end of the world by a long shot. Never mind that if you're
using the right SSDs (Intel DC S3700 for example) you're unlikely to ever
experience such a failure. 
And even if you do, there are again plenty of discussions on this ML about
how to mitigate the effects of such a failure (in terms of replication
traffic and its impact on cluster performance; data redundancy should
really never be the issue). 

> Last Question: Someone told me Ceph Snapshots are slow. Is this true ?
> I always thought making a snapshot is just moving around some pointers 
> to data.
>
No idea, I don't use them.
But from what I gather, the DELETION of them (like that of RBD images) is
the rather resource-intensive part, not the creation.
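
For reference, this is what snapshot creation and removal look like
through the rbd Python bindings. A minimal sketch; the "rbd" pool and
"vm-disk-1" image are placeholder names:

import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx("rbd")
try:
    image = rbd.Image(ioctx, "vm-disk-1")
    try:
        # Cheap: records metadata, data is copy-on-write from here on.
        image.create_snap("before-upgrade")
        # ... later ...
        # This is the part that generates real work on the OSDs.
        image.remove_snap("before-upgrade")
    finally:
        image.close()
finally:
    ioctx.close()
    cluster.shutdown()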
 
> And very last question: What about btrfs, still not recommended ?
> 
Definitely not from where I'm standing.
Between the inherent disadvantage of using BTRFS (CoW, thus fragmentation
galore) for VM storage and the actual bugs people run into, I don't think
it ever will be.

I venture that Key/Value store systems will be both faster and more
reliable than BTRFS within a year or so.

Christian

> Thanks for helping
> 
> Christoph
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

