Hello,

On Thu, 18 Sep 2014 13:07:35 +0200 Christoph Adomeit wrote:

> Hello Ceph-Community,
>
> we are considering to use a Ceph cluster for serving VMs.
> We need good performance and absolute stability.
>
I really don't want to sound snarky here, but you get what you pay for; the old adage of "cheap, fast, reliable: pick one" still holds.

That said, Ceph can probably fulfill your needs if you're willing to invest the time (learning curve, testing) and money (resources).

> Regarding Ceph I have a few questions.
>
> Presently we use Solaris ZFS boxes as NFS storage for VMs.
>
That sounds slower than I would expect Ceph RBD to be in nearly all cases. Also, how do you replicate the filesystems to cover for node failures?

> The ZFS boxes are totally fast, because they use all free RAM
> for read caches. With arc stats we can see that 90% of all read
> operations are served from memory. Also the read cache in ZFS is very
> intelligent about what blocks to put in the read cache.
>
> From reading about Ceph it seems that Ceph clusters don't have
> such an optimized read cache. Do you think we can still perform
> as well as the Solaris boxes?
>
It's called the Linux page cache. If you're spending enough money to fill your OSD nodes with similar amounts of RAM, the ratio will also be similar.

I have a Ceph storage cluster with just 2 storage nodes (don't ask, read my older posts if you want to know how and why) with 32GB RAM each, and they serve nearly all reads for about 100 VMs out of that cache space, too.

More memory in the OSD nodes is definitely one of the best ways to improve performance with Ceph.

In the future (not right now, the feature is much too new) Ceph cache pools (SSD based) are likely to be very helpful with working sets that go beyond the OSD RAM size.

> Next question: I read that in Ceph an OSD is marked invalid as
> soon as its journaling disk is invalid. So what should I do? I don't
> want to use 1 journal disk for each OSD.
> I also don't want to use
> a journal disk per 4 OSDs, because then I will lose 4 OSDs if an SSD
> fails. Using journals on the OSD disks I am afraid will be slow.
> Again I am afraid of slow Ceph performance compared to ZFS, because
> ZFS supports ZIL write cache disks.
>
I don't do ZFS, but it is my understanding that losing the ZIL cache (presumably on an SSD for speed reasons) will also potentially lose you the latest writes. So not really all that different from Ceph.

With Ceph you will (if you strive for that "absolute stability" you mention, which I also interpret as reliability) have at least a replica size of 3, so 3 copies of your data. On different nodes, of course. So losing an OSD or 4 isn't great, but it's not the end of the world either.

There are many discussions here that cover these subjects; the options range from maximum speed and low cost when all is working (the classic Ceph setup with one SSD journal per 2-5 HDDs) to things like a RAID1 journal in front of RAID6 OSD storage. This all depends on your goals/needs and wallet size.

If you have no control over or idea of what your VMs are doing and thus want a generic, high-performance cluster, look at:
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

I find this document a bit dated and optimistic at points given recent developments, but it is a very good basis to start from.

You can juggle the numbers, but unless you're willing to do the tests for your specific use case yourself and optimize for it, I would recommend something like 9 storage nodes (with n OSDs each, depending on your requirements) and at least 3 monitors for the initial deployment. If your storage nodes have SSDs for the OS and plenty of CPU/RAM reserves, I see no reason not to put the monitors on them if you're tight for space or money.

In that scenario, losing even 4 OSDs due to a journal SSD failure would not be the end of the world by a long shot.
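To put that "losing 4 OSDs" point into perspective, here is a back-of-the-envelope sketch in plain Python. All the numbers (9 nodes, 8 OSDs per node, one journal SSD per 4 OSDs) are purely illustrative, not from any real deployment:

```python
# Failure-domain math for a hypothetical cluster:
# 9 nodes x 8 OSDs, one journal SSD per 4 OSDs, replica size 3.
# All numbers are illustrative, not from a real deployment.
from math import comb

nodes, osds_per_node, replica = 9, 8, 3
total_osds = nodes * osds_per_node
failed = 4  # OSDs lost when one journal SSD dies

# Fraction of the cluster's OSDs (and thus, roughly, of its data)
# that has to be re-replicated after the SSD failure:
refill_fraction = failed / total_osds
print(f"OSDs offline: {failed}/{total_osds} ({refill_fraction:.1%})")

# With CRUSH placing each replica on a *different node*, the 4 failed
# OSDs all sit in one node, so a PG can lose at most 1 of its 3
# replicas: no data loss, just recovery traffic.
# Contrast with naive uniform placement across OSDs, where the chance
# that some given PG had all 3 replicas on the failed OSDs would be:
p_all_lost = comb(failed, replica) / comb(total_osds, replica)
print(f"naive all-replicas-lost chance per PG: {p_all_lost:.6f}")
```

The point being: a journal SSD failure costs you a few percent of raw capacity and some recovery traffic, while the default node-level failure domain keeps every PG's data available.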
Never mind that if you're using the right SSDs (Intel DC S3700, for example) you're unlikely to ever experience such a failure. And even if you do, there are again plenty of discussions in this ML on how to mitigate the effects of such a failure (in terms of replication traffic and its impact on cluster performance; data redundancy should really never be the issue).

> Last question: Someone told me Ceph snapshots are slow. Is this true?
> I always thought making a snapshot is just moving around some pointers
> to data.
>
No idea, I don't use them. But from what I gather the DELETION of them (as with RBD images) is a rather resource-intensive process, not the creation.

> And the very last question: What about btrfs, still not recommended?
>
Definitely not from where I'm standing. Between the inherent disadvantage of using BTRFS (CoW, thus fragmentation galore) for VM storage and the actual bugs people run into, I don't think it ever will be. I venture that key/value store systems will be both faster and more reliable than BTRFS within a year or so.

Christian

> Thanks for helping
>
> Christoph
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Fusion Communications
http://www.gol.com/