Re: Real world benefit from SSD Journals for a more read than write cluster

Sooooo, I was running with size=2, until we had a network interface on an OSD node go faulty and start corrupting data. Because Ceph couldn't tell which copy was right, it caused all sorts of trouble. I might have been able to recover more gracefully had I caught the problem sooner and identified the root cause right away, but as it was, we ended up labeling every VM in the cluster suspect, destroying the whole thing, and restoring from backups. I didn't manage to find the root of the problem until I was rebuilding the cluster and noticed one node "felt weird" when I was ssh'd into it. It was painful.

We are currently running "important" VMs from a Ceph pool with size=3, and more disposable ones from a size=2 pool. That seems a reasonable tradeoff so far, giving us a bit more IO headroom than we would have running 3 copies for everything, while still having safety where we need it.
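
For anyone setting up the same split, a minimal sketch of the pool commands, wrapped in Python (the pool names and pg_num values are placeholders, not our actual ones; adjust for your cluster):

    #!/usr/bin/env python
    # Minimal sketch: one 3-replica pool for important VMs and one
    # 2-replica pool for disposable ones.
    import subprocess

    def ceph(*args):
        # Thin wrapper around the ceph CLI (assumed to be on PATH
        # with admin rights).
        subprocess.check_call(("ceph",) + args)

    ceph("osd", "pool", "create", "vms-important", "512")
    ceph("osd", "pool", "set", "vms-important", "size", "3")
    ceph("osd", "pool", "set", "vms-important", "min_size", "2")

    ceph("osd", "pool", "create", "vms-disposable", "512")
    ceph("osd", "pool", "set", "vms-disposable", "size", "2")
    # min_size 1 keeps IO flowing with one copy down, trading away
    # safety -- only acceptable for the disposable pool.
    ceph("osd", "pool", "set", "vms-disposable", "min_size", "1")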

QH

On Thu, Jul 9, 2015 at 3:46 PM, Götz Reinicke <goetz.reinicke@xxxxxxxxxxxxxxx> wrote:
Hi Warren,

Thanks for that feedback. Regarding the 2 or 3 copies, we had a lot of internal discussion with lots of pros and cons for both :) … and finally decided to give 2 copies a chance to prove itself in the first cluster, now called the evaluation cluster.

I bet by 2016 we will see whether that was a good or a bad decision; data loss would be OK in that scenario. We evaluate. :)

Regarding one P3700 for 12 SATA disks: do I get it right that if that P3700 fails, all 12 OSDs are lost…? From my current knowledge that looks like a bigger risk to me. Or are the P3700s so much more reliable than e.g. the S3500 or S3700?

Or is the suggestion to add the P3700 only if we go in the direction of 20+ nodes, and until then stay without SSDs for journaling?
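
To make that risk concrete for myself, a quick back-of-the-envelope in Python (only the layout numbers from this thread, nothing else):

    # Share of a node's OSDs lost when a single journal device dies,
    # for the two layouts discussed.
    layouts = [
        ("4 SATA SSD journals, 4 OSDs each", 16, 4),
        ("1 P3700 for the whole node", 12, 12),
    ]
    for name, osds_per_node, osds_per_journal in layouts:
        print("%s: one journal failure takes out %d of %d OSDs (%.0f%%)"
              % (name, osds_per_journal, osds_per_node,
                 100.0 * osds_per_journal / osds_per_node))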

I really appreciate your thoughts and feedback, and I'm aware that building a Ceph cluster is some mix of knowing the specs, configuration options, math, experience, modification, and feedback from best-practice real-world clusters. In the end all clusters are unique in some way, and what works for one will not work for another.

Thanks for the feedback, 100 kowtows. Götz



> On 09.07.2015 at 16:58, Wang, Warren <Warren_Wang@xxxxxxxxxxxxxxxxx> wrote:
>
> You'll take a noticeable hit on write latency. Whether or not it's tolerable will be up to you and the workload you have to capture. Large-file operations are throughput-efficient without an SSD journal, as long as you have enough spindles.
>
> About the Intel P3700: you will only need one to keep up with 12 SATA drives. The 400 GB model is probably okay if you keep the journal sizes small, but the 800 GB one is probably safer if you plan on leaving these in production for a few years. It depends on the turnover of data on the servers.
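>
> As a sanity check on the sizing, the usual rule of thumb from the Ceph docs, worked through in Python (my throughput and sync-interval numbers are assumptions, not measurements):
>
>     # journal size ~= 2 * expected throughput * filestore max sync interval
>     disk_mb_s = 150        # assumed sustained MB/s per SATA OSD
>     sync_interval_s = 5    # assumed filestore max sync interval
>     per_osd_gb = 2.0 * disk_mb_s * sync_interval_s / 1000
>     print("per OSD: %.1f GB, 12 OSDs: %.0f GB"
>           % (per_osd_gb, 12 * per_osd_gb))
>     # ~1.5 GB per OSD, ~18 GB for 12, so raw capacity is not the
>     # issue; the 800 GB model mainly buys write endurance.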
>
> The dual disk failure comment is pointing out that you are more exposed to data loss with 2 copies. You need to understand that there is a possibility of 2 drives failing either simultaneously, or one failing before the cluster has repaired the loss of the other. As usual, whether that is acceptable is a decision you have to make. We have many clusters; some run 2 copies, others 3. If your data resides nowhere else, then 3 copies is the safe thing to do. That's getting harder and harder to justify, though, as the price of other storage solutions using erasure coding continues to plummet.
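>
> To put a rough number on that exposure, a toy Python calculation (the failure rate, recovery window, and disk count are assumptions; plug in your own):
>
>     # Chance that any other disk fails while a lost OSD is still
>     # re-replicating; with 2 copies an overlap like that can lose PGs.
>     afr = 0.04          # assumed annual failure rate per disk
>     recover_h = 24.0    # assumed hours to re-replicate one OSD
>     disks = 72          # e.g. 6 nodes * 12 OSDs
>     p_hour = afr / (365 * 24)
>     p_overlap = 1 - (1 - p_hour * recover_h) ** (disks - 1)
>     print("P(overlapping failure per incident) ~ %.2f%%" % (100 * p_overlap))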
>
> Warren
>
> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Götz Reinicke - IT Koordinator
> Sent: Thursday, July 09, 2015 4:47 AM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Real world benefit from SSD Journals for a more read than write cluster
>
> Hi Christian,
> On 09.07.15 at 09:36, Christian Balzer wrote:
>>
>> Hello,
>>
>> On Thu, 09 Jul 2015 08:57:27 +0200 Götz Reinicke - IT Koordinator wrote:
>>
>>> Hi again,
>>>
>>> time is passing, and so is my budget :-/ so I have to recheck the
>>> options for a "starter" cluster. An expansion next year, maybe for an
>>> OpenStack installation or for more performance if demands rise, is
>>> possible. The "starter" could always be used as a test or slow dark archive.
>>>
>>> At the beginning I was at 16 SATA OSDs with 4 SSDs for journals per
>>> node, but now I'm looking at 12 SATA OSDs without SSD journals. Less
>>> performance, less capacity, I know. But that's OK!
>>>
>> Leave the space to upgrade these nodes with SSDs in the future.
>> If your cluster grows large enough (more than 20 nodes) even a single
>> P3700 might do the trick and will need only a PCIe slot.
>
> If I get you right, the 12-disk setup is not a bad idea, and if the need for SSD journals arises I can add the PCIe P3700 later.
>
> In the 12-OSD setup I would get 2 P3700s, one per 6 OSDs.
>
> Good or bad idea?
>
>>
>>> There should be 6 nodes, or maybe 8 with the 12-OSD layout, with a repl. of 2.
>>>
>> Danger, Will Robinson.
>> This is essentially a RAID5 and you're plain asking for a double disk
>> failure to happen.
>
> Maybe I do not understand that. size = 2 is, I think, more like RAID1 ...? And why am I asking for a double disk failure?
>
> Too few nodes or OSDs, or because of the size = 2?
>
>>
>> See this recent thread:
>> "calculating maximum number of disk and node failure that can be
>> handled by cluster with out data loss"
>> for some discussion and a python script which you will need to
>> modify for 2-disk replication.
>>
>> With a RAID5 failure calculator you're at 1 data loss event per 3.5
>> years...
>>
>
> Thanks for that thread, but I don't get the point of it for my case.
>
> I see that calculating the reliability is some sort of complex math ...
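>
> As far as I understand it, the core of that math boils down to something like the Python below (all the rates are my assumptions, and the real script models more factors):
>
>     # Stripped-down replica-2 estimate: first failures per year times
>     # the chance that a second disk fails during the recovery window.
>     disks = 72          # 6 nodes * 12 OSDs
>     afr = 0.04          # assumed annual failure rate per disk
>     recover_h = 24.0    # assumed re-replication time per failure
>     first_per_year = disks * afr
>     p_hour = afr / (365 * 24)
>     p_overlap = 1 - (1 - p_hour * recover_h) ** (disks - 1)
>     loss_per_year = first_per_year * p_overlap
>     print("expected data-loss events/year: %.3f (one per %.0f years)"
>           % (loss_per_year, 1.0 / loss_per_year))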
>
>>> The workload I expect is mostly writes of maybe some GB of Office
>>> files per day and some TB of larger video files from a few users per week.
>>>
>>> By the end of this year we calculate to have +-60 to 80 TB of larger
>>> video files in that cluster, which are accessed from time to time.
>>>
>>> Any suggestions on dropping the SSD journals?
>>>
>> You will miss them when the cluster does write, be it from clients or
>> when re-balancing a lost OSD.
>
> I can imagine that I might miss the SSD journals, but if I can add the
> P3700 later, I feel comfy with it for now. Budget and evaluation related.
>
>       Thanks for your helpful input and feedback. /Götz
>

--
Götz Reinicke
IT-Koordinator

Tel. +49 7141 969 82420
E-Mail goetz.reinicke@xxxxxxxxxxxxxxx

Filmakademie Baden-Württemberg GmbH
Akademiehof 10
71638 Ludwigsburg
www.filmakademie.de

Registered with Amtsgericht Stuttgart, HRB 205016

Chairman of the Supervisory Board: Jürgen Walter MdL,
State Secretary in the Ministry of Science,
Research and the Arts Baden-Württemberg

Managing Director: Prof. Thomas Schadt



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

