Re: Designing a cluster guide

On Mon, May 21, 2012 at 4:52 PM, Quenten Grasso <QGrasso@xxxxxxxxxx> wrote:
> Hi All,
>
>
> I've been thinking about this issue myself for the past few days, and the idea I've come up with is running 16 x 2.5" 15K 72/146GB disks
> in RAID10 inside a 2U server, with JBODs attached to the server for the actual storage.
>
> Can someone help clarify this one?
>
> Is the write only considered successful by the client once the data has been written to the journal disk, read back from the journal disk, and then written to the storage disk?
> Or
> Is the write considered successful by the client as soon as the data is written to the journal disk?
This one — the write is considered "safe" once it is on-disk on all
OSDs currently responsible for hosting the object.
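
As I understand it, the journal write is the part the client waits on; the
later flush to the backing filesystem does not gate the ack. A toy sketch of
that path (illustration only, not the actual OSD code; the OSD class and its
method names are made up):

# Toy model of the ack path described above; not Ceph's real code.
class OSD:
    def __init__(self, name):
        self.name = name
        self.journal = []      # stable storage (journal device)
        self.store = []        # backing filesystem, filled in later

    def journal_append(self, data):
        self.journal.append(data)          # synchronous, latency-critical

    def flush(self):
        self.store.extend(self.journal)    # asynchronous, does not gate the ack
        self.journal = []

def write_object(data, acting_set):
    # The write is only "safe" once every OSD currently responsible for
    # the object has it on stable storage (its journal).
    for osd in acting_set:
        osd.journal_append(data)
    return "commit"                        # the client unblocks here

osds = [OSD("osd.0"), OSD("osd.1")]
print(write_object(b"hello", osds))        # -> commit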

Every time anybody mentions RAID10 I have to remind them of the
storage amplification that entails, though. Are you sure you want that
on top of (well, underneath, really) Ceph's own replication?
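
To put a number on that, back-of-the-envelope only (the disk count and the
2x Ceph replication factor below are just examples, not a recommendation):

# Storage amplification of RAID10 underneath Ceph replication.
raw_tb = 16 * 0.146          # e.g. 16 x 146GB disks, in TB
raid10_factor = 2            # RAID10 mirrors everything
ceph_replicas = 2            # Ceph's own replication on top

usable_tb = raw_tb / (raid10_factor * ceph_replicas)
print("raw: %.2f TB, usable: %.2f TB (%.0f%% of raw)"
      % (raw_tb, usable_tb, 100.0 * usable_tb / raw_tb))
# -> raw: 2.34 TB, usable: 0.58 TB (25% of raw)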

> Or
> Is the data written to the journal disk and the storage disk at the same time, with the write considered successful by the client once both complete? (If this is the case, SSDs may not be so useful.)
>
>
> Pros
> Quite fast write throughput to the journal disks
> No write wear-out of SSDs
> RAID10 with a 1GB cache controller also helps improve things (if you're really keen you could use CacheCade as well)
>
>
> Cons
> Not as fast as SSDs
> More rack space required per server.
>
>
> Regards,
> Quenten
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Slawomir Skowron
> Sent: Tuesday, 22 May 2012 7:22 AM
> To: ceph-devel@xxxxxxxxxxxxxxx
> Cc: Tomasz Paszkowski
> Subject: Re: Designing a cluster guide
>
> Maybe two cheap MLC Intel drives on SandForce (320/520), 120GB or 240GB,
> with the HPA reduced to 20-30GB and used only for separate journaling
> partitions in hardware RAID1, would be good for the journal.
>
> I'd like to test a setup like this, but does anyone have real-life info?
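
For what it's worth, the sizing guidance I've seen for the filestore journal
is roughly 2 x the throughput the device can sustain x the filestore max
sync interval; a quick sanity check (the numbers below are made-up examples,
plug in your own):

# Rough journal sizing per the common rule of thumb:
#   journal size >= 2 * expected_throughput * filestore_max_sync_interval
expected_throughput_mb_s = 500      # example: what the SSD pair can sustain
filestore_max_sync_interval_s = 5   # example: check the value in your ceph.conf

journal_size_mb = 2 * expected_throughput_mb_s * filestore_max_sync_interval_s
print("suggested journal size: %d MB" % journal_size_mb)   # -> 5000 MB

By that measure a 20-30GB partition per journal leaves plenty of headroom.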
>
> On Mon, May 21, 2012 at 5:07 PM, Tomasz Paszkowski <ss7pro@xxxxxxxxx> wrote:
>> Another great thing that should be mentioned is
>> https://github.com/facebook/flashcache/. It gives really huge
>> performance improvements for reads/writes (especially on FusionIO
>> drives) even without using librbd caching :-)
>>
>>
>>
>> On Sat, May 19, 2012 at 6:15 PM, Alexandre DERUMIER <aderumier@xxxxxxxxx> wrote:
>>> Hi,
>>>
>>> For your journal, if you have the money, you can use a
>>> STEC ZeusRAM SSD drive (around 2000€ / 8GB / 100,000 IOPS read/write with 4k blocks).
>>> I'm using them with a ZFS SAN; they rock for journals.
>>> http://www.stec-inc.com/product/zeusram.php
>>>
>>> Another interesting product is the DDRdrive:
>>> http://www.ddrdrive.com/
>>>
>>> ----- Original Message -----
>>>
>>> From: "Stefan Priebe" <s.priebe@xxxxxxxxxxxx>
>>> To: "Gregory Farnum" <greg@xxxxxxxxxxx>
>>> Cc: ceph-devel@xxxxxxxxxxxxxxx
>>> Sent: Saturday, 19 May 2012 10:37:01
>>> Subject: Re: Designing a cluster guide
>>>
>>> Hi Greg,
>>>
>>> On 17.05.2012 23:27, Gregory Farnum wrote:
>>>>> It mentions, for example, a "fast CPU" for the MDS system. What does fast
>>>>> mean? Just the speed of one core? Or is Ceph designed to use multiple cores?
>>>>> Is more cores or more speed per core important?
>>>> Right now, it's primarily the speed of a single core. The MDS is
>>>> highly threaded but doing most things requires grabbing a big lock.
>>>> How fast is a qualitative rather than quantitative assessment at this
>>>> point, though.
>>> So would you recommend a fast (higher GHz) Core i3 instead of a single
>>> Xeon for this system? (The price per GHz is better.)
>>>
>>>> It depends on what your nodes look like, and what sort of cluster
>>>> you're running. The monitors are pretty lightweight, but they will add
>>>> *some* load. More important are their disk access patterns - they have
>>>> to do a lot of syncs. So if they're sharing a machine with some other
>>>> daemon, you want them to have an independent disk and to be running a
>>>> new kernel and glibc so that they can use syncfs rather than sync. (The
>>>> only distribution I know for sure does this is Ubuntu 12.04.)
>>> Which kernel and which glibc version support this? I have searched
>>> Google but haven't found an exact version. We're using Debian Lenny/Squeeze
>>> with a custom kernel.
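
For reference, syncfs(2) went into the kernel in 2.6.39 and glibc grew the
wrapper in 2.14, so anything at least that new should work. A quick way to
check whether the glibc you are running exposes it (a sketch using ctypes,
assuming a glibc-based system):

# Does the running glibc have a syncfs() wrapper? (needs Linux >= 2.6.39
# and glibc >= 2.14; attribute lookup on the CDLL does the dlsym check)
import ctypes

libc = ctypes.CDLL("libc.so.6", use_errno=True)
if hasattr(libc, "syncfs"):
    print("glibc exposes syncfs(); a monitor can sync just its own filesystem")
else:
    print("no syncfs() wrapper; daemons fall back to a full sync()")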
>>>
>>>>> Regarding the OSDs: is it fine to use an SSD RAID1 for the journal and
>>>>> perhaps 22x SATA disks in a RAID10 for the FS, or is this quite absurd
>>>>> and you should go for 22x SSD disks in a RAID6?
>>>> You'll need to do your own failure calculations on this one, I'm
>>>> afraid. Just take note that you'll presumably be limited to the speed
>>>> of your journaling device here.
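
In other words, with every write funnelled through the journal first, the
node's sustained write rate is roughly the smaller of the two paths. A
trivial sketch (all of the numbers are invented examples):

# Sustained node write throughput is capped by the slower path.
journal_mb_s = 250        # e.g. one SATA SSD RAID1 (writes hit both disks)
data_array_mb_s = 1200    # e.g. 22 SATA disks behind a RAID controller

node_write_mb_s = min(journal_mb_s, data_array_mb_s)
print("sustained writes capped at ~%d MB/s by the journal" % node_write_mb_s)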
>>> Yeah, that's why I wanted to use a RAID1 of SSDs for the journaling. Or
>>> is this still too slow? Another idea was to use only a ramdisk for the
>>> journal, back up its contents to disk while shutting down, and restore
>>> them after boot.
>>>
>>>> Given that Ceph is going to be doing its own replication, though, I
>>>> wouldn't want to add in another whole layer of replication with raid10
>>>> — do you really want to multiply your storage requirements by another
>>>> factor of two?
>>> OK, correct - bad idea.
>>>
>>>>> Is it more useful to use a RAID6 HW controller or btrfs RAID?
>>>> I would use the hardware controller over btrfs RAID for now; it allows
>>>> more flexibility in, e.g., switching to XFS. :)
>>> OK, but overall you would recommend running one OSD per disk, right? So
>>> instead of using a RAID6 with, for example, 10 disks, you would run 6 OSDs
>>> on this machine?
>>>
>>>>> Use a single-socket Xeon for the OSDs, or dual socket?
>>>> Dual socket servers will be overkill given the setup you're
>>>> describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
>>>> daemon. You might consider it if you decided you wanted to do an OSD
>>>> per disk instead (that's a more common configuration, but it requires
>>>> more CPU and RAM per disk and we don't know yet which is the better
>>>> choice).
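
Taking the 1GHz-per-OSD rule of thumb above literally, sizing a node is
simple arithmetic (the example node below is invented):

# CPU budget from the ~1GHz-per-OSD-daemon rule of thumb.
osd_daemons = 10          # e.g. one OSD per disk on a 10-disk node
ghz_per_osd = 1.0

cpu_ghz_needed = osd_daemons * ghz_per_osd
print("budget roughly %.0f GHz of aggregate CPU for %d OSDs"
      % (cpu_ghz_needed, osd_daemons))
# A single modern quad-core around 2.5GHz already covers that.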
>>> Is there also a rule of thumb for the memory?
>>>
>>> My biggest problem with Ceph right now is the awfully slow speed while
>>> doing random reads and writes.
>>>
>>> Sequential reads and writes are at 200MB/s (that's pretty good for bonded
>>> dual Gbit/s). But random reads and writes are only at 0.8-1.5 MB/s,
>>> which is definitely too slow.
>>>
>>> Stefan
>>>
>>>
>>>
>>> --
>>> Alexandre Derumier
>>> Systems Engineer
>>> Phone: 03 20 68 88 90
>>> Fax: 03 20 68 90 81
>>> 45 Bvd du Général Leclerc 59100 Roubaix - France
>>> 12 rue Marivaux 75002 Paris - France
>>>
>>
>>
>>
>> --
>> Tomasz Paszkowski
>> SS7, Asterisk, SAN, Datacenter, Cloud Computing
>> +48500166299
>
>
>
> --
> Regards,
>
> Sławek "sZiBis" Skowron
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

