Using Ramdisk wi

On Wed, 30 Jul 2014 18:17:16 +0200 Josef Johansson wrote:

> Hi,
> 
> Just chipping in:
> as RAM is pretty cheap right now, it could be an idea to fill all the
> memory slots in the OSD nodes; that gives a bigger chance that the data
> you've requested is already in RAM.
> 
While that is very, VERY true, it won't help his perceived bad read speeds
much, as they're not really caused by the OSDs per se.

> You should go with DC S3700 400GB for the journals at least..
> 
That's probably going overboard in the other direction.
While on paper this would be the first model to handle the sequential
write speeds of 3 HDDs, that kind of scenario is pretty unrealistic. 
Even with just one client writing they will never reach those speeds due
to FS overhead, parallel writes caused by replication and so forth.
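
Some rough back-of-the-envelope numbers (approximate datasheet figures, so
take them with a grain of salt):

  3x HDD at ~150MB/s sequential     ~450MB/s combined
  DC S3700 400GB sequential write   ~460MB/s
  DC S3700 200GB sequential write   ~365MB/s
  DC S3500 120GB sequential write   ~135MB/s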

The only scenario where this makes some sense is one with short, very high
write spikes that can be absorbed by the journal (both in terms of journal
size and Ceph settings like filestore max/min sync interval), followed by
pauses long enough to scribble the data out to the HDDs.
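
For reference, the knobs involved would look something like this in
ceph.conf (the values here are purely illustrative; they need to be sized
to your actual spike volume and how fast the HDDs can drain the journal):

  [osd]
  osd journal size            = 10240  # MB, big enough to absorb the spike
  filestore min sync interval = 10     # seconds, delay flushing to the HDDs
  filestore max sync interval = 30     # seconds, upper bound before a forced sync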

In the end, for nearly all use cases, obsessing over high write speeds is a
fallacy; one is much more likely to run out of steam due to the IOPS caused
by much smaller transactions.

What would worry me about the small DC 3500 is the fact that it is only
rated for about 38GB of writes per day over 5 years. Now this could very
well be within the deployment parameters, but we don't know.
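
(That figure simply falls out of the roughly 70TB total write endurance
Intel quotes for the 120GB DC S3500:

  70 TB / (5 years x 365 days) ~= 38 GB/day )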

A 200GB DC S3700 should be fine here: higher endurance, about 3 times the
sequential write speed of the DC 3500 120GB and 8 times the write IOPS.

Christian

> Cheers,
> Josef
> 
> On 30/07/14 17:12, Christian Balzer wrote:
> > On Wed, 30 Jul 2014 10:50:02 -0400 German Anders wrote:
> >
> >> Hi Christian,
> >>       How are you? Thanks a lot for the answers, mine in red.
> >>
> > Most certainly not in red on my mail client...
> >
> >> --- Original message ---
> >>> Subject: Re: [ceph-users] Using Ramdisk wi
> >>> From: Christian Balzer <chibi at gol.com>
> >>> To: <ceph-users at lists.ceph.com>
> >>> Cc: German Anders <ganders at despegar.com>
> >>> Date: Wednesday, 30/07/2014 11:42
> >>>
> >>>
> >>> Hello,
> >>>
> >>> On Wed, 30 Jul 2014 09:55:49 -0400 German Anders wrote:
> >>>
> >>>> Hi Wido,
> >>>>
> >>>>              How are you? Thanks a lot for the quick response. I know
> >>>> there is a heavy cost to using a ramdisk, but I still want to try it to
> >>>> see if I can get better performance, since I'm using a 10GbE network
> >>>> with the following configuration and can't achieve more than 300MB/s of
> >>>> throughput on RBD:
> >>>>
> >>> Testing the limits of Ceph with a ramdisk based journal to see what
> >>> is possible in terms of speed (and you will find that it is
> >>> CPU/protocol bound) is fine.
> >>> Anything resembling production is a big no-no.
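
For the record, if you do want to benchmark with a RAM backed journal
(again, for testing ONLY, a reboot or crash will trash that OSD), something
along these lines should work. This is just a sketch, assuming osd.0 with
its data in the default /var/lib/ceph/osd/ceph-0 location on an upstart
based system:

  # stop the OSD and flush its current journal
  stop ceph-osd id=0
  ceph-osd -i 0 --flush-journal
  # create a tmpfs ramdisk and point the journal symlink at it
  mkdir -p /mnt/ramjournal
  mount -t tmpfs -o size=10g tmpfs /mnt/ramjournal
  rm -f /var/lib/ceph/osd/ceph-0/journal
  ln -s /mnt/ramjournal/journal /var/lib/ceph/osd/ceph-0/journal
  # create the new journal and bring the OSD back up
  ceph-osd -i 0 --mkjournal
  start ceph-osd id=0
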
> >> Got it. Did you try flashcache from Facebook or dm-cache?
> > No.
> >
> >>>
> >>>
> >>>> MON Servers (3):
> >>>>              2x Intel Xeon E3-1270v3 @3.5Ghz (8C)
> >>>>              32GB RAM
> >>>>              2x SSD Intel 120G in RAID1 for OS
> >>>>              1x 10GbE port
> >>>>
> >>>> OSD Servers (4):
> >>>>              2x Intel Xeon E5-2609v2 @2.5Ghz (8C)
> >>>>              64GB RAM
> >>>>              2x SSD Intel 120G in RAID1 for OS
> >>>>              3x SSD Intel 120G for Journals (3 SAS disks per 1 SSD
> >>>> Journal)
> >>> You're not telling us WHICH actual Intel SSDs you're using.
> >>> If those are DC 3500 ones, then 300MB/s total isn't a big surprise
> >>> at all,
> >>> as they are capable of 135MB/s writes at most.
> >> The SSD model is Intel SSDSC2BB120G4, firmware D2010370
> > That's not really an answer, but then again Intel could have chosen
> > model numbers that resemble their product names.
> >
> > That is indeed a DC 3500, so my argument stands.
> > With those SSDs for your journals, much more than 300MB/s per node is
> > simply not possible, never mind how fast or slow the HDDs perform.
> >
> >>>
> >>>
> >>>>              9x SAS 3TB 6G for OSD
> >>> That would be somewhere over 1GB/s in theory, but given file system
> >>> and other overheads (what is your replication level?) that's a very
> >>> theoretical value indeed.
> >> The RF is 2, so perf should be much better. Also note that read
> >> perf is really poor, around 62MB/s...
> >>
> > A replication factor of 2 means that each write is amplified by 2.
> > So half of your theoretical performance is gone already.
> >
> > Do your tests with atop or iostat running on all storage nodes.
> > Determine where the bottleneck is: the journal SSDs, the HDDs or
> > (unlikely) something else.
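
Something as simple as the following on each storage node while the
benchmark runs will usually make the bottleneck obvious (watch %util and
await for the journal SSDs versus the HDDs; the 5 second interval is just
a suggestion):

  iostat -xmt 5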
> >
> > Read performance sucks with RBD (at least for individual clients); it
> > can be improved by tuning the readahead value. See:
> >
> > http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/8817
> >
> > This is something the Ceph developers are aware of and hopefully will
> > address in the future:
> > https://wiki.ceph.com/Planning/Blueprints/Emperor/Kernel_client_read_ahead_optimization
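
Until then, bumping up the readahead on the client side is the usual
workaround. For a kernel mapped RBD that would be something like the
following (rbd0 and the 4MB value are just examples, experiment with what
works for your workload):

  echo 4096 > /sys/block/rbd0/queue/read_ahead_kb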
> >
> > Christian
> >
> >>>
> >>>
> >>> Christian
> >>>
> >>>>              2x 10GbE port (1 for Cluster Network, 1 for Public 
> >>>> Network)
> >>>>
> >>>> - 10GbE Switches (1 for Cluster interconnect and 1 for Public
> >>>> network)
> >>>> - Using Ceph Firefly version 0.80.4.
> >>>>
> >>>>              The thing is that with fio, rados bench and vdbench we
> >>>> only see 300MB/s on writes (random and sequential) with a block size
> >>>> of 4M and 16 threads, which is pretty low. Yesterday I was on the Ceph
> >>>> IRC channel and came across the presentation someone from Fujitsu gave
> >>>> in Frankfurt, as well as some mails about a 10GbE setup achieving
> >>>> almost 795MB/s and more. I would like to know, if possible, how to
> >>>> implement that so we could improve our Ceph cluster a little bit more.
> >>>> I have already set the I/O scheduler to [noop] on the SSDs (both OS
> >>>> and Journal), but still didn't notice any improvement. That's why we
> >>>> would like to try a RAMDISK for the journals; I noticed that they
> >>>> implemented that on their Ceph cluster.
> >>>>
> >>>> I would really appreciate help on this. If you need me to send more
> >>>> information about the Ceph setup, please let me know. If someone could
> >>>> share some detailed config info, that would really help too!
> >>>>
> >>>> Thanks a lot,
> >>>>
> >>>>
> >>>> German Anders
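
For reference, a rados bench invocation roughly matching those fio
parameters, plus the noop scheduler setting mentioned above, would look
something like the lines below. The pool name "rbd" and the device name
sdb are assumptions, substitute your own:

  # 60 second write test, 4MB objects, 16 concurrent ops
  rados -p rbd bench 60 write -b 4194304 -t 16 --no-cleanup
  # set the noop elevator on a journal SSD
  echo noop > /sys/block/sdb/queue/scheduler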
> >>>>
> >>>>> --- Original message ---
> >>>>> Subject: Re: [ceph-users] Using Ramdisk wi
> >>>>> From: Wido den Hollander <wido at 42on.com>
> >>>>> To: <ceph-users at lists.ceph.com>
> >>>>> Date: Wednesday, 30/07/2014 10:34
> >>>>>
> >>>>> On 07/30/2014 03:28 PM, German Anders wrote:
> >>>>>>
> >>>>>> Hi Everyone,
> >>>>>>
> >>>>>>                                Is anybody using a ramdisk to put the
> >>>>>> journal on? If so, could you please share the commands to implement
> >>>>>> that? I'm having some issues with it and want to test it out to see
> >>>>>> if I could get better performance.
> >>>>> Don't do this. When you lose the journal, you lose the OSD. So a
> >>>>> reboot of the machine effectively trashes the data on that OSD.
> >>>>>
> >>>>> Wido
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thanks in advance,
> >>>>>>
> >>>>>> German Anders
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Wido den Hollander
> >>>>> 42on B.V.
> >>>>> Ceph trainer and consultant
> >>>>>
> >>>>> Phone: +31 (0)20 700 9902
> >>>>> Skype: contact42on
> >>>
> >>> --
> >>> Christian Balzer        Network/Systems Engineer
> >>> chibi at gol.com   	Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/
> >
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/

