Re: Possible improvements for a slow write speed (excluding independent SSD journals)

Hello,

On Fri, 1 May 2015 12:03:59 -0400 Anthony Levesque wrote:

> By what I read in some of the topics, is it you guys' opinion that Ceph
> cannot scale nicely on a full SSD cluster? Meaning that no matter how many
> OSD nodes we add, at some point you won't be able to scale past some
> throughput.

No, that's not what I'm saying, at least.
Ceph scales quite well, much better than some other distributed storage
solutions. 
The more nodes and/or OSDs, the better. 

However, those nodes need to be balanced and well designed; your original
attempt with the 1TB EVOs was limited by those SSDs.
A node with 16 (fast) SSDs is going to be limited by the CPU resources
needed to handle the potential IOPS they're capable of.
Your network might be another limiting factor at some point.

The exercise with Ceph is to deploy well-balanced storage nodes, where
"well" means the closest fit to your IOPS needs, budget and other
constraints (rack space, power).

Christian
> ---
> Anthony Lévesque
> GloboTech Communications
> Phone: 1-514-907-0050 x 208
> Toll Free: 1-(888)-GTCOMM1 x 208
> Phone Urgency: 1-(514) 907-0047
> 1-(866)-500-1555
> Fax: 1-(514)-907-0750
> alevesque@xxxxxxxxxx
> http://www.gtcomm.net
> > On Apr 30, 2015, at 9:32 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> > 
> > On Thu, 30 Apr 2015 18:01:44 -0400 Anthony Levesque wrote:
> > 
> >> I'm planning to set up 4-6 POCs in the next 2 weeks to test various
> >> scenarios here.
> >> 
> >> I'm checking to get POCs with the S3610, S3710, P3500 (seems to be new;
> >> I know the lifespan is lower) and maybe the P3700.
> >> 
> > Don't ignore the S3700; it is faster in sequential writes than the S3710
> > because it uses older, less dense flash modules, and thus has more
> > parallelism.
> > 
> > And with Ceph, especially when looking at the journals, you will hit
> > the max sequential write speed limit of the SSD long, loooong before
> > you'll hit the IOPS limit. 
> > Both due to the nature of journal writes and the little detail that
> > you'll hit the CPU performance wall before that.
> > 
> >> The speed of the 400GB P3500 seems very nice and the price is alright.
> >> The major differences between the P3700 and the P3500 will be the
> >> durability and the IOPS.
> >> 
> > Read the link below about write amplification, but that is something
> > that happens mostly on the OSD part, which in your case of 1TB EVOs is
> > already a scary prospect in my book.
> > 
> >> In both options, they are the models with the lowest price per MB/s
> >> when compared to the S series.
> >> 
> > Price per MB/s is a good start, but don't forget to factor in TBW/$ and
> > try to estimate what write loads your cluster will see.
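> > If you have a similar workload running anywhere today, the SMART
> > counters of an existing SSD give a ballpark figure for that (a sketch
> > only; attribute names vary by vendor and model):
> > ---
> > # smartctl -a /dev/sdX | egrep -i 'Host_Writes|Total_LBAs_Written|Wearout'
> > ---
> > Dividing the host writes by the drive's age gives a bytes-per-day figure
> > to compare against the TBW rating.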
> > 
> > But all of this is irrelevant if your typical write patterns will
> > exceed your CPU resources while your SSDs are bored.
> > For example this fio in a VM here:
> > ---
> > # fio --size=4G --ioengine=libaio --invalidate=1 --direct=0
> > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
> > 
> >  write: io=1381.4MB, bw=16364KB/s, iops=4091 , runt= 86419msec
> > ---
> > 
> > That will utilize all 8 3.1 GHz cores here, on a 3-node Firefly cluster
> > with 8 HDD OSDs and 4 journal SSDs (100GB S3700) per node, while the
> > journal SSDs are at 11% and the OSD HDDs at 30-40% utilization.
> > 
> > When changing that fio to direct=1, the IOPS drop to half of that.
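> > For reference, that is the same job with only the direct flag changed:
> > ---
> > # fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32
> > ---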
> > 
> > With a block size of 4MB things of course change to the OSDs being 100%
> > busy, the SSDs about 60% (they can only do 200MB/s) and with 3-4 cores
> > worth being idle or in IOwait.
> > 
> >> Model      Capacity   Price per MB/s
> >> DC S3500   120GB      $1.10
> >> DC S3500   240GB      $1.01
> >> DC S3500   300GB      $1.03
> >> DC S3500   480GB      $1.28
> >> DC S3610   200GB      $0.99
> >> DC S3610   400GB      $1.14
> >> DC S3610   480GB      $1.24
> >> DC S3710   200GB      $1.17
> >> DC P3500   400GB      $0.64
> >> DC P3700   400GB      $0.96
> >> 
> >> As a side note, the expense doesn't scare me directly. It's more that
> >> we are going in blind here, since it seems not a lot of people do full
> >> SSD setups (or share their experiences).
> >> 
> > See this:
> > http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
> > 
> > I'd suggest you try the above tests yourself, you seem to have a
> > significant amount of hardware already.
> > 
> > There are many SSD threads, but so far there's at best one example of a
> > setup going from Firefly to Giant and Hammer.
> > So for me it's hard to qualify and quantify the improvements Hammer
> > brings to SSD-based clusters other than "better", maybe about 50%,
> > which, while significant, is obviously nowhere near the raw performance
> > the hardware would be capable of.
> > 
> > But then again, my guesstimate is that, aside from the significant
> > amount of code that gets executed per Ceph IOP, any such Ceph IOP
> > results in 5-10 real IOPs down the line.
> > 
> > Christian
> > 
> >> Anyway, still brainstorming this so we can work on some POCs. Will keep
> >> you guys posted here.
> >> ---
> >> Anthony Lévesque
> >> 
> >> 
> >>> On Apr 29, 2015, at 11:27 PM, Christian Balzer <chibi@xxxxxxx> wrote:
> >>> 
> >>> 
> >>> 
> >>> Hello,
> >>> 
> >>> On Wed, 29 Apr 2015 15:01:49 -0400 Anthony Levesque wrote:
> >>> 
> >>>> We redid the test with 4MB Block Size (using the same command as
> >>>> before but with 4MB for the BS) and we are getting better results
> >>>> from all devices:
> >>>> 
> >>> That's to be expected of course.
> >>> 
> >>>> Intel DC S3500 120GB = 		148 MB/s
> >>>> Samsung Pro 128GB =		187 MB/s
> >>>> Intel 520 120GB =		154 MB/s
> >>>> Samsung EVO 1TB = 		186 MB/s
> >>>> Intel DC S3500 300GB =		250 MB/s
> >>> 
> >>> You will need to keep _both_ of these results in mind, the 4KB and
> >>> 4MB ones. For worst and best case scenarios.
> >>> And those dd tests are indicators, not a perfect replication of what
> >>> Ceph actually does. 
> >>> Looking at your original results of 920MB/s over 96 1TB EVOs those
> >>> SSDs are thus capable of handling about 20MB/s combined journal/data
> >>> traffic. Ignoring any CPU limitations of course.
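> >>> (Roughly: 920 MB/s of client writes means about twice that hitting the
> >>> SSDs for journal plus data, and 1840 MB/s spread over 96 SSDs is about
> >>> 19 MB/s each.)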
> >>> 
> >>> Unsurprisingly, the best of both worlds among the SSDs you compared
> >>> is the Intel DC S3500.
> >>> And that is the slowest at writes and the least endurable of the
> >>> Intel DC SSDs.
> >>> 
> >>> To make a (rough) comparison between Intel and Samsung, the Intel DC
> >>> S3500 are comparable to the Samsung DC Evo ones and the Intel S3700
> >>> to the Samsung DC Pro ones.
> >>> 
> >>>> I have not tested the DC S3610 yet but I will be ordering some soon
> >>> 
> >>> Those will be (for journal purposes) the worst choice when it comes
> >>> to bandwidth, as they use higher density FLASH, thus less speed at
> >>> the same size.
> >>> They are however significantly more durable than the S3500 ones at
> >>> only a slightly higher price, thus making them good candidates for a
> >>> combined journal/data SSD. IF your expected write load fits the
> >>> endurance limits.
> >>> 
> >>> See:
> >>> http://ark.intel.com/compare/75682,86640,71914
> >>> 
> >>>> Since previously we had the journal and OSD on the same SSD, I'm
> >>>> still wondering if having the journal separate from the OSD SSDs
> >>>> (with a ratio of 1:3 or 1:4) will actually bring more write speed.
> >>>> This is the configuration I was thinking of if we separate the
> >>>> journal from the OSDs:
> >>>> 
> >>> Your speed will go up of course. 
> >>> However it will not reach the fullest potential unless you put really
> >>> fast SSDs (or a PCIe, NVMe unit like this:
> >>> http://ark.intel.com/products/79624/Intel-SSD-DC-P3700-Series-400GB-12-Height-PCIe-3_0-20nm-MLC)
> >>> in there, see below.
> >>> 
> >>>> --- Each OSD Node ---
> >>>> Dual E5-2620v2 with 64GB of RAM
> >>> Underpowered CPU when dealing with small write IOPS.
> >>> If all/most your writes are nicely coalesced by the RBD cache this
> >>> may not be a problem, but without knowing what your client VMs will
> >>> do it's impossible to predict.
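> >>> For what it's worth, the client-side knobs involved are roughly these
> >>> (a sketch with the usual defaults, not a tuning recommendation):
> >>> ---
> >>> [client]
> >>> rbd cache = true
> >>> rbd cache size = 33554432
> >>> rbd cache max dirty = 25165824
> >>> rbd cache writethrough until flush = true
> >>> ---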
> >>> 
> >>>> -------------------
> >>>> HBA 9207-8i #1
> >>>> 3x 1TB Samsung for the storage layer + 1x Intel S3610 200GB for the journal
> >>>> 3x 1TB Samsung for the storage layer + 1x Intel S3610 200GB for the journal
> >>>> -------------------
> >>>> HBA 9207-8i #2
> >>>> 3x 1TB Samsung for the storage layer + 1x Intel S3610 200GB for the journal
> >>>> 3x 1TB Samsung for the storage layer + 1x Intel S3610 200GB for the journal
> >>>> -------------------
> >>>> 1x LSI RAID card + 2x 120GB SSD (for OS)
> >>>> 2x 10GbE dual port
> >>>> 
> >>> I suppose you already have that hardware except for the journal SSDs,
> >>> right? 
> >>> I would have forgone extra OS SSDs and controllers and put the OS on
> >>> the journal SSDs in a nice RAID10.
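> >>> A minimal sketch of that, assuming four journal SSDs (sda-sdd, names
> >>> are examples) each with a small first partition set aside for the OS:
> >>> ---
> >>> # mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
> >>> ---
> >>> The remaining space on each SSD is then partitioned for the journals.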
> >>> 
> >>>> There would be between 6-8 OSD nodes like this to start the cluster.
> >>>> 
> >>>> My goal would be to max out at least 20 Gbps of switch ports in writes
> >>>> to a single OpenStack compute node. (I'm still not sure about the CPU
> >>>> capacity.)
> >>>> 
> >>> Maxing out the port can come in many flavors/ways. Most of which are
> >>> not realistic scenarios, meaning that your VMs are more likely to run
> >>> out of IOPS before they run out of bandwidth. 
> >>> But to achieve something like 2GB/s writes your SSDs and/or journal
> >>> SSDs will have to handle that speed. 
> >>> 4x 200GB DC S3610s at 230MB/s each for journal SSDs are clearly not
> >>> going to do that.
> >>> 4x 400GB DC S3700s at 460MB/s each will come pretty close to that
> >>> 2GB/s, however they will cost about the same as the 2x P3700 mentioned
> >>> above. 16x 800GB DC S3610s as combined journal/data SSDs will give you
> >>> about 4GB/s, though.
> >>> If you want to keep using your 1TB EVOs (after having verified what
> >>> their top speed as data only SSD is, but with 16 I suppose it will be
> >>> sufficient), use 16 of them and 2 of the 400GB DC P3700 Series cards
> >>> mentioned above. 
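> >>> (The math behind the journal numbers above: 4 x 230 MB/s = 920 MB/s
> >>> and 4 x 460 MB/s = 1840 MB/s of journal bandwidth; for combined
> >>> journal/data SSDs only about half the raw sequential write speed is
> >>> left for client traffic.)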
> >>> 
> >>> If all of the above sounds either slow or expensive you're quite
> >>> right, the old adage of "fast, good, cheap" (good being endurance
> >>> here) still holds.
> >>> 
> >>> And once more, having a realistic idea of how much writes will happen
> >>> in that cluster will be crucial to make the right decision here.
> >>> Having to replace your cheap SSDs after a few months when DC level
> >>> SSDs at twice the price would have lasted 5 years is something to
> >>> think about.
> >>> 
> >>> Christian
> >>>> Has anyone tested a similar environment?
> >>>> 
> >>>> Anyway guys, let me know what you think since we are still testing
> >>>> this POC.
> >>>> ---
> >>>> Anthony Lévesque
> >>>> 
> >>>> 
> >>>>> On Apr 25, 2015, at 11:46 PM, Christian Balzer <chibi@xxxxxxx>
> >>>>> wrote:
> >>>>> 
> >>>>> 
> >>>>> Hello,
> >>>>> 
> >>>>> I think that the dd test isn't a 100% replica of what Ceph actually
> >>>>> does then. 
> >>>>> My suspicion would be the 4k blocks, since when people test the
> >>>>> maximum bandwidth they do it with rados bench or other tools that
> >>>>> write the optimum sized "blocks" for Ceph, 4MB ones.
> >>>>> 
> >>>>> I currently have no unused DC S3700s to do a realistic comparison
> >>>>> and the DC S3500 I have aren't used in any Ceph environment.
> >>>>> 
> >>>>> When testing a 200GB DC S3700 that has specs of 35K write IOPS and
> >>>>> 365MB/s sequential writes on a mostly idle system (but on top of
> >>>>> Ext4, not the raw device) with a 4k dd dsync test run, atop and
> >>>>> iostat show a 70% SSD utilization, 30k IOPS and 70MB/s writes. 
> >>>>> Which matches the specs perfectly.
> >>>>> If I do that test with 4MB blocks, the speed goes up to 330MB/s and
> >>>>> 90% SSD utilization according to atop, again on par with the specs.
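> >>>>> In case anyone wants to repeat that, it was essentially the dd test
> >>>>> discussed in this thread at both block sizes, with iostat watching the
> >>>>> device from a second terminal (file names and sizes are just examples):
> >>>>> ---
> >>>>> # dd if=randfile of=testfile bs=4k count=100000 oflag=direct,dsync
> >>>>> # dd if=randfile of=testfile bs=4M count=1000 oflag=direct,dsync
> >>>>> # iostat -x 2
> >>>>> ---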
> >>>>> 
> >>>>> Lastly on existing Ceph clusters with DC S3700 SSDs as journals and
> >>>>> rados bench and its 4MB default size that pattern continues.
> >>>>> Smaller sizes with rados naturally (at least on my hardware and
> >>>>> Ceph version, Firefly) run into the limitations of Ceph long
> >>>>> before they hit the SSDs (nearly 100% busy cores, journals at
> >>>>> 4-8%, OSD HDDs anywhere from 50-100%).
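> >>>>> For those who want to reproduce that comparison, something along
> >>>>> these lines should do (pool name and runtime are just examples):
> >>>>> ---
> >>>>> # rados -p rbd bench 60 write -t 32
> >>>>> # rados -p rbd bench 60 write -t 32 -b 4096
> >>>>> ---
> >>>>> The first run uses the 4MB default object size, the second 4KB.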
> >>>>> 
> >>>>> Of course using the same dd test over all brands will still give
> >>>>> you a good comparison of the SSDs capabilities.
> >>>>> But translating that into actual Ceph journal performance is
> >>>>> another thing.
> >>>>> 
> >>>>> Christian
> >>>>> 
> >>>>> On Sat, 25 Apr 2015 18:32:30 +0200 (CEST) Alexandre DERUMIER wrote:
> >>>>> 
> >>>>>> I'm able to reach around 20000-25000 IOPS with 4k blocks with the
> >>>>>> S3500 (with O_DSYNC) (so yes, around 80-100 MB/s).
> >>>>>> 
> >>>>>> I'll bench the new S3610 soon to compare.
> >>>>>> 
> >>>>>> 
> >>>>>> ----- Original Message -----
> >>>>>> From: "Anthony Levesque" <alevesque@xxxxxxxxxx>
> >>>>>> To: "Christian Balzer" <chibi@xxxxxxx>
> >>>>>> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> >>>>>> Sent: Friday, 24 April 2015 22:00:44
> >>>>>> Subject: Re:  Possible improvements for a slow write
> >>>>>> speed (excluding independent SSD journals)
> >>>>>> 
> >>>>>> Hi Christian, 
> >>>>>> 
> >>>>>> We tested some DC S3500 300GB using dd if=randfile of=/dev/sda
> >>>>>> bs=4k count=100000 oflag=direct,dsync 
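> >>>>>> (randfile is just a pre-generated file of random data, e.g. created
> >>>>>> beforehand with something like dd if=/dev/urandom of=randfile bs=1M
> >>>>>> count=1024.) 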
> >>>>>> 
> >>>>>> We got 96 MB/s, which is far from the 315 MB/s listed on the website. 
> >>>>>> 
> >>>>>> Can I ask you or anyone on the mailing list how you are testing
> >>>>>> the write speed for journals? 
> >>>>>> 
> >>>>>> Thanks 
> >>>>>> --- 
> >>>>>> Anthony Lévesque 
> >>>>>> GloboTech Communications 
> >>>>>> Phone: 1-514-907-0050 x 208 
> >>>>>> Toll Free: 1-(888)-GTCOMM1 x 208 
> >>>>>> Phone Urgency: 1-(514) 907-0047 
> >>>>>> 1-(866)-500-1555 
> >>>>>> Fax: 1-(514)-907-0750 
> >>>>>> alevesque@xxxxxxxxxx 
> >>>>>> http://www.gtcomm.net 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> On Apr 23, 2015, at 9:05 PM, Christian Balzer < chibi@xxxxxxx >
> >>>>>> wrote: 
> >>>>>> 
> >>>>>> 
> >>>>>> Hello, 
> >>>>>> 
> >>>>>> On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote: 
> >>>>>> 
> >>>>>> 
> >>>>>> To update you on the current test in our lab: 
> >>>>>> 
> >>>>>> 1. We tested the Samsung OSDs in recovery mode and the speed was
> >>>>>> able to max out 2x 10GbE ports (transferring data at 2200+ MB/s
> >>>>>> during recovery). So for normal write operations without O_DSYNC
> >>>>>> writes the Samsung drives seem OK. 
> >>>>>> 
> >>>>>> 2. We then tested a couple of different models of SSD we had in
> >>>>>> stock with the following command: 
> >>>>>> 
> >>>>>> dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync 
> >>>>>> 
> >>>>>> This was from a blog written by Sebastien Han and I think it should
> >>>>>> be able to show how the drives would perform with O_DSYNC writes.
> >>>>>> For people interested, here are the results of what we tested: 
> >>>>>> 
> >>>>>> Intel DC S3500 120GB = 114 MB/s 
> >>>>>> Samsung Pro 128GB = 2.4 MB/s 
> >>>>>> WD Black 1TB (HDD) = 409 KB/s 
> >>>>>> Intel 330 120GB = 105 MB/s 
> >>>>>> Intel 520 120GB = 9.4 MB/s 
> >>>>>> Intel 335 80GB = 9.4 MB/s 
> >>>>>> Samsung EVO 1TB = 2.5 MB/s 
> >>>>>> Intel 320 120GB = 78 MB/s 
> >>>>>> OCZ Revo Drive 240GB = 60.8 MB/s 
> >>>>>> 4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s 
> >>>>>> 
> >>>>>> 
> >>>>>> 
> >>>>>> No real surprises here, but a nice summary nonetheless. 
> >>>>>> 
> >>>>>> You _really_ want to avoid consumer SSDs for journals and have a
> >>>>>> good idea on how much data you'll write per day and how long you
> >>>>>> expect your SSDs to last (the TBW/$ ratio). 
> >>>>>> 
> >>>>>> 
> >>>>>> Please let us know if the command we ran was not optimal to test
> >>>>>> O_DSYNC writes. 
> >>>>>> 
> >>>>>> We ordered larger drives from the Intel DC series to see if we could
> >>>>>> get more than 200 MB/s per SSD. We will keep you posted on the tests
> >>>>>> if that interests you guys. We didn't test multiple parallel runs yet
> >>>>>> (to simulate multiple journals on one SSD). 
> >>>>>> 
> >>>>>> 
> >>>>>> You can totally trust the numbers on Intel's site: 
> >>>>>> http://ark.intel.com/products/family/83425/Data-Center-SSDs 
> >>>>>> 
> >>>>>> The S3500s are by far the slowest and have the lowest endurance. 
> >>>>>> Again, depending on your expected write level the S3610 or S3700
> >>>>>> models are going to be a better fit regarding price/performance. 
> >>>>>> Especially when you consider that losing a journal SSD will
> >>>>>> result in several dead OSDs. 
> >>>>>> 
> >>>>>> 
> >>>>>> 3. We removed the journal from all Samsung OSDs and put 2x Intel 330
> >>>>>> 120GB in all 6 nodes to test. The overall speed we were getting
> >>>>>> from the rados bench went from 1000 MB/s (approx.) to 450 MB/s,
> >>>>>> which might only be because the Intels cannot do too much in terms
> >>>>>> of journaling (they were tested at around 100 MB/s). It will be
> >>>>>> interesting to test with bigger Intel DC S3500 drives (and more
> >>>>>> journals) per node to see if I can get back up to 1000 MB/s or even
> >>>>>> surpass it. 
> >>>>>> 
> >>>>>> We also wanted to test if the CPU could be a huge bottleneck, so
> >>>>>> we swapped the dual E5-2620v2 from node #6 and replaced them with
> >>>>>> dual E5-2609v2 (which are much smaller in cores and speed), and the
> >>>>>> 450 MB/s we got from the rados bench went even lower, to 180 MB/s. 
> >>>>>> 
> >>>>>> 
> >>>>>> You really don't have to swap CPUs around, monitor things with
> >>>>>> atop or other tools to see where your bottlenecks are. 
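> >>>>>> For example, while a rados bench is running, something as simple as
> >>>>>> this on each OSD node will show whether the cores or the disks are
> >>>>>> the ones pegged (tool choice is just a suggestion): 
> >>>>>> ---
> >>>>>> # atop 2
> >>>>>> # iostat -x 2
> >>>>>> ---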
> >>>>>> 
> >>>>>> 
> >>>>>> So I'm wondering if the 1000 MB/s we got when the journal was shared
> >>>>>> on the OSD SSDs was not limited by the CPUs (even though the
> >>>>>> Samsungs are not good for journals in the long run) and not just
> >>>>>> by the fact that Samsung SSDs are bad at O_DSYNC writes (or maybe
> >>>>>> both). It is probable that 16 SSD OSDs per node in a full SSD
> >>>>>> cluster is too much and the major bottleneck will come from the CPU. 
> >>>>>> 
> >>>>>> 
> >>>>>> That's what I kept saying. ^.^ 
> >>>>>> 
> >>>>>> 
> >>>>>> 4. I'm wondering, if we find good SSDs for the journal and keep the
> >>>>>> Samsungs for normal writes and reads (we can saturate 20GbE easily
> >>>>>> with a read benchmark; we will test 40GbE soon), whether the cluster
> >>>>>> will stay healthy, since the Samsungs seem to get burnt by O_DSYNC
> >>>>>> writes. 
> >>>>>> 
> >>>>>> 
> >>>>>> They will get burned, as in have their cells worn out by any and
> >>>>>> all writes. 
> >>>>>> 
> >>>>>> 
> >>>>>> 5. In terms of HBA controllers, have you guys made any tests for a
> >>>>>> full SSD cluster or even just for SSD journals? 
> >>>>>> 
> >>>>>> 
> >>>>>> If you have separate journals and OSDs, it often makes good sense
> >>>>>> to have them on separate controllers as well. 
> >>>>>> It all depends on the density of your setup and the capabilities of
> >>>>>> the controllers. 
> >>>>>> LSI HBAs in IT mode are a known and working entity. 
> >>>>>> 
> >>>>>> Christian 
> >>>>> 
> >>>>> 
> >>>>> -- 
> >>>>> Christian Balzer        Network/Systems Engineer                
> >>>>> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> >>>>> http://www.gol.com/
> >>>> 
> >>> 
> >>> 
> >>> -- 
> >>> Christian Balzer        Network/Systems Engineer                
> >>> chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> >>> http://www.gol.com/
> >> 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




