Re: Possible improvements for a slow write speed (excluding independent SSD journals)

We redid the test with a 4MB block size (the same command as before, just with bs=4M; the exact invocation is shown after the results below) and we are getting better results from all devices:

Intel DC S3500 120GB  = 148 MB/s
Samsung Pro 128GB     = 187 MB/s
Intel 520 120GB       = 154 MB/s
Samsung EVO 1TB       = 186 MB/s
Intel DC S3500 300GB  = 250 MB/s
I have not tested the DC S3610 yet, but I will be ordering some soon.
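
For reference, this is the same dd test quoted further down in the thread (from Sebastien Han's blog post), only with a 4MB block size. The device path is an example, and the count is only a suggestion (the original count=100000 would write ~400GB at this block size); note it writes directly to the raw device and assumes randfile is large enough:

dd if=randfile of=/dev/sda bs=4M count=1000 oflag=direct,dsync
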
Since we previously had the journal and OSD on the same SSD, I'm still wondering whether having the journal on a separate SSD (with a ratio of 1:3 or 1:4) will actually bring more write speed.
This is the configuration I was thinking of if we separate the journal from the OSDs:

=== Each OSD Node ===
Dual E5-2620v2 with 64GB of RAM
-------------------
HBA 9207-8i #1
3x Samsung 1TB for the storage layer + 1x Intel S3610 200GB for the journal
3x Samsung 1TB for the storage layer + 1x Intel S3610 200GB for the journal
-------------------
HBA 9207-8i #2
3x Samsung 1TB for the storage layer + 1x Intel S3610 200GB for the journal
3x Samsung 1TB for the storage layer + 1x Intel S3610 200GB for the journal
-------------------
1x LSI RAID card + 2x 120GB SSD (for OS)
2x dual-port 10GbE NICs

There would be between 6 and 8 OSD nodes like this to start the cluster.
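
For what it's worth, with that 1:3 layout each S3610 would end up holding three journal partitions. A minimal provisioning sketch, assuming the OSDs are created with ceph-disk (device names are made up):

# /dev/sdb-/dev/sdd are three Samsung data disks, /dev/sde is their S3610 journal SSD
ceph-disk prepare /dev/sdb /dev/sde
ceph-disk prepare /dev/sdc /dev/sde
ceph-disk prepare /dev/sdd /dev/sde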

My goal would be to max out at least 20 Gbps of switch ports in writes to a single OpenStack compute node. (I'm still not sure about the CPU capacity.)
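
As a rough sanity check on that goal (a back-of-the-envelope only, assuming the default 3x replication and FileStore's journal double-write):

20 Gbps of client writes   ~ 2.5 GB/s into the cluster
x3 replication             ~ 7.5 GB/s landing on OSDs cluster-wide
journal double-write       ~ another 7.5 GB/s through the journal SSDs
spread over 6-8 nodes      ~ 0.9-1.25 GB/s of journal traffic per node,
                             i.e. roughly 235-310 MB/s per journal SSD
                             with 4 journal SSDs per node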

Has anyone tested a similar environment?

Anyway, let me know what you think, since we are still testing this POC.
---
Anthony Lévesque


On Apr 25, 2015, at 11:46 PM, Christian Balzer <chibi@xxxxxxx> wrote:


Hello,

I think the dd test isn't a 100% replica of what Ceph actually does, then.
My suspicion would be the 4k blocks, since when people test the maximum
bandwidth they do it with rados bench or other tools that write the
optimum-sized "blocks" for Ceph, 4MB ones.

I currently have no unused DC S3700s to do a realistic comparison, and the
DC S3500s I have aren't used in any Ceph environment.

When testing a 200GB DC S3700 (specced at 35K write IOPS and 365MB/s
sequential writes) on a mostly idle system (but on top of ext4, not the
raw device) with a 4k dd dsync test run, atop and iostat show about 70%
SSD utilization, 30k IOPS and 70MB/s writes, which matches the specs
perfectly.
If I do that test with 4MB blocks, the speed goes up to 330MB/s and 90%
SSD utilization according to atop, again on par with the specs.

Lastly, on existing Ceph clusters with DC S3700 SSDs as journals, that
pattern continues with rados bench and its 4MB default object size.
Smaller sizes with rados bench naturally (at least on my hardware and Ceph
version, Firefly) run into the limitations of Ceph long before they hit
the SSDs (nearly 100% busy cores, journals at 4-8%, OSD HDDs anywhere from
50-100% utilization).
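
For reference, the kind of rados bench invocation meant here looks roughly like this (pool name, runtime and thread count are just examples):

rados bench -p testpool 60 write -t 16           # 4MB objects by default
rados bench -p testpool 60 write -t 16 -b 4096   # same test with 4k writes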

Of course using the same dd test across all brands will still give you a
good comparison of the SSDs' capabilities, but translating that into
actual Ceph journal performance is another thing.

Christian

On Sat, 25 Apr 2015 18:32:30 +0200 (CEST) Alexandre DERUMIER wrote:

I'm able to reach around 20,000-25,000 IOPS with 4k blocks on the S3500
(with O_DSYNC), so yes, around 80-100 MB/s.

I'll bench the new S3610 soon to compare.


----- Original Message -----
From: "Anthony Levesque" <alevesque@xxxxxxxxxx>
To: "Christian Balzer" <chibi@xxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Friday, April 24, 2015 22:00:44
Subject: Re: Possible improvements for a slow write speed (excluding independent SSD journals)

Hi Christian,

We tested some DC S3500 300GB drives using:

dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync

We got 96 MB/s, which is far from the 315 MB/s listed on the website.

Can I ask you or anyone on the mailing list how you are testing the
write speed for journals?

Thanks
---
Anthony Lévesque
GloboTech Communications
Phone: 1-514-907-0050 x 208
Toll Free: 1-(888)-GTCOMM1 x 208
Phone Urgency: 1-(514) 907-0047
1-(866)-500-1555
Fax: 1-(514)-907-0750
alevesque@xxxxxxxxxx
http://www.gtcomm.net




On Apr 23, 2015, at 9:05 PM, Christian Balzer < chibi@xxxxxxx > wrote:


Hello,

On Thu, 23 Apr 2015 18:40:38 -0400 Anthony Levesque wrote:


> To update you on the current tests in our lab:
>
> 1. We tested the Samsung OSDs in recovery mode and the speed was able to
> max out 2x 10GbE ports (transferring data at 2200+ MB/s during recovery).
> So for normal write operations without O_DSYNC writes, the Samsung drives
> seem OK.
>
> 2. We then tested a couple of different models of SSDs we had in stock
> with the following command:
>
> dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>
> This was from a blog post written by Sebastien Han and I think it should
> show how the drives perform for O_DSYNC writes. For people interested,
> here are some results of what we tested:
>
> Intel DC S3500 120GB = 114 MB/s
> Samsung Pro 128GB = 2.4 MB/s
> WD Black 1TB (HDD) = 409 KB/s
> Intel 330 120GB = 105 MB/s
> Intel 520 120GB = 9.4 MB/s
> Intel 335 80GB = 9.4 MB/s
> Samsung EVO 1TB = 2.5 MB/s
> Intel 320 120GB = 78 MB/s
> OCZ Revo Drive 240GB = 60.8 MB/s
> 4x Samsung EVO 1TB LSI RAID0 HW + BBU = 28.4 MB/s



No real surprises here, but a nice summary nonetheless.

You _really_ want to avoid consumer SSDs for journals and have a good
idea of how much data you'll write per day and how long you expect your
SSDs to last (the TBW/$ ratio).
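
As a purely illustrative calculation (the numbers here are made up, not
taken from any datasheet): a journal SSD sitting in front of three OSDs
that together absorb 300 GB of client writes per day sees roughly
300 GB/day of journal writes. A drive rated for 3.6 PBW would last about
33 years at that rate, while a drive rated for 70 TBW would be worn out in
well under a year. That difference is what the TBW/$ ratio captures.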


> Please let us know if the command we ran was not optimal for testing
> O_DSYNC writes.
>
> We ordered larger drives from the Intel DC series to see if we could get
> more than 200 MB/s per SSD. We will keep you posted on the tests if that
> interests you guys. We didn't run multiple parallel tests yet (to
> simulate multiple journals on one SSD).

You can totally trust the numbers on Intel's site:
http://ark.intel.com/products/family/83425/Data-Center-SSDs

The S3500s are by far the slowest and have the lowest endurance.
Again, depending on your expected write level, the S3610 or S3700 models
are going to be a better fit regarding price/performance, especially when
you consider that losing a journal SSD will result in several dead OSDs.
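
Regarding the point above about not yet having simulated multiple journals
on one SSD: one way to approximate that (a sketch only, with an example
device path, not something from this thread) is several parallel
synchronous writers via fio:

# WARNING: writes directly to the raw device and will destroy its contents
fio --name=journal-sim --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=4 --iodepth=1 \
    --runtime=60 --time_based --group_reporting

Each of the four jobs behaves roughly like one journal doing small
synchronous writes, so the aggregate bandwidth gives a feel for how the
drive copes with several journals at once.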


> 3. We removed the journal from all Samsung OSDs and put 2x Intel 330
> 120GB in all 6 nodes to test. The overall speed we were getting from the
> rados bench went from roughly 1000 MB/s down to 450 MB/s, which might
> only be because the Intels cannot do much in terms of journaling (they
> tested at around 100 MB/s). It will be interesting to test with bigger
> Intel DC S3500 drives (and more journals) per node to see if I can get
> back up to 1000 MB/s or even surpass it.
>
> We also wanted to test whether the CPU could be a huge bottleneck, so we
> swapped the dual E5-2620v2 out of node #6 and replaced them with dual
> E5-2609v2 (which are much smaller in cores and clock speed), and the
> 450 MB/s we got from the rados bench dropped even lower, to 180 MB/s.

You really don't have to swap CPUs around; monitor things with atop or
other tools to see where your bottlenecks are.
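
For example, something as simple as the following, left running during a
rados bench, already shows whether the CPU cores, the journal SSDs or the
data disks are the busy part:

iostat -x 2
atop 2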


> So I'm wondering whether the 1000 MB/s we got when the journal was shared
> on the OSD SSD was limited by the CPUs (even though the Samsungs are not
> good for journals in the long run) rather than just by the fact that
> Samsung SSDs are bad at O_DSYNC writes (or maybe both). It is probable
> that 16 SSD OSDs per node in a full-SSD cluster is too much, and the
> major bottleneck will be the CPU.

That's what I kept saying. ^.^


> 4. I'm wondering, if we find good SSDs for the journal and keep the
> Samsungs for normal writes and reads (we can saturate 20GbE easily with a
> read benchmark; we will test 40GbE soon), whether the cluster will stay
> healthy, since the Samsungs seem to get burnt by O_DSYNC writes.

They will get burned, as in have their cells worn out by any and all
writes.


> 5. In terms of HBA controllers, have you made any tests for a full-SSD
> cluster, or even just for SSD journals?

If you have separate journals and OSDs, it often makes good sense to
have them on separate controllers as well.
It all depends on the density of your setup and the capabilities of the
controllers.
LSI HBAs in IT mode are a known and working entity.

Christian


--
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx    Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
