Re: SSD randwrite performance

"Max A. Krasilnikov" <pseudo@xxxxxxxxxxxx> · Thu, 26 May 2016 11:45:21 +0300

Hello! 

On Thu, May 26, 2016 at 04:01:27PM +0900, chibi wrote:

> >>> I have cluster with 5 SSD drives as OSD backed by SSD journals, one
> >>> per osd. One osd per node.
> >>> 
> >> More details will help identify other potential bottlenecks, such as:
> >> CPU/RAM
> >> Kernel, OS version.
>> 
>> For now I have 3x(Openstack controller + ceph mon + 8xOSD (one for
>> SSD)). All running Ubuntu 14.04+Hammer from ubuntu-cloud, now moving to
>> Ubuntu 14.04+Ceph Jewel from Ceph site.
>> E5-2620 v2 (12 cores)
> With SSDs faster cores are definitely better, but as said, that's not your
> main problem probably.
> Setting the governor to "performance" helps with latency.

root@storage001:~# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
performance

The same on all processors.
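
A minimal sketch for checking every core at once instead of only cpu0. The function name governor_summary and its optional base-directory argument are hypothetical, added here only for illustration; the sysfs path is the one shown above.

```shell
# Count how many CPUs use each cpufreq governor.
# The optional first argument (base directory) is an assumption added so the
# function can be pointed at a copy of the tree; it defaults to the real sysfs path.
governor_summary() {
    base="${1:-/sys/devices/system/cpu}"
    # cpu[0-9]* matches cpu0, cpu1, ... but skips the cpufreq and cpuidle directories.
    cat "$base"/cpu[0-9]*/cpufreq/scaling_governor 2>/dev/null | sort | uniq -c
}
```

Running governor_summary with no arguments prints one line per governor with a count in front; a single line naming "performance" means all cores are set the same way.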

> Again, have you run atop on your OSD nodes while doing those tests? 
> Are the SSDs very busy (near/at 100%) or is it the CPUs?

Yes, sometimes over 100% busy (101% is not rare when backfilling). But only
60-100 MB/s at the same time, and fewer than 1000 writes per second.
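
For cross-checking such numbers without atop, here is a rough sketch that derives writes/s, MB/s and busy percentage from two snapshots of /proc/diskstats, assuming the busy figure above refers to the disks. The function name disk_rates and the snapshot file names are hypothetical. Field positions follow the kernel's diskstats layout: column 8 is writes completed, column 10 is sectors written (512-byte units), column 13 is milliseconds spent doing I/O.

```shell
# Compute write rate and utilization for one device from two diskstats snapshots.
# Arguments: device name, "before" snapshot, "after" snapshot, seconds between them.
disk_rates() {
    dev="$1"; before="$2"; after="$3"; secs="$4"
    awk -v dev="$dev" -v secs="$secs" '
        $3 != dev { next }                                  # skip other devices
        NR == FNR { w0 = $8; s0 = $10; b0 = $13; next }     # first file: remember counters
        {
            printf "writes/s=%.0f MB/s=%.1f util=%.1f%%\n",
                ($8 - w0) / secs,
                ($10 - s0) * 512 / secs / 1000000,
                ($13 - b0) / (secs * 10)                    # ms busy / ms elapsed * 100
        }
    ' "$before" "$after"
}

# Example use (sdX is a placeholder device name):
#   cat /proc/diskstats > /tmp/ds.before; sleep 10; cat /proc/diskstats > /tmp/ds.after
#   disk_rates sdX /tmp/ds.before /tmp/ds.after 10
```

A device that sits near 100% busy while moving well under 1000 writes per second is spending a long time on each write, which would point at the drive rather than at the CPUs.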

>> 32G RAM
> So this is just one OSD per node, right? Should be enough then.

I would be glad to say "yes", but no. The 32G of RAM serves 8 OSDs per node plus the
Openstack controller and Openstack network node. Of these OSDs, 3 are 6TB HDDs,
1 is a 1TB SSD, and 4 are 2TB HDDs.
The journals are partitions on 2 SSDs. The first partitions of these SSDs are
combined into a Linux mdraid level 1 array for the system.

>> Linux 4.2.0, moving to 4.4 from Xenial.
>> 
> >>> Data drives are Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G,
> >>> journal partition is 24GB, data partition is 790GB. OSD nodes
> >>> connected by 2x10Gbps linux bonding for data/cluster network.
> >>>
> >> As Oliver wrote, these SSDs are totally unsuited for usage with Ceph,
> >> especially with regard to journals. 
> >> But also in general, since they're neither handling IOPS in a
> >> consistent, predictable manner.
> >> And they're not durable (endurance, TBW) enough either.
>> 
>> Yep, I understand. But on second cluster w/ ScaleIO they do much
>> better :(
>> 
> Well, if one believes the hype (mostly from EMC though) about ScaleIO it's
> n times better than Ceph and even better than sliced bread. </sarcasm>

> But even if ScaleIO code/design/architecture is so much better than Ceph, 
> these SSDs are still not something you ever want to use in a production
> environment, they have unpredictable performance (and degradation
> potentially) and most of all will wear out quickly.
> Also there have been reports here with EVOs dying long long before they
> were supposed according to their wear-out levels.

> Lastly, no matter what ScaleIO does, at some point it better do a SYNC
> write to its "disks" to have a safe checkpoint, so that performance part
> of these SSDs comes to bear as well.

As I understand it, sio (ScaleIO) does not sync unbuffered writes. But my employer
believes in miracles.
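
To see how a drive behaves under the synchronous small writes discussed above, a simple sketch using GNU dd could look like this. The function name sync_write_test and its arguments are hypothetical. It writes to a file on the filesystem under test rather than to a raw device, so it does not destroy data; it assumes GNU dd and a filesystem that accepts O_DIRECT.

```shell
# Write COUNT blocks of 4k, each one synced to stable storage before the next.
# Arguments: target file, block count (default 1000), dd oflag value
# (default direct,dsync: bypass the page cache and sync every write).
sync_write_test() {
    target="$1"
    count="${2:-1000}"
    flags="${3:-direct,dsync}"
    dd if=/dev/zero of="$target" bs=4k count="$count" oflag="$flags"
}

# Example use (the path is a placeholder on the SSD to be tested):
#   sync_write_test /mnt/ssd/dd-sync-test 10000
```

dd reports elapsed time and throughput on stderr when it finishes. Dividing the block count by the elapsed time gives an approximate figure for synchronous 4k writes per second, which is the pattern a journal produces.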

>> My cluster is upgrading now. 2 OSD per night :), one node per week, with
>> changing old 850EVO to new ones.
>> 
> Test again with a full Jewel cluster.

I will. And I will report.

-- 
WBR, Max A. Krasilnikov
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com