Re: SSD randwrite performance

Hello,

On Wed, 25 May 2016 11:30:03 +0300 Max A. Krasilnikov wrote:

> Hello! 
> 
> On Wed, May 25, 2016 at 11:45:29AM +0900, chibi wrote:
> 
> 
> > Hello,
> 
> > On Tue, 24 May 2016 21:20:49 +0300 Max A. Krasilnikov wrote:
> 
> >> Hello!
> >> 
> >> I have a cluster with 5 SSD drives as OSDs, backed by SSD journals, one
> >> per OSD. One OSD per node.
> >> 
> > More details will help identify other potential bottlenecks, such as:
> > CPU/RAM
> > Kernel, OS version.
> 
> For now I have 3x (Openstack controller + ceph mon + 8x OSD, one of them on
> SSD). All running Ubuntu 14.04 + Hammer from ubuntu-cloud, now moving to
> Ubuntu 14.04 + Ceph Jewel from the Ceph site.
> E5-2620 v2 (12 cores)
With SSDs faster cores are definitely better, but as said, that's probably
not your main problem.
Setting the CPU frequency governor to "performance" helps with latency.
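
Something along these lines (as root; the sysfs paths can vary a bit with
kernel and driver, so treat this as a sketch) switches all cores over:

    # check what the cores are currently using
    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    # set all of them to "performance"
    for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > "$g"
    done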

Again, have you run atop on your OSD nodes while doing those tests? 
Are the SSDs very busy (near/at 100%) or is it the CPUs?
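
If atop isn't installed, something like iostat from sysstat gives a rough
idea while a fio run is going (the device names below are just placeholders
for your journal/data SSDs):

    # 2-second samples; watch %util on the SSDs and %user/%system on the CPUs
    iostat -x 2 sdb sdc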

> 32G RAM
So this is just one OSD per node, right? Should be enough then.

> Linux 4.2.0, moving to 4.4 from Xenial.
> 
> >> Data drives are Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G,
> >> journal partition is 24GB, data partition is 790GB. OSD nodes are
> >> connected via 2x10Gbps Linux bonding for the data/cluster network.
> >>
> > As Oliver wrote, these SSDs are totally unsuited for use with Ceph,
> > especially as journals.
> > But also in general, since they don't handle IOPS in a consistent,
> > predictable manner.
> > And they're not durable (endurance, TBW) enough either.
> 
> Yep, I understand. But on the second cluster with ScaleIO they do much
> better :(
> 
Well, if one believes the hype (mostly from EMC, though) about ScaleIO, it's
n times better than Ceph and even better than sliced bread. </sarcasm>

But even if the ScaleIO code/design/architecture is so much better than Ceph,
these SSDs are still not something you ever want to use in a production
environment: they have unpredictable performance (and potentially
degradation) and, most of all, will wear out quickly.
There have also been reports here of EVOs dying long before they were
supposed to according to their wear-out levels.

Lastly, no matter what ScaleIO does, at some point it had better do a SYNC
write to its "disks" to have a safe checkpoint, so that part of these SSDs'
performance comes to bear as well.
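
The usual quick check for journal suitability is a sync write fio run
straight against the journal device or partition (destructive if pointed at
a device in use, so use a spare partition; /dev/sdX1 and the runtime below
are placeholders):

    fio --name=journal-test --filename=/dev/sdX1 --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 \
        --group_reporting

Consumer SSDs tend to fall apart on exactly this kind of load.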


> > When using SSDs or NVMes, use DC level ones exclusively. Intel is the
> > most tested around these parts, but the Samsung DC level ones ought to
> > be fine, too.
> 
> I can hope my employer will provide me with them, but for now I have to
> do the best I can with the current hardware :(
> 
Then you have no real way to improve things much.

> >  
> >> When doing random writes with 4k blocks with direct=1, buffered=0,
> >> iodepth=32..1024, ioengine=libaio from a nova qemu virthost I can get no
> >> more than 9 kIOPS. Randread is about 13-15 kIOPS.
> >> 
> >> Trouble is that randwrite does not depend on iodepth. Sequential read and
> >> write can be up to 140 kIOPS, randread up to 15 kIOPS, while randwrite is
> >> always 2-9 kIOPS.
> >> 
> > Aside from the limitations of your SSDs, there are other factors, like
> > CPU utilization.
> > And, very importantly, network latency, though that mostly affects
> > single-threaded IOPS.
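
To see that latency floor, compare your iodepth=32 run with a single
outstanding IO from the same VM, roughly like this (/dev/vdb and the runtime
are just placeholders):

    fio --name=lat-test --filename=/dev/vdb --direct=1 --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=1 --runtime=60

With iodepth=1 every write has to make the full round trip (client ->
primary OSD -> replica OSDs -> ack), so this is dominated by network and
journal latency.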
> 
> >> The Ceph cluster is a mix of Jewel and Hammer, upgrading now to Jewel.
> >> On Hammer I got the same results.
> >> 
> > Mixed is a very bad state for a cluster to be in.
> 
> > Jewel has lots of improvements in that area, but w/o decent hardware
> > you may not see them.
> 
> My cluster is upgrading now, 2 OSDs per night :), one node per week,
> replacing the old 850 EVOs with new ones.
> 
Test again with a full Jewel cluster.
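
Once the last node is done, it's worth confirming everything really runs
Jewel before re-testing, e.g. (the mon id is usually, but not always, the
short hostname):

    ceph tell osd.* version
    # and on each mon node, via the admin socket:
    ceph daemon mon.$(hostname -s) version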

> >> All journals can do up to 32 kIOPS with the same fio config.
> >> 
> >> I am confused because EMC ScaleIO can do many more IOPS, which is
> >> bothering my boss :)
> >> 
> > There are lots of discussions and slides on how to improve/maximize IOPS
> > with Ceph, go search for them.
> 
> > Fast CPUs, jemalloc, pinning, configuration, NVMes for journals, etc.
> 
> I have seen a lot of them. I will try pinning; I have never used it
> before.
> 
Pinning is one of the last things you do; it won't help you much unless
your (fast!) CPUs are already maxed out while your SSDs are not.
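
If you do get there, a crude sketch with taskset (the 0-5 core list is just
an assumption, i.e. all OSDs on the first six cores; adjust to your
socket/NUMA layout):

    # apply to all threads (-a) of every running ceph-osd
    for p in $(pidof ceph-osd); do
        taskset -acp 0-5 "$p"
    done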

Christian 
> > Christian
> > -- 
> > Christian Balzer        Network/Systems Engineer                
> > chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


