Hello!

On Wed, May 25, 2016 at 11:45:29AM +0900, chibi wrote:

> Hello,
>
> On Tue, 24 May 2016 21:20:49 +0300 Max A. Krasilnikov wrote:
>
>> Hello!
>>
>> I have a cluster with 5 SSD drives as OSDs, backed by SSD journals, one
>> per OSD. One OSD per node.
>>
> More details will help identify other potential bottlenecks, such as:
> CPU/RAM
> Kernel, OS version.

For now I have 3x (OpenStack controller + Ceph mon + 8x OSD, one per SSD).
All running Ubuntu 14.04 + Hammer from ubuntu-cloud, now moving to
Ubuntu 14.04 + Ceph Jewel from the Ceph site.
E5-2620 v2 (12 cores)
32G RAM
Linux 4.2.0, moving to 4.4 from Xenial.

>> Data drives are Samsung 850 EVO 1TB, journals are Samsung 850 EVO 250G;
>> the journal partition is 24GB, the data partition is 790GB. OSD nodes are
>> connected by 2x10Gbps Linux bonding for the data/cluster network.
>>
> As Oliver wrote, these SSDs are totally unsuited for usage with Ceph,
> especially regarding journals.
> But also in general, since they don't handle IOPS in a consistent,
> predictable manner.
> And they're not durable (endurance, TBW) enough either.

Yep, I understand. But on a second cluster with ScaleIO they do much
better :(

> When using SSDs or NVMes, use DC level ones exclusively. Intel is the
> most tested one in these parts, but the Samsung DC level ones ought to
> be fine, too.

I can hope my employer will provide me with them, but for now I have to do
the best I can with the current hardware :(

>
>> When doing random writes with 4k blocks with direct=1, buffered=0,
>> iodepth=32..1024, ioengine=libaio from a Nova qemu virthost, I can get no
>> more than 9 kiops. Randread is about 13-15 kiops.
>>
>> The trouble is that randwrite does not depend on iodepth. read and write
>> can be up to 140 kiops, randread up to 15 kiops, but randwrite is always
>> 2-9 kiops.
>>
> Aside from the limitations of your SSDs, there are other factors, like CPU
> utilization.
> And, very importantly, network latency, but that mostly affects
> single-threaded IOPS.

>> The Ceph cluster is a mix of Jewel and Hammer, now being upgraded to
>> Jewel. On Hammer I got the same results.
>>
> Mixed is a very bad state for a cluster to be in.
> Jewel has lots of improvements in that area, but without decent hardware
> you may not see them.

My cluster is upgrading now: 2 OSDs per night :), one node per week, with
the old 850 EVOs being replaced by new ones.

>> All journals can do up to 32 kiops with the same fio config.
>>
>> I am confused because EMC ScaleIO can do many more IOPS, which is
>> bothering my boss :)
>>
> There are lots of discussions and slides on how to improve/maximize IOPS
> with Ceph; go search for them.
> Fast CPUs, jemalloc, pinning, configuration, NVMes for journals, etc.

I have seen a lot of them. I will try pinning; I have never used it before.

> Christian
> --
> Christian Balzer        Network/Systems Engineer
> chibi@xxxxxxx           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/

--
WBR, Max A. Krasilnikov
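
For reference, the 4k random-write test discussed above corresponds to an fio
job roughly like the sketch below. Only bs, direct, buffered, iodepth and
ioengine are taken from the parameters quoted in the thread; the job name,
target filename, size, runtime and numjobs are placeholders, not the values
actually used.

; sketch of the 4k random-write benchmark discussed in the thread
; filename, size, runtime and numjobs are placeholders
[randwrite-4k]
ioengine=libaio
direct=1
buffered=0
rw=randwrite
bs=4k
iodepth=32
numjobs=1
filename=/root/fio-test.dat
size=4G
runtime=60
time_based
group_reporting

Raising the queue depth (32..1024 in the tests above) only means changing the
iodepth line; per the thread, randwrite barely responds to it, while
sequential read/write and randread scale much further.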