Re: fio librbd result is poor

All of our DC S3500 and S3510 drives ran out of writes this week after 1.5 years in production as journal drives for 4 disks each. Having 43 drives report less than 1% of their writes left is scary. I'd recommend setting up a monitoring check for the durability of your SSDs in Ceph.
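
A minimal sketch of such a check, assuming the drives expose the Intel-style SMART attribute Media_Wearout_Indicator (normalized value counting down from 100) through `smartctl -A`. The function names and the warning threshold of 20 are hypothetical and only for illustration; other vendors use different attribute names.

```python
import subprocess
from typing import Optional

# Assumption: Intel DC-series SSDs report wear under this attribute name.
WEAR_ATTRIBUTE = "Media_Wearout_Indicator"


def parse_wearout(smartctl_output: str) -> Optional[int]:
    """Return the normalized wear value (100 = new, low = worn), or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like: ID NAME FLAG VALUE WORST THRESH TYPE ...
        if len(fields) >= 4 and fields[1] == WEAR_ATTRIBUTE:
            return int(fields[3])
    return None


def classify(wear: Optional[int], warn_below: int = 20) -> str:
    """Map a wear value to a simple monitoring state."""
    if wear is None:
        return "UNKNOWN"
    return "WARNING" if wear < warn_below else "OK"


def check_device(device: str, warn_below: int = 20) -> str:
    """Run `smartctl -A` on one device and classify its remaining endurance."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return classify(parse_wearout(result.stdout), warn_below)
```

Calling `check_device("/dev/sda")` (the device path is a placeholder) usually needs root, and the returned state can be fed into whatever monitoring system is already in place.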

As a note, the DC S3700 series is warrantied for almost 30x more writes than the S3500 series.
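
The endurance gap can be sketched as simple arithmetic. This assumes the 0.3 DWPD rating quoted further down in this thread for the S3510, a 5-year warranty period, and the commonly quoted 10 DWPD rating for the S3700; check the vendor spec sheets before relying on these numbers. The function names are hypothetical.

```python
def rated_writes_tb(capacity_gb: float, dwpd: float, warranty_years: float = 5.0) -> float:
    """Total rated writes in TB: capacity x drive-writes-per-day x days."""
    return capacity_gb * dwpd * 365 * warranty_years / 1000.0


def years_until_worn(capacity_gb: float, dwpd: float, daily_writes_gb: float,
                     warranty_years: float = 5.0) -> float:
    """Years until the rated writes are used up at a steady daily write volume."""
    total_gb = rated_writes_tb(capacity_gb, dwpd, warranty_years) * 1000.0
    return total_gb / daily_writes_gb / 365


if __name__ == "__main__":
    # 1.2TB drive at 0.3 DWPD (from the thread) vs. an assumed 10 DWPD drive.
    print(rated_writes_tb(1200, 0.3))
    print(rated_writes_tb(1200, 10.0) / rated_writes_tb(1200, 0.3))
    # Hypothetical workload: one full drive write (1200 GB) per day.
    print(years_until_worn(1200, 0.3, daily_writes_gb=1200))
```

Under these assumptions a 1.2TB drive at 0.3 DWPD is rated for about 657 TB of writes, and the ratio between the two ratings comes out at roughly 33x, in the same ballpark as the figure above.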


> On Dec 19, 2016, at 12:50 AM, Christian Balzer <chibi@xxxxxxx> wrote:
>
>
> Hello,
>
>> On Mon, 19 Dec 2016 15:05:05 +0800 (CST) mazhongming wrote:
>>
>> Hi Christian,
>> Thanks for your reply.
>>
>>
>> At 2016-12-19 14:01:57, "Christian Balzer" <chibi@xxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>>> On Mon, 19 Dec 2016 13:29:07 +0800 (CST) 马忠明 wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> So recently I was testing our Ceph cluster, which is mainly used for block storage (RBD).
>>>>
>>>> We have 30 SSD drives in total (5 storage nodes, 6 SSD drives per node). However, the fio results are very poor.
>>>>
>>> All relevant details are missing.
>>> SSD exact models, CPU/RAM config, network config, Ceph, OS/kernel, fio
>>> versions, the config you tested this with, as in replication.
>> SSD:Intel® SSD DC S3510 Series 1.2TB 2.5"
> Slower than mine, but not massively so and many more of them.
> But your distribution (CRUSH map based on 3 racks, right?) limits that
> number advantage.
> I'd expect them to be around 50-60% busy with the RBD engine fio.
>
> The endurance of 0.3 DWPD (0.1 really after in-line journals and other
> overhead like FS journals) would worry me.
> Are you monitoring their wear-out levels?
>
>> CPU:2×Intel E5-2630v4
> Slightly slower than the ones in my test cluster, but not significantly so.
>
>> MEM:128GB
>> Network config:2*10G bond4  LACP network connection
>> Ceph:Hammer 0.94.6
> I'd upgrade to the latest Hammer, just in case anybody ever plays with
> cache-tiering on there, which is deadly broken in that version.
>
>> OS/kernel:  Ubuntu 14.04.5 LTS/3.13.0-96-generic
> That kernel is a bit dated and vastly different than mine, but it
> shouldn't be any factor in the result.
>
>> Fio:2.12
>>
> Not missing a .1. in there?
>
> Fio 2.1.11 in my case, but I really dislike the RBD engine and the various
> bugs/inconsistencies people keep finding with it.
>
> Testing from within a (librbd backed) VM should be more realistic anyway.
>
> And this turns out to be one of these fio RBD engine corner cases, as I did
> run your fio command line against an image that was just 20GB in size.
>
> When running from a VM with libaio or with a reduced test size of 5GB
> the IOPS came down to about 8500, still faster than yours but only 2x
> instead of 4x.
>
>
>>
>>>
>>>> We tested the workload on the SSD pool with the following parameters:
>>>>
>>>> "fio --size=50G \
>>>>       --ioengine=rbd \
>>>>       --direct=1 \
>>>>       --numjobs=1 \
>>>>       --rw=randwrite(randread) \
>>>>       --name=com_ssd_4k_randwrite(randread) \
>>>>       --bs=4k \
>>>>       --iodepth=32 \
>>>>       --pool=ssd_volumes \
>>>>       --runtime=60 \
>>>>       --ramp_time=30 \
>>>>       --rbdname=4k_test_image"
>>>>
>>>> and here is the result:
>>>>
>>>> random write:4631;random read:21127
>>>>
>>>>
>>>>
>>>>
>>>> I also tested a pool (size=1, min_size=1, pg_num=256) consisting of a single SSD drive with the same workload pattern; those results were more acceptable (random write: 8303; random read: 27859).
>>>>
>>> I'm only going to comment on the write part.
>>>
>>> On my staging cluster (* see below) I ran your fio against the cache tier
>>> (so only SSDs involved) with this result:
>>>
>>> write: io=4206.3MB, bw=71784KB/s, iops=17945, runt= 60003msec
>>>   slat (usec): min=0, max=531, avg= 3.26, stdev=11.33
>>>   clat (usec): min=5, max=41996, avg=1770.23, stdev=2260.61
>>>    lat (usec): min=9, max=41997, avg=1773.36, stdev=2260.60
>>>
>>> So more than 2 times better than your non-replicated test.
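
The fio figures above can be cross-checked with Little's law: with a fixed queue depth, IOPS is roughly the number of outstanding requests divided by the average completion latency. A minimal sketch, assuming a single job (numjobs=1) that keeps iodepth=32 requests in flight for the whole run; the function names are hypothetical.

```python
def iops_from_latency(iodepth: int, avg_latency_usec: float) -> float:
    """Little's law: throughput = outstanding requests / average latency."""
    return iodepth / (avg_latency_usec / 1_000_000)


def latency_usec_from_iops(iodepth: int, iops: float) -> float:
    """Inverse form: average latency implied by a measured IOPS figure."""
    return iodepth / iops * 1_000_000


if __name__ == "__main__":
    # avg lat of 1773.36 usec at iodepth=32, from the fio output above.
    print(round(iops_from_latency(32, 1773.36)))
    # 4631 write IOPS reported by the original poster at the same iodepth.
    print(round(latency_usec_from_iops(32, 4631)))
```

The first call gives about 18,000 IOPS, close to the reported 17945, and the second suggests that 4631 IOPS at iodepth=32 corresponds to an average latency of roughly 6.9 ms per write. Seen this way, the IOPS number is mostly a statement about per-request latency rather than about raw SSD throughput.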
>>>
>>> 4k randwrites stress the CPUs (run atop or such on your OSD nodes
>>> when doing a test run), so this might be your limit here.
>>> Along with less than optimal SSDs or a high latency network.
>>
>>>
>> Yes, CPU usage might be the bottleneck of the whole system. BTW, our Ceph cluster is integrated with Mirantis OpenStack, and the result above was run from one compute node. I also ran a stress test from all 10 compute nodes. The result is almost the same: CPU usage on every storage node is around 50-60%, and CPU usage of each SSD OSD process is around 250-300%.
>>
>
> Yes, the OSD 300% CPU usage looks familiar.
> The hammer code seems to peter out there, even if there's still a core or
> 2 available.
>
> The Ceph latency is something that's obviously being addressed by the
> developers.
> Check the archives and google (Nick Fisk) for how to tune up your CPU
> settings to get every last IOPS from your HW.
>
> Another thing to always remember here is that you're testing network
> latency as well when running fio with direct=1 against Ceph, the local RBD
> cache is bypassed, so you're constrained by how long it takes for the
> network round-trips (which is of course significantly longer than a local
> SATA cable) and the Ceph latency (code and CPU speeds).
>
> Christian
>
>>
>> Pool parameters for ssd_volumes: size=3, min_size=1, pg_num=2048, pgp_num=2048
>>
>>
>>
>>
>>> Christian
>>>
>>>
>>> * Staging cluster:
>>> ---
>>> 4 nodes running latest Hammer under Debian Jessie (with sysvinit, kernel
>>> 4.6) and manually created OSDs.
>>> Infiniband (IPoIB) QDR (40Gb/s, about 30Gb/s effective) between all nodes.
>>>
>>> 2 HDD OSD nodes with 32GB RAM, fast enough CPU (E5-2620 v3), 2x 200GB DC S3610 for
>>> OS and journals (2 per SSD), 4x 1GB 2.5" SATAs for OSDs.
>>> For my amusement and edification the OSDs of one node are formatted with
>>> XFS, the other one EXT4 (as all my production clusters).
>>>
>>> The 2 SSD OSD nodes have 1x 200GB DC S3610 (OS and 4 journal partitions)
>>> and 2x 400GB DC S3610s (2 180GB partitions, so 8 SSD OSDs total), same
>>> specs as the HDD nodes otherwise.
>>> Also one node with XFS, the other EXT4.
>>>
>>> Pools are size=2, min_size=1, obviously.
>>> ---
>>>
>>>>
>>>>
>>>>
>>>> We have tuned the Linux kernel (read_ahead, disk_scheduler, numa, swappiness) and ceph.conf (client_message, filestore_queue, journal_queue, rbd_cache), and checked the RAID cache settings.
>>>>
>>>>
>>>>
>>>>
>>>> The only deficiency in the architecture is the unbalanced weight among the three racks: one rack has only one storage node.
>>>>
>>>>
>>>>
>>>>
>>>> So can anybody tell us whether these numbers are reasonable? If not, any suggestions for improving them would be appreciated.
>>>
>>>
>>> --
>>> Christian Balzer        Network/Systems Engineer
>>> chibi@xxxxxxx       Global OnLine Japan/Rakuten Communications
>>> http://www.gol.com/
>


David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760 | Mobile: 385.224.2943


_______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
