Re: Yet another performance tuning for CephFS

>> Are you sure? Your config didn't show this.

Yes. I have a dedicated 10GbE network between the Ceph nodes. Each Ceph node has a separate 10GbE NIC for that network. Do I have to set anything in the config for 10GbE?
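For reference, here is a minimal sketch of how I understand the two networks would be declared in ceph.conf (the subnets are hypothetical placeholders, not my actual addressing; only the 192.168.0.x range appears in my current config):

[global]
    # public (client) traffic -- same subnet as mon_host in my config
    public network = 192.168.0.0/24
    # hypothetical dedicated 10GbE subnet for OSD replication/heartbeat traffic
    cluster network = 192.168.1.0/24

Is this the part you mean is missing from my config?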

>> What kind of devices are they? did you do the journal test?
They are neither NVMe nor SSDs. Each node has 10x 3TB SATA hard disk drives (HDDs).
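
I have not yet run the journal test from the blog post you link below. If I understand it correctly, it is roughly this fio invocation (adapted from that post; /dev/sdX is a placeholder for the device to test, and since it writes to the raw device it must only be run on an empty/spare disk):

$ fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
      --numjobs=1 --iodepth=1 --runtime=60 --time_based \
      --group_reporting --name=journal-test

On plain SATA HDDs that also hold the data, I would expect the synchronous 4k write numbers to be low.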


-Gencer.


-----Original Message-----
From: Peter Maloney [mailto:peter.maloney@xxxxxxxxxxxxxxxxxxxx] 
Sent: Tuesday, July 18, 2017 2:47 PM
To: gencer@xxxxxxxxxxxxx
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Yet another performance tuning for CephFS

On 07/17/17 22:49, gencer@xxxxxxxxxxxxx wrote:
> I have a separate 10GbE network for ceph and another for public.
>
Are you sure? Your config didn't show this.

> No, they are not NVMe, unfortunately.
>
What kind of devices are they? did you do the journal test?
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

Unlike most tests, with Ceph journals you can't just look at the load on the device and decide it's not the bottleneck; you have to test it another way. I had some Micron SSDs that performed poorly, and that test showed them performing poorly too, yet from other benchmarks and from the disk load during journal tests they looked fine, which was misleading.
> Do you know any test command that I can try, to see if this is the max
> read speed from rsync?
I don't know how you can improve your rsync test.
>
> Because I tried one thing a few minutes ago. I opened 4 SSH sessions
> and ran the rsync command, copying bigfile to different targets in
> CephFS at the same time. Then I looked at the network graphs and saw
> numbers up to 1.09 GB/s. But why can't a single copy/rsync exceed
> 200 MB/s? I really wonder what prevents it.
>
> Gencer.
>
> On 2017-07-17 23:24, Peter Maloney wrote:
>> You should have a separate public and cluster network. And journal or 
>> wal/db performance is important... are the devices fast NVMe?
>>
>> On 07/17/17 21:31, gencer@xxxxxxxxxxxxx wrote:
>>
>>> Hi,
>>>
>>> I have found and applied almost every tuning setting/config I could
>>> find on the internet, but I couldn't speed things up by a single byte.
>>> The speed is always the same, whatever I do.
>>>
>>> I was on Jewel; now I have tried BlueStore on Luminous. I still get
>>> exactly the same speed from CephFS.
>>>
>>> It doesn't matter if I disable debug logging, or remove the [osd]
>>> section below and re-add it (see the .conf below); the results are
>>> exactly the same. Not a single byte is gained from those tunings. I
>>> also tuned the kernel (sysctl.conf).
>>>
>>> Basics:
>>>
>>> I have 2 nodes with 10 OSDs each, and each OSD is a 3TB SATA drive.
>>> Each node has 24 cores and 64GB of RAM. The Ceph nodes are connected
>>> via 10GbE NICs. No FUSE is used, but I tried that too; same results.
>>>
>>> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct
>>> 10+0 records in
>>> 10+0 records out
>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s
>>>
>>> 182 MB/s. This is the best speed I get so far; it is usually ~170 MB/s.
>>> I get much higher speeds on other filesystems, even with GlusterFS. Is
>>> there anything I can do or try?
>>>
>>> Read speed is also around 180-220 MB/s, but not higher.
>>>
>>> This is what I am using in ceph.conf:
>>>
>>> [global]
>>> fsid = d7163667-f8c5-466b-88df-8747b26c91df
>>> mon_initial_members = server1
>>> mon_host = 192.168.0.1
>>> auth_cluster_required = cephx
>>> auth_service_required = cephx
>>> auth_client_required = cephx
>>> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>> osd_mkfs_type = xfs
>>> osd pool default size = 2
>>> enable experimental unrecoverable data corrupting features = bluestore rocksdb
>>> bluestore fsck on mount = true
>>> rbd readahead disable after bytes = 0
>>> rbd readahead max bytes = 4194304
>>> log to syslog = false
>>> debug_lockdep = 0/0
>>> debug_context = 0/0
>>> debug_crush = 0/0
>>> debug_buffer = 0/0
>>> debug_timer = 0/0
>>> debug_filer = 0/0
>>> debug_objecter = 0/0
>>> debug_rados = 0/0
>>> debug_rbd = 0/0
>>> debug_journaler = 0/0
>>> debug_objectcatcher = 0/0
>>> debug_client = 0/0
>>> debug_osd = 0/0
>>> debug_optracker = 0/0
>>> debug_objclass = 0/0
>>> debug_filestore = 0/0
>>> debug_journal = 0/0
>>> debug_ms = 0/0
>>> debug_monc = 0/0
>>> debug_tp = 0/0
>>> debug_auth = 0/0
>>> debug_finisher = 0/0
>>> debug_heartbeatmap = 0/0
>>> debug_perfcounter = 0/0
>>> debug_asok = 0/0
>>> debug_throttle = 0/0
>>> debug_mon = 0/0
>>> debug_paxos = 0/0
>>> debug_rgw = 0/0
>>>
>>> [osd]
>>> osd max write size = 512
>>> osd client message size cap = 2147483648
>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>> filestore xattr use omap = true
>>> osd_op_threads = 8
>>> osd disk threads = 4
>>> osd map cache size = 1024
>>> filestore_queue_max_ops = 25000
>>> filestore_queue_max_bytes = 10485760
>>> filestore_queue_committing_max_ops = 5000
>>> filestore_queue_committing_max_bytes = 10485760000
>>> journal_max_write_entries = 1000
>>> journal_queue_max_ops = 3000
>>> journal_max_write_bytes = 1048576000
>>> journal_queue_max_bytes = 1048576000
>>> filestore_max_sync_interval = 15
>>> filestore_merge_threshold = 20
>>> filestore_split_multiple = 2
>>> osd_enable_op_tracker = false
>>> filestore_wbthrottle_enable = false
>>> osd_client_message_size_cap = 0
>>> osd_client_message_cap = 0
>>> filestore_fd_cache_size = 64
>>> filestore_fd_cache_shards = 32
>>> filestore_op_threads = 12
>>>
>>> As I stated above, it doesn't matter whether I have this [osd] section
>>> or not; the results are the same.
>>>
>>> I am open to all suggestions.
>>>
>>> Thanks,
>>>
>>> Gencer.
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 

--------------------------------------------
Peter Maloney
Brockmann Consult
Max-Planck-Str. 2
21502 Geesthacht
Germany
Tel: +49 4152 889 300
Fax: +49 4152 889 333
E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
Internet: http://www.brockmann-consult.de
--------------------------------------------

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



