Hi,

I will try to join in and help. As far as I understood: you only have HDDs in your cluster, the journals live on the same HDDs, and your pools have a replication size of 3? With that in mind you can do some calculations. For every write, Ceph needs to:

1. write the data and metadata into the journal on the primary OSD
2. copy the data over the backend network, twice, to the two other OSDs
3. write into the journal on the secondary OSDs
4. wait for the ACK of all OSDs

With that setup you can assume roughly 1/4 of the write speed of a single HDD, and with only one client you cannot make use of the scale-out (see the rough numbers below).

If you can, you should add SSDs for the journals and for the CephFS metadata pool. You could also consider building a cache tier for the CephFS data pool.
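To put rough numbers on that ~1/4 estimate (only a back-of-the-envelope sketch, assuming roughly 150 MB/s of sequential write per SATA HDD; your drives may differ):

    1 byte from the client
      -> 3 copies (pool size = 3)
      -> each copy is written twice on its OSD (journal + data on the same HDD)
      => 6 bytes hit the disks, and 2 bytes cross the backend network
         (primary to the two secondary OSDs)

    per-OSD effective rate: ~150 MB/s / 2 = ~75 MB/s (journal and data share one spindle)
    add the extra seeks plus waiting for both replica ACKs and you end up
    around 1/4 of the raw HDD speed, i.e. ~40 MB/s per OSD for a single stream

If you do try the cache tier, the usual sequence of commands looks roughly like the sketch below. The pool names are only placeholders, you also need a CRUSH rule that actually places the cache pool on the SSDs, and a writeback tier needs flush/evict limits tuned to your workload, so treat this as a starting point rather than a recipe:

    # small SSD-backed pool that will act as the cache
    ceph osd pool create cephfs_cache 128 128

    # put it in front of the existing CephFS data pool
    ceph osd tier add cephfs_data cephfs_cache
    ceph osd tier cache-mode cephfs_cache writeback
    ceph osd tier set-overlay cephfs_data cephfs_cache

    # a writeback tier needs a hit set and a size target to flush/evict against
    ceph osd pool set cephfs_cache hit_set_type bloom
    ceph osd pool set cephfs_cache target_max_bytes <usable SSD capacity in bytes>

Whether a cache tier really helps depends a lot on the working set, so benchmark it before relying on it.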
I hope it helps a bit,
Ansgar

2017-07-18 14:10 GMT+02:00 Gencer W. Genç <gencer@xxxxxxxxxxxxx>:
>>> Are you sure? Your config didn't show this.
>
> Yes. I have dedicated 10GbE network between ceph nodes. Each ceph node has seperate network that have 10GbE network card and speed. Do I have to set anything in the config for 10GbE?
>
>>> What kind of devices are they? did you do the journal test?
> They are not connected via NVMe neither SSD's. Each node has 10x3TB SATA Hard Disk Drives (HDD).
>
>
> -Gencer.
>
>
> -----Original Message-----
> From: Peter Maloney [mailto:peter.maloney@xxxxxxxxxxxxxxxxxxxx]
> Sent: Tuesday, July 18, 2017 2:47 PM
> To: gencer@xxxxxxxxxxxxx
> Cc: ceph-users@xxxxxxxxxxxxxx
> Subject: Re: Yet another performance tuning for CephFS
>
> On 07/17/17 22:49, gencer@xxxxxxxxxxxxx wrote:
>> I have a seperate 10GbE network for ceph and another for public.
>>
> Are you sure? Your config didn't show this.
>
>> No they are not NVMe, unfortunately.
>>
> What kind of devices are they? did you do the journal test?
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Unlike most tests, with ceph journals, you can't look at the load on the device and decide it's not the bottleneck; you have to test it another way. I had some micron SSDs I tested which performed poorly, and that test showed them performing poorly too. But from other benchmarks, and disk load during journal tests, they looked ok, which was misleading.
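(As an aside: the journal test behind that link is essentially a small-block synchronous write benchmark against the raw device. It boils down to an fio invocation along these lines, with /dev/sdX standing in for whichever SSD you want to evaluate; note that it writes to the device directly, so only run it on a disk you can wipe:

    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test

A device that cannot sustain a few thousand of these 4k synchronous writes per second will not make a good Ceph journal.)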
>> Do you know any test command that i can try to see if this is the max.
>> Read speed from rsync?
> I don't know how you can improve your rsync test.
>>
>> Because I tried one thing a few minutes ago. I opened 4 ssh channel
>> and run rsync command and copy bigfile to different targets in cephfs
>> at the same time. Then i looked into network graphs and i see numbers
>> up to 1.09 gb/s. But why single copy/rsync cannot exceed 200mb/s? What
>> prevents it im really wonder this.
>>
>> Gencer.
>>
>> On 2017-07-17 23:24, Peter Maloney wrote:
>>> You should have a separate public and cluster network. And journal or
>>> wal/db performance is important... are the devices fast NVMe?
>>>
>>> On 07/17/17 21:31, gencer@xxxxxxxxxxxxx wrote:
>>>
>>>> Hi,
>>>>
>>>> I located and applied almost every different tuning setting/config
>>>> over the internet. I couldn’t manage to speed up my speed one byte
>>>> further. It is always same speed whatever I do.
>>>>
>>>> I was on jewel, now I tried BlueStore on Luminous. Still exact same
>>>> speed I gain from cephfs.
>>>>
>>>> It doesn’t matter if I disable debug log, or remove [osd] section as
>>>> below and re-add as below (see .conf). Results are exactly the same.
>>>> Not a single byte is gained from those tunings. I also did tuning
>>>> for kernel (sysctl.conf).
>>>>
>>>> Basics:
>>>>
>>>> I have 2 nodes with 10 OSD each and each OSD is 3TB SATA drive. Each
>>>> node has 24 cores and 64GB of RAM. Ceph nodes are connected via
>>>> 10GbE NIC. No FUSE used. But tried that too. Same results.
>>>>
>>>> $ dd if=/dev/zero of=/mnt/c/testfile bs=100M count=10 oflag=direct
>>>> 10+0 records in
>>>> 10+0 records out
>>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 5.77219 s, 182 MB/s
>>>>
>>>> 182MB/s. This is the best speed i get so far. Usually 170~MB/s. Hm..
>>>> I get much much much higher speeds on different filesystems. Even
>>>> with glusterfs. Is there anything I can do or try?
>>>>
>>>> Read speed is also around 180-220MB/s but not higher.
>>>>
>>>> This is What I am using on ceph.conf:
>>>>
>>>> [global]
>>>> fsid = d7163667-f8c5-466b-88df-8747b26c91df
>>>> mon_initial_members = server1
>>>> mon_host = 192.168.0.1
>>>> auth_cluster_required = cephx
>>>> auth_service_required = cephx
>>>> auth_client_required = cephx
>>>> osd mount options = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>> osd_mkfs_type = xfs
>>>> osd pool default size = 2
>>>> enable experimental unrecoverable data corrupting features = bluestore rocksdb
>>>> bluestore fsck on mount = true
>>>> rbd readahead disable after bytes = 0
>>>> rbd readahead max bytes = 4194304
>>>> log to syslog = false
>>>> debug_lockdep = 0/0
>>>> debug_context = 0/0
>>>> debug_crush = 0/0
>>>> debug_buffer = 0/0
>>>> debug_timer = 0/0
>>>> debug_filer = 0/0
>>>> debug_objecter = 0/0
>>>> debug_rados = 0/0
>>>> debug_rbd = 0/0
>>>> debug_journaler = 0/0
>>>> debug_objectcatcher = 0/0
>>>> debug_client = 0/0
>>>> debug_osd = 0/0
>>>> debug_optracker = 0/0
>>>> debug_objclass = 0/0
>>>> debug_filestore = 0/0
>>>> debug_journal = 0/0
>>>> debug_ms = 0/0
>>>> debug_monc = 0/0
>>>> debug_tp = 0/0
>>>> debug_auth = 0/0
>>>> debug_finisher = 0/0
>>>> debug_heartbeatmap = 0/0
>>>> debug_perfcounter = 0/0
>>>> debug_asok = 0/0
>>>> debug_throttle = 0/0
>>>> debug_mon = 0/0
>>>> debug_paxos = 0/0
>>>> debug_rgw = 0/0
>>>>
>>>> [osd]
>>>> osd max write size = 512
>>>> osd client message size cap = 2147483648
>>>> osd mount options xfs = rw,noexec,nodev,noatime,nodiratime,nobarrier
>>>> filestore xattr use omap = true
>>>> osd_op_threads = 8
>>>> osd disk threads = 4
>>>> osd map cache size = 1024
>>>> filestore_queue_max_ops = 25000
>>>> filestore_queue_max_bytes = 10485760
>>>> filestore_queue_committing_max_ops = 5000
>>>> filestore_queue_committing_max_bytes = 10485760000
>>>> journal_max_write_entries = 1000
>>>> journal_queue_max_ops = 3000
>>>> journal_max_write_bytes = 1048576000
>>>> journal_queue_max_bytes = 1048576000
>>>> filestore_max_sync_interval = 15
>>>> filestore_merge_threshold = 20
>>>> filestore_split_multiple = 2
>>>> osd_enable_op_tracker = false
>>>> filestore_wbthrottle_enable = false
>>>> osd_client_message_size_cap = 0
>>>> osd_client_message_cap = 0
>>>> filestore_fd_cache_size = 64
>>>> filestore_fd_cache_shards = 32
>>>> filestore_op_threads = 12
>>>>
>>>> As I stated above, it doesn’t matter if I have this [osd] section or
>>>> not. Results are same.
>>>>
>>>> I am open to all suggestions.
>>>>
>>>> Thanks,
>>>>
>>>> Gencer.
>>>>
>>>> _______________________________________________
>>>> ceph-users mailing list
>>>> ceph-users@xxxxxxxxxxxxxx
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> --
>
> --------------------------------------------
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.maloney@xxxxxxxxxxxxxxxxxxxx
> Internet: http://www.brockmann-consult.de
> --------------------------------------------
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com