Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

On 5/2/2023 9:02 PM, Nikola Ciprich wrote:

however, it's probably worth noting that historically we've been using the following OSD options:
ceph config set osd bluestore_rocksdb_options compression=kNoCompression,max_write_buffer_number=32,min_write_buffer_number_to_merge=2,recycle_log_file_num=32,compaction_style=kCompactionStyleLevel,write_buffer_size=67108864,target_file_size_base=67108864,max_background_compactions=31,level0_file_num_compaction_trigger=8,level0_slowdown_writes_trigger=32,level0_stop_writes_trigger=64,max_bytes_for_level_base=536870912,compaction_threads=32,max_bytes_for_level_multiplier=8,flusher_threads=8,compaction_readahead_size=2MB
ceph config set osd bluestore_cache_autotune 0
ceph config set osd bluestore_cache_size_ssd 2G
ceph config set osd bluestore_cache_kv_ratio 0.2
ceph config set osd bluestore_cache_meta_ratio 0.8
ceph config set osd osd_min_pg_log_entries 10
ceph config set osd osd_max_pg_log_entries 10
ceph config set osd osd_pg_log_dups_tracked 10
ceph config set osd osd_pg_log_trim_min 10

so maybe I'll start by resetting those to defaults (i.e. enabling cache autotune etc.)
as a first step..
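
for illustration, overrides that were applied with "ceph config set" can normally be dropped back to the built-in defaults with "ceph config rm", e.g. (same option names as above; note that bluestore_rocksdb_options only takes effect after an OSD restart):

ceph config rm osd bluestore_rocksdb_options
ceph config rm osd bluestore_cache_autotune
ceph config rm osd bluestore_cache_size_ssd
ceph config rm osd bluestore_cache_kv_ratio
ceph config rm osd bluestore_cache_meta_ratio
ceph config rm osd osd_min_pg_log_entries
ceph config rm osd osd_max_pg_log_entries
ceph config rm osd osd_pg_log_dups_tracked
ceph config rm osd osd_pg_log_trim_min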

Generally I wouldn't recommend using non-default settings unless there is an explicit rationale for them. So yes, better to revert to defaults whenever possible.

I doubt this is the root cause of your issue though..
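
For reference, the overrides currently stored in the central config can be listed with "ceph config dump", and a running OSD's deviation from built-in defaults can be checked on its host with something like the following (osd.0 is just an example ID):

ceph config dump
ceph daemon osd.0 config diff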





Thanks,

Igor

On 5/2/2023 11:32 AM, Nikola Ciprich wrote:
Hello dear CEPH users and developers,

we're dealing with a strange problem. We have a 12-node Alma Linux 9 cluster,
initially installed with CEPH 15.2.16 and later upgraded to 17.2.5. It's running a bunch
of KVM virtual machines accessing volumes via RBD.

everything is working well, but there is a strange and, for us, quite serious issue:
the speed of write operations (both sequential and random) degrades continuously and
drastically to almost unusable numbers (over ~1 week it drops from ~70k 4k writes/s
from one VM to ~7k writes/s).
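
(these numbers are measured from a single VM; for illustration, a roughly equivalent random-write measurement directly against an RBD image could be done with something like the following, where pool and image names are placeholders:

fio --name=rbd-randwrite --ioengine=rbd --pool=<pool> --rbdname=<image> --rw=randwrite --bs=4k --iodepth=32 --numjobs=1 --time_based --runtime=60
)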

When I restart all OSD daemons, the numbers immediately return to normal..
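
(by "restart all OSD daemons" I mean something along these lines, depending on the deployment; the OSD ID is a placeholder:

systemctl restart ceph-osd.target          # per node, package-based deployment
ceph orch daemon restart osd.<id>          # per daemon, cephadm deployment
)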

volumes are stored on a replicated pool with 4 replicas, on top of 7*12 = 84
INTEL SSDPE2KX080T8 NVMe drives.

I updated the cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896,
as restarting OSDs is quite painful when half of them crash..

I don't see anything suspicious: node load is quite low, there are no log errors,
and network latency and throughput are OK too.
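
for reference, per-OSD counters that might show where the latency accumulates could be pulled with e.g. (osd.0 is just an example ID):

ceph osd perf                        # per-OSD commit/apply latencies
ceph tell osd.0 perf dump            # detailed perf counters of one OSD
ceph tell osd.0 dump_historic_ops    # recent slowest ops on that OSD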

Is anyone seeing a similar issue?

I'd like to ask for hints on what I should check further..

we're running lots of 14.2.x and 15.2.x clusters, none of which show a similar
issue, so I suspect this is something related to Quincy.

thanks a lot in advance

with best regards

nikola ciprich



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


