Re: quincy 17.2.6 - write performance continuously slowing down until OSD restart needed

Hi Nikola,

I'd suggest starting to monitor perf counters for your OSDs, op_w_lat/subop_w_lat specifically. I presume they rise over time, don't they?

Does subop_w_lat grow for every OSD or just for a subset of them? How large is the delta between the best and the worst OSDs after a one-week period? How many "bad" OSDs are there at that point?
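
Something like the following (an untested Python sketch, run on each OSD host; the OSD ids and the exact counter names, op_w_latency/subop_w_latency here, are assumptions and may differ between releases) should be enough to watch the trend:

#!/usr/bin/env python3
# Sample write-latency perf counters from the admin sockets of the OSDs
# running on this host. Counter names and JSON layout are assumptions.
import json
import subprocess

OSD_IDS = [0, 1, 2]  # adjust to the OSDs hosted on this node

def avg_ms(counter):
    # perf counters expose a running sum (seconds) and a count; their ratio
    # is the average latency since the OSD started
    count = counter.get("avgcount", 0)
    return 1000.0 * counter.get("sum", 0.0) / count if count else 0.0

for osd_id in OSD_IDS:
    raw = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "perf", "dump"])
    osd_perf = json.loads(raw)["osd"]
    print(f"osd.{osd_id}: "
          f"op_w_latency={avg_ms(osd_perf['op_w_latency']):.2f} ms, "
          f"subop_w_latency={avg_ms(osd_perf['subop_w_latency']):.2f} ms")

Sampling this periodically and graphing the per-OSD averages should make both the drift and any outlier OSDs obvious.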


And some more questions:

How high are space utilization and fragmentation for your OSDs?
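
Both numbers could be collected with something like this (again just a sketch; the `ceph osd df` JSON layout and the availability of the "bluestore allocator score block" admin-socket command in your release are assumptions on my side):

#!/usr/bin/env python3
# Per-OSD space utilization cluster-wide, plus BlueStore fragmentation score
# for the OSDs local to this host.
import json
import subprocess

osd_df = json.loads(subprocess.check_output(["ceph", "osd", "df", "--format", "json"]))
for node in osd_df["nodes"]:
    print(f"osd.{node['id']}: {node['utilization']:.1f}% used")

for osd_id in [0, 1, 2]:  # adjust to the OSDs hosted on this node
    score = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "bluestore", "allocator", "score", "block"])
    print(f"osd.{osd_id} fragmentation score: {score.decode().strip()}")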

Is the same performance drop observed for artificial benchmarks, e.g. 4k random writes to a fresh RBD image using fio?
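
For example, something along these lines, wrapped in Python only to keep it consistent with the other sketches (the pool name, image name and cephx user are placeholders, fio has to be built with the rbd ioengine, and a throwaway image should be created first so no VM data is touched):

#!/usr/bin/env python3
# 4k random-write fio run against a scratch RBD image, so the measurement is
# independent of the VMs. Create the image beforehand, e.g.:
#   rbd create rbd/bench-scratch --size 10G
import subprocess

subprocess.run([
    "fio",
    "--name=rbd-4k-randwrite",
    "--ioengine=rbd",
    "--clientname=admin",       # cephx user, without the "client." prefix
    "--pool=rbd",               # placeholder pool name
    "--rbdname=bench-scratch",  # placeholder image name
    "--rw=randwrite",
    "--bs=4k",
    "--iodepth=32",
    "--numjobs=1",
    "--direct=1",
    "--time_based",
    "--runtime=60",
    "--group_reporting",
], check=True)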

Is there any RAM utilization growth for the OSD processes over time? Or maybe any suspicious growth in mempool stats?
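
Mempool stats are available through the admin socket as well; a sketch, assuming the JSON layout of current releases (mempool -> by_pool -> {bytes, items}):

#!/usr/bin/env python3
# Sample mempool stats for the OSDs local to this host so growth over time
# can be spotted.
import json
import subprocess

for osd_id in [0, 1, 2]:  # adjust to the OSDs hosted on this node
    raw = subprocess.check_output(["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"])
    pools = json.loads(raw)["mempool"]["by_pool"]
    total_mib = sum(p["bytes"] for p in pools.values()) / 2**20
    top = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)[:5]
    print(f"osd.{osd_id}: total {total_mib:.0f} MiB;",
          ", ".join(f"{name}={p['bytes'] / 2**20:.0f} MiB" for name, p in top))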


As a blind, brute-force approach you might also want to compact RocksDB through ceph-kvstore-tool and switch the BlueStore allocator to bitmap (presuming the default hybrid one is in effect right now). Please make one modification at a time so you can tell which action, if any, actually helps.
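
For the offline compaction, roughly something like this per OSD (a sketch only, assuming a non-containerized deployment with systemd-managed OSDs and the usual /var/lib/ceph/osd/ceph-<id> data directories; let the cluster settle before moving on to the next OSD):

#!/usr/bin/env python3
# Offline RocksDB compaction of a single OSD, one at a time.
import subprocess

def compact_osd(osd_id: int) -> None:
    unit = f"ceph-osd@{osd_id}"
    path = f"/var/lib/ceph/osd/ceph-{osd_id}"
    subprocess.run(["systemctl", "stop", unit], check=True)
    try:
        # offline compaction of the BlueStore key-value (RocksDB) database
        subprocess.run(["ceph-kvstore-tool", "bluestore-kv", path, "compact"], check=True)
    finally:
        subprocess.run(["systemctl", "start", unit], check=True)

compact_osd(0)

# Switching the allocator is a config change plus OSD restarts, roughly:
#   ceph config set osd bluestore_allocator bitmap
# Keep it separate from the compaction so the effect of each change can be
# told apart, as noted above.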


Thanks,

Igor

On 5/2/2023 11:32 AM, Nikola Ciprich wrote:
Hello dear CEPH users and developers,

we're dealing with a strange problem. We have a 12-node Alma Linux 9 cluster,
initially installed with Ceph 15.2.16 and later upgraded to 17.2.5. It's running a bunch
of KVM virtual machines accessing volumes via RBD.

everything is working well, but there is a strange and, for us, quite serious issue
  - the speed of write operations (both sequential and random) keeps degrading
  drastically to almost unusable numbers (within about a week it drops from ~70k 4k writes/s
  from one VM to ~7k writes/s)

When I restart all OSD daemons, the numbers immediately return to normal.

the volumes are stored on a replicated pool with 4 replicas, on top of 7*12 = 84
INTEL SSDPE2KX080T8 NVMe drives.

I updated the cluster to 17.2.6 some time ago, but the problem persists. This is
especially annoying in connection with https://tracker.ceph.com/issues/56896,
as restarting OSDs is quite painful when half of them crash.

I don't see anything suspicious: node load is quite low, there are no errors in the logs,
and network latency and throughput are OK too.

Is anyone having a similar issue?

I'd like to ask for hints on what I should check further.

we're running lots of 14.2.x and 15.2.x clusters, none of which show a similar
issue, so I suspect this is something related to Quincy.

thanks a lot in advance

with best regards

nikola ciprich



--
Igor Fedotov
Ceph Lead Developer

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


