Hello dear Ceph users and developers,

we're dealing with a strange problem. We have a 12-node AlmaLinux 9 cluster, initially installed with Ceph 15.2.16 and later upgraded to 17.2.5. It runs a bunch of KVM virtual machines that access their volumes over RBD.

Everything works well, but there is one strange and, for us, quite serious issue: the speed of write operations (both sequential and random) steadily degrades to almost unusable numbers. In about one week it drops from ~70k 4k writes/s from a single VM to ~7k writes/s. When I restart all OSD daemons, the numbers immediately return to normal.

The volumes are stored on a replicated pool with 4 replicas, on top of 7*12 = 84 INTEL SSDPE2KX080T8 NVMe drives.

I updated the cluster to 17.2.6 some time ago, but the problem persists. This is especially annoying in connection with https://tracker.ceph.com/issues/56896, as restarting OSDs is quite painful when half of them crash.

I don't see anything suspicious: the load on the nodes is quite low, there are no errors in the logs, and network latency and throughput are OK too.

Is anyone having a similar issue? I'd like to ask for hints on what I should check further. We're running lots of 14.2.x and 15.2.x clusters, none of which show a similar issue, so I suspect this is something related to Quincy.

Thanks a lot in advance,

with best regards

nikola ciprich

--
-------------------------------------
Ing. Nikola CIPRICH
LinuxBox.cz, s.r.o.
28.rijna 168, 709 00 Ostrava

tel.:   +420 591 166 214
fax:    +420 596 621 273
mobil:  +420 777 093 799
www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@xxxxxxxxxxx
-------------------------------------
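
To put a number on a slowdown like the ~70k to ~7k writes/s drop over about a week, it may help to log the same benchmark at regular intervals after an OSD restart and fit a trend to it. Below is a minimal sketch, not anything from Ceph itself: the function name `fit_half_life` is hypothetical, and it assumes (without evidence) that the decay is roughly exponential. Only the two endpoint values come from the post; more samples would be needed to tell a gradual decay from a sudden step.

```python
from math import log

def fit_half_life(samples):
    """Least-squares fit of ln(IOPS) against time.

    samples: list of (days_since_osd_restart, write_iops) tuples.
    Returns the time in days for IOPS to halve, assuming exponential
    decay, or None if the fitted trend is flat or improving.
    """
    n = len(samples)
    ts = [t for t, _ in samples]
    ys = [log(iops) for _, iops in samples]
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    # Slope of the regression line through (t, ln(iops)).
    slope = (
        sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
        / sum((t - t_mean) ** 2 for t in ts)
    )
    if slope >= 0:
        return None
    return -log(2) / slope

# Only these two endpoints are taken from the post; real monitoring
# would append one sample per benchmark run.
samples = [(0, 70_000), (7, 7_000)]
drop_factor = samples[0][1] / samples[-1][1]
half_life = fit_half_life(samples)
print(f"drop factor: {drop_factor:.1f}x, half-life: {half_life:.2f} days")
```

If the fitted half-life stays about the same between restarts, that points at something accumulating steadily inside the OSDs; a step change in the samples would suggest a one-off event instead.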