Hi,

Last Saturday I upgraded my production cluster from dumpling to emperor (we had been running it successfully on a test cluster). A couple of hours later, OSDs started failing: some were marked down by Ceph, probably because of IO starvation. I set the cluster to «noout», restarted the down OSDs, then let the cluster recover. 24 hours later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is still maintained. Things are better, but the cluster is slower than it was with dumpling.

The main problem seems to be that the OSDs now do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But journal activity hasn't changed (an SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Nor has node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or overall cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools on «full SSD» OSDs and pools on «HDD + SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. Each «HDD+SSD» node has 9 OSDs (9 HDDs with 3 journal SSDs), and there are only 3 such nodes, so 27 «HDD+SSD» OSDs in total.

The IO peak between 03:00 and 09:00 corresponds to snapshot rotation (i.e. «rbd snap rm» operations). osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried reducing osd_pg_max_concurrent_snap_trims to 1, but it doesn't seem to really help. The only thing that does seem to help is reducing osd_disk_threads from 8 to 1.

So, any idea what's happening?

Thanks for any help,
Olivier
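
PS: for reference, here is a rough sketch of the relevant part of my ceph.conf after these changes (the option names are the ones mentioned above; the rest of my real file is omitted):

    [osd]
        osd snap trim sleep = 0.8
        osd pg max concurrent snap trims = 1
        osd disk threads = 1

And the flag I set before restarting the down OSDs, so the cluster wouldn't start rebalancing in the meantime:

    ceph osd set noout
    # restart the affected OSDs, wait for recovery, then:
    ceph osd unset noout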