Hi,

maybe this is related?:

http://tracker.ceph.com/issues/9503
"Dumpling: removing many snapshots in a short time makes OSDs go berserk"

http://tracker.ceph.com/issues/9487
"dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping"

http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html

I think the fix is already backported to dumpling; I'm not sure it has been done for firefly yet.

Alexandre

----- Original Message -----
From: "Olivier Bonvalet" <ceph.list@xxxxxxxxx>
To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Wednesday, March 4, 2015 12:10:30
Subject: Perf problem after upgrade from dumpling to firefly

Hi,

last Saturday I upgraded my production cluster from dumpling to emperor (since we had been using it successfully on a test cluster).

A couple of hours later, OSDs started failing: some of them were marked as down by Ceph, probably because of IO starvation. I set the cluster to «noout», restarted the downed OSDs, then let them recover. 24 hours later, same problem (at nearly the same hour). So I chose to upgrade directly to firefly, which is still maintained.

Things are better now, but the cluster is slower than with dumpling.

The main problem seems to be that the OSDs do twice as many write operations per second:
https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png

But the journal doesn't change (SSD dedicated to OSD 70+71+72):
https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png

Neither does node bandwidth:
https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png

Or whole-cluster IO activity:
https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png

Some background: the cluster is split into pools with «full SSD» OSDs and «HDD + SSD journal» OSDs. Only the «HDD+SSD» OSDs seem to be affected. I have 9 OSDs per «HDD+SSD» node (9 HDDs and 3 SSDs), and only 3 «HDD+SSD» nodes (so a total of 27 «HDD+SSD» OSDs).

The IO peak between 03h00 and 09h00 corresponds to snapshot rotation (= «rbd snap rm» operations).

osd_snap_trim_sleep has been set to 0.8 for months. Yesterday I tried reducing osd_pg_max_concurrent_snap_trims to 1; it doesn't seem to really help. The only thing which does seem to help is reducing osd_disk_threads from 8 to 1.

So. Any idea what's happening?

Thanks for any help,
Olivier

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
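
For reference, a minimal sketch of how the handling and tuning described above can be applied on a firefly-era cluster. The flags and values simply mirror what is mentioned in the thread (osd_snap_trim_sleep = 0.8, osd_pg_max_concurrent_snap_trims = 1, osd_disk_threads = 1); they are not recommendations, and the OSD id used for verification is just an example:

    # prevent the cluster from re-balancing while flapping OSDs are restarted
    ceph osd set noout
    # ... restart the affected OSDs and let them recover, then ...
    ceph osd unset noout

    # adjust snap-trim throttling at runtime on all OSDs
    ceph tell osd.* injectargs '--osd_snap_trim_sleep 0.8'
    ceph tell osd.* injectargs '--osd_pg_max_concurrent_snap_trims 1'

    # verify what a given OSD is actually running with (via its admin socket)
    ceph --admin-daemon /var/run/ceph/ceph-osd.70.asok config show | grep -E 'snap_trim|disk_threads'

To make such settings survive OSD restarts, the same options would go into the [osd] section of ceph.conf on the OSD nodes, for example:

    [osd]
        osd snap trim sleep = 0.8
        osd pg max concurrent snap trims = 1
        osd disk threads = 1

Note that osd_disk_threads only takes effect on daemon restart, while the snap-trim options can be injected at runtime as shown above.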