Thanks Alexandre.

The load problem is permanent: I have twice the IO/s on the HDDs since firefly. And yes, the problem hangs production at night during snap trimming.

I suppose there is a new OSD parameter which changes the behavior of the journal, or something like that. But I didn't find anything about that.

Olivier

On Wednesday, 4 March 2015 at 14:44 +0100, Alexandre DERUMIER wrote:
> Hi,
> 
> maybe this is related?:
> 
> http://tracker.ceph.com/issues/9503
> "Dumpling: removing many snapshots in a short time makes OSDs go berserk"
> 
> http://tracker.ceph.com/issues/9487
> "dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping"
> 
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
> 
> I think it's already backported in dumpling; I'm not sure it's already done for firefly.
> 
> Alexandre
> 
> ----- Original Message -----
> From: "Olivier Bonvalet" <ceph.list@xxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Wednesday, 4 March 2015 12:10:30
> Subject: Perf problem after upgrade from dumpling to firefly
> 
> Hi,
> 
> last Saturday I upgraded my production cluster from dumpling to emperor
> (since we had been using it successfully on a test cluster).
> A couple of hours later, OSDs started failing: some of them were marked
> as down by Ceph, probably because of IO starvation. I set the cluster
> to "noout", restarted the downed OSDs, then let them recover. 24h later,
> the same problem occurred (at nearly the same hour).
> 
> So I chose to upgrade directly to firefly, which is maintained.
> Things are better, but the cluster is slower than with dumpling.
> 
> The main problem seems to be that the OSDs do twice as many write
> operations per second:
> https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
> https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png
> 
> But the journal doesn't change (an SSD dedicated to OSDs 70+71+72):
> https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png
> 
> Neither does node bandwidth:
> https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png
> 
> Nor whole-cluster IO activity:
> https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png
> 
> Some background:
> The cluster is split into pools with "full SSD" OSDs and "HDD+SSD
> journal" OSDs. Only the "HDD+SSD" OSDs seem to be affected.
> 
> I have 9 OSDs per "HDD+SSD" node (9 HDDs and 3 SSDs), and only 3
> "HDD+SSD" nodes (so a total of 27 "HDD+SSD" OSDs).
> 
> The IO peak between 03h00 and 09h00 corresponds to the snapshot
> rotation (= "rbd snap rm" operations).
> osd_snap_trim_sleep has been set to 0.8 for months.
> Yesterday I tried reducing osd_pg_max_concurrent_snap_trims to 1. It
> doesn't seem to really help.
> 
> The only thing which seems to help is reducing osd_disk_threads from 8
> to 1.
> 
> So. Any idea about what's happening?
> 
> Thanks for any help,
> Olivier
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
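
[Editor's note: the thread suspects a changed OSD default after the upgrade. One way to check this, not described in the thread itself, is to dump each daemon's runtime configuration with `ceph daemon osd.<id> config show` (which prints JSON) on a dumpling node and a firefly node, save both dumps, and diff them. A minimal sketch of such a diff in Python; the `diff_configs` helper and the dump file names are hypothetical, not part of Ceph:]

```python
import json

def diff_configs(old_path, new_path):
    """Return {option: (old_value, new_value)} for options whose values
    differ between two saved `ceph daemon osd.N config show` JSON dumps.
    Options present in only one dump show up with None on the other side."""
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed

# Hypothetical usage -- capture one dump per version, e.g.:
#   ceph daemon osd.70 config show > dumpling-osd70.json   (before upgrade)
#   ceph daemon osd.70 config show > firefly-osd70.json    (after upgrade)
# then:
#   for opt, (old_val, new_val) in diff_configs(
#           "dumpling-osd70.json", "firefly-osd70.json").items():
#       print(f"{opt}: {old_val!r} -> {new_val!r}")
```

In the resulting diff, the options already discussed in the thread (osd_snap_trim_sleep, osd_pg_max_concurrent_snap_trims, osd_disk_threads) are obvious candidates to grep for; values can also be adjusted at runtime with `ceph tell osd.* injectargs`, e.g. `ceph tell osd.* injectargs '--osd_disk_threads 1'`, to test a setting before committing it to ceph.conf.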