Thanks Alexandre.

The load problem is permanent: I have twice the IO/s on the HDDs since firefly. And yes, the problem hangs production at night during snap trimming.

I suppose there is a new OSD parameter which changes the behavior of the journal, or something like that. But I didn't find anything about that.

Olivier

On Wednesday, 4 March 2015 at 14:44 +0100, Alexandre DERUMIER wrote:
> Hi,
> 
> maybe this is related?:
> 
> http://tracker.ceph.com/issues/9503
> "Dumpling: removing many snapshots in a short time makes OSDs go berserk"
> 
> http://tracker.ceph.com/issues/9487
> "dumpling: snaptrimmer causes slow requests while backfilling. osd_snap_trim_sleep not helping"
> 
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-December/045116.html
> 
> I think it's already backported in dumpling; I'm not sure it's already done for firefly.
> 
> Alexandre
> 
> ----- Original Message -----
> From: "Olivier Bonvalet" <ceph.list@xxxxxxxxx>
> To: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Wednesday, 4 March 2015 12:10:30
> Subject: Perf problem after upgrade from dumpling to firefly
> 
> Hi,
> 
> last Saturday I upgraded my production cluster from dumpling to emperor
> (since we had been using it successfully on a test cluster).
> A couple of hours later, OSDs started failing: some of them were marked
> as down by Ceph, probably because of IO starvation. I set the cluster
> to "noout", restarted the downed OSDs, then let them recover. 24h later,
> the same problem occurred (at nearly the same hour).
> 
> So I chose to upgrade directly to firefly, which is maintained.
> Things are better, but the cluster is slower than with dumpling.
> 
> The main problem seems to be that the OSDs do twice as many write
> operations per second:
> https://daevel.fr/img/firefly/firefly-upgrade-OSD70-IO.png
> https://daevel.fr/img/firefly/firefly-upgrade-OSD71-IO.png
> 
> But the journal doesn't change (an SSD dedicated to OSDs 70+71+72):
> https://daevel.fr/img/firefly/firefly-upgrade-OSD70+71-journal.png
> 
> Neither does node bandwidth:
> https://daevel.fr/img/firefly/firefly-upgrade-dragan-bandwidth.png
> 
> Nor whole-cluster IO activity:
> https://daevel.fr/img/firefly/firefly-upgrade-cluster-IO.png
> 
> Some background:
> The cluster is split into pools with "full SSD" OSDs and "HDD+SSD
> journal" OSDs. Only the "HDD+SSD" OSDs seem to be affected.
> 
> I have 9 OSDs per "HDD+SSD" node (9 HDDs and 3 SSDs), and only 3
> "HDD+SSD" nodes (so a total of 27 "HDD+SSD" OSDs).
> 
> The IO peak between 03h00 and 09h00 corresponds to the snapshot
> rotation (= "rbd snap rm" operations).
> osd_snap_trim_sleep has been set to 0.8 for months.
> Yesterday I tried reducing osd_pg_max_concurrent_snap_trims to 1. It
> doesn't seem to really help.
> 
> The only thing which seems to help is reducing osd_disk_threads from 8
> to 1.
> 
> So. Any idea about what's happening?
> 
> Thanks for any help,
> Olivier
> 
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
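
[Editor's note: the thread suspects a changed OSD default after the upgrade. One way to check this, not described in the thread itself, is to dump each daemon's runtime configuration with `ceph daemon osd.<id> config show` (which prints JSON) on a dumpling node and a firefly node, save both dumps, and diff them. A minimal sketch of such a diff in Python; the `diff_configs` helper and the dump file names are hypothetical, not part of Ceph:]

```python
import json

def diff_configs(old_path, new_path):
    """Return {option: (old_value, new_value)} for options whose values
    differ between two saved `ceph daemon osd.N config show` JSON dumps.
    Options present in only one dump show up with None on the other side."""
    with open(old_path) as f:
        old = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    changed = {}
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changed[key] = (old.get(key), new.get(key))
    return changed

# Hypothetical usage -- capture one dump per version, e.g.:
#   ceph daemon osd.70 config show > dumpling-osd70.json   (before upgrade)
#   ceph daemon osd.70 config show > firefly-osd70.json    (after upgrade)
# then:
#   for opt, (old_val, new_val) in diff_configs(
#           "dumpling-osd70.json", "firefly-osd70.json").items():
#       print(f"{opt}: {old_val!r} -> {new_val!r}")
```

In the resulting diff, the options already discussed in the thread (osd_snap_trim_sleep, osd_pg_max_concurrent_snap_trims, osd_disk_threads) are obvious candidates to grep for; values can also be adjusted at runtime with `ceph tell osd.* injectargs`, e.g. `ceph tell osd.* injectargs '--osd_disk_threads 1'`, to test a setting before committing it to ceph.conf.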