Ok, why is ceph marking osds down? Post your ceph.log from one of the
problematic periods.
-Sam

On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Hello cephers,
>
> I need your help and suggestions on what is going on with my cluster. A few
> weeks ago I upgraded from Firefly to Giant. I've previously written about
> having issues with Giant where, over a two-week period, the cluster's IO
> froze three times after ceph marked two osds down. I have in total just 17
> osds across two osd servers, plus 3 mons. The cluster is running on Ubuntu
> 12.04 with the latest updates.
>
> I've got zabbix agents monitoring the osd servers and the cluster. I get
> alerts on any issues, such as problems with PGs, etc. Since upgrading to
> Giant, I am frequently seeing emails alerting me that the cluster has
> degraded PGs - around 10-15 such emails per day. The number of degraded PGs
> varies between a couple of PGs and over a thousand. After several minutes
> the cluster repairs itself. The total number of PGs in the cluster is 4412
> across all the pools.
>
> I am also seeing more alerts from vms reporting high IO wait, as well as
> hung tasks. Some vms report over 50% io wait.
>
> This did not happen on Firefly or previous releases of ceph. Not much has
> changed in the cluster since the upgrade to Giant. Networking and hardware
> are still the same, and it is still running the same version of Ubuntu. The
> cluster load hasn't changed either. Thus, I think the issues above are
> related to the upgrade of ceph to Giant.
>
> Here is the ceph.conf that I use:
>
> [global]
> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
> auth_supported = cephx
> osd_journal_size = 10240
> filestore_xattr_use_omap = true
> public_network = 192.168.168.0/24
> rbd_default_format = 2
> osd_recovery_max_chunk = 8388608
> osd_recovery_op_priority = 1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_threads = 1
> filestore_max_sync_interval = 15
> filestore_op_threads = 8
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> osd_disk_threads = 8
> osd_op_threads = 8
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_crush_update_on_start = false
>
> [client]
> rbd_cache = true
> admin_socket = /var/run/ceph/$name.$pid.asok
>
>
> I would like to get to the bottom of these issues. I am not sure whether
> they can be fixed by changing some settings in ceph.conf or only by a full
> downgrade back to Firefly. Is a downgrade even possible on a production
> cluster?
>
> Thanks for your help
>
> Andrei
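
As a starting point for Sam's request, something like the sketch below could be
used to pull the OSD up/down transitions out of ceph.log around the problematic
periods. This is only a rough sketch: the log path and the "marked down" /
"failed" / "boot" substrings are assumptions about the default cluster log
wording, so adjust them to match what your log actually contains.

#!/usr/bin/env python
# Rough sketch: scan the cluster log for OSD state-change lines.
# Assumptions: the cluster log lives at /var/log/ceph/ceph.log on a mon host,
# and lines about OSD flaps contain substrings such as "marked down",
# "marked itself down", "wrongly marked me down", "failed" or "boot".
import re
import sys

LOG_PATH = "/var/log/ceph/ceph.log"   # default location (assumption)
PATTERNS = re.compile(r"marked (?:itself )?down|wrongly marked me down|failed|boot")

def scan(path):
    hits = []
    with open(path) as log:
        for line in log:
            # keep only lines that mention an osd and one of the patterns above
            if "osd." in line and PATTERNS.search(line):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
    for line in scan(path):
        print(line)

Running it over the log from one of the freeze windows should show whether the
osds were reported failed by peers or marked themselves down.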
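
For the intermittent degraded-PG alerts, a small polling loop around
"ceph -s --format json" can help correlate when PGs go degraded with the OSD
flaps seen in the log. The exact JSON layout (pgmap -> pgs_by_state with
"state_name" and "count" keys) is an assumption about the status output on
this release; check it against your own cluster's output before relying on it.

#!/usr/bin/env python
# Rough sketch: poll cluster status and report whenever degraded PGs appear.
# Assumption: "ceph -s --format json" lists PG states under
# pgmap -> pgs_by_state; verify the field names on your cluster.
import json
import subprocess
import time

def degraded_pg_count():
    out = subprocess.check_output(["ceph", "-s", "--format", "json"])
    status = json.loads(out)
    total = 0
    for entry in status.get("pgmap", {}).get("pgs_by_state", []):
        if "degraded" in entry.get("state_name", ""):
            total += entry.get("count", 0)
    return total

if __name__ == "__main__":
    while True:
        count = degraded_pg_count()
        if count:
            print("%d degraded PGs at %s" % (count, time.strftime("%Y-%m-%d %H:%M:%S")))
        time.sleep(30)

Timestamps from this loop lined up against the ceph.log entries should make it
clear whether the degraded PGs follow the osds being marked down.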