Hello cephers,
I need your help and suggestions on what is going on with my cluster. A few weeks ago I upgraded from Firefly to Giant. I've previously written about issues with Giant where, over a two-week period, the cluster's IO froze three times after ceph marked two OSDs down. I have just 17 OSDs in total across two OSD servers, plus 3 mons. The cluster is running on Ubuntu 12.04 with the latest updates.
I've got Zabbix agents monitoring the OSD servers and the cluster, and I get alerts for any issues, such as problems with PGs. Since upgrading to Giant, I am frequently seeing emails alerting that the cluster has degraded PGs, around 10-15 such emails per day. The number of degraded PGs varies from a couple of PGs to over a thousand. After several minutes the cluster repairs itself. The total number of PGs across all pools is 4412.
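For reference, when one of these alerts fires the degraded state is visible with the standard Ceph CLI, roughly like this (just an illustration of the sort of commands involved, not the exact Zabbix item scripts):

# overall cluster health, including the degraded PG count/ratio
ceph -s
ceph health detail

# list the PGs that are currently stuck unclean/degraded
ceph pg dump_stuck unclean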
I am also seeing more alerts from VMs reporting high IO wait as well as hung tasks. Some VMs are reporting over 50% IO wait.
This did not happen on Firefly or previous releases of Ceph. Not much has changed in the cluster since the upgrade to Giant: the networking and hardware are the same, it is still running the same version of Ubuntu, and the cluster load hasn't changed either. Thus, I think the issues above are related to the upgrade to Giant.
Here is the ceph.conf that I use:
[global]
fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
auth_supported = cephx
osd_journal_size = 10240
filestore_xattr_use_omap = true
public_network = 192.168.168.0/24
rbd_default_format = 2
osd_recovery_max_chunk = 8388608
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_threads = 1
filestore_max_sync_interval = 15
filestore_op_threads = 8
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_disk_threads = 8
osd_op_threads = 8
osd_pool_default_pg_num = 1024
osd_pool_default_pgp_num = 1024
osd_crush_update_on_start = false
[client]
rbd_cache = true
admin_socket = /var/run/ceph/$name.$pid.asok
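In case it helps, the recovery/backfill throttling above can also be checked or adjusted at runtime through the admin sockets, along these lines (a rough sketch only; osd.0 is just an example daemon):

# confirm the values an OSD is actually running with
ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_max_active|osd_recovery_op_priority'

# or change them on the fly without restarting the OSDs
ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'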
I would like to get to the bottom of these issues. I am not sure whether they could be fixed by changing some settings in ceph.conf, or whether a full downgrade back to Firefly is needed. Is a downgrade even possible on a production cluster?
Thanks for your help
Andrei