Ok, why is ceph marking osds down? Post your ceph.log from one of the
problematic periods.
-Sam

On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky <andrei@xxxxxxxxxx> wrote:
> Hello cephers,
>
> I need your help and suggestions on what is going on with my cluster. A few
> weeks ago I upgraded from Firefly to Giant. I've previously written about
> having issues with Giant where, over a two-week period, the cluster's IO
> froze three times after ceph marked two osds down. I have in total just 17
> osds across two osd servers, plus 3 mons. The cluster is running on Ubuntu
> 12.04 with the latest updates.
>
> I've got zabbix agents monitoring the osd servers and the cluster. I get
> alerts on any issues, such as problems with PGs, etc. Since upgrading to
> Giant, I am frequently seeing emails alerting me that the cluster has
> degraded PGs - around 10-15 such emails per day. The number of degraded PGs
> varies between a couple of PGs and over a thousand. After several minutes
> the cluster repairs itself. The total number of PGs in the cluster is 4412
> across all the pools.
>
> I am also seeing more alerts from vms reporting high IO wait, as well as
> hung tasks. Some vms report over 50% io wait.
>
> This did not happen on Firefly or previous releases of ceph. Not much has
> changed in the cluster since the upgrade to Giant. Networking and hardware
> are still the same, and it is still running the same version of Ubuntu. The
> cluster load hasn't changed either. Thus, I think the issues above are
> related to the upgrade of ceph to Giant.
>
> Here is the ceph.conf that I use:
>
> [global]
> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
> auth_supported = cephx
> osd_journal_size = 10240
> filestore_xattr_use_omap = true
> public_network = 192.168.168.0/24
> rbd_default_format = 2
> osd_recovery_max_chunk = 8388608
> osd_recovery_op_priority = 1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_threads = 1
> filestore_max_sync_interval = 15
> filestore_op_threads = 8
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> osd_disk_threads = 8
> osd_op_threads = 8
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_crush_update_on_start = false
>
> [client]
> rbd_cache = true
> admin_socket = /var/run/ceph/$name.$pid.asok
>
>
> I would like to get to the bottom of these issues. I am not sure whether
> they can be fixed by changing some settings in ceph.conf or only by a full
> downgrade back to Firefly. Is a downgrade even possible on a production
> cluster?
>
> Thanks for your help
>
> Andrei
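
As a starting point for Sam's request, something like the sketch below could be
used to pull the OSD up/down transitions out of ceph.log around the problematic
periods. This is only a rough sketch: the log path and the "marked down" /
"failed" / "boot" substrings are assumptions about the default cluster log
wording, so adjust them to match what your log actually contains.

#!/usr/bin/env python
# Rough sketch: scan the cluster log for OSD state-change lines.
# Assumptions: the cluster log lives at /var/log/ceph/ceph.log on a mon host,
# and lines about OSD flaps contain substrings such as "marked down",
# "marked itself down", "wrongly marked me down", "failed" or "boot".
import re
import sys

LOG_PATH = "/var/log/ceph/ceph.log"   # default location (assumption)
PATTERNS = re.compile(r"marked (?:itself )?down|wrongly marked me down|failed|boot")

def scan(path):
    hits = []
    with open(path) as log:
        for line in log:
            # keep only lines that mention an osd and one of the patterns above
            if "osd." in line and PATTERNS.search(line):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else LOG_PATH
    for line in scan(path):
        print(line)

Running it over the log from one of the freeze windows should show whether the
osds were reported failed by peers or marked themselves down.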
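
For the intermittent degraded-PG alerts, a small polling loop around
"ceph -s --format json" can help correlate when PGs go degraded with the OSD
flaps seen in the log. The exact JSON layout (pgmap -> pgs_by_state with
"state_name" and "count" keys) is an assumption about the status output on
this release; check it against your own cluster's output before relying on it.

#!/usr/bin/env python
# Rough sketch: poll cluster status and report whenever degraded PGs appear.
# Assumption: "ceph -s --format json" lists PG states under
# pgmap -> pgs_by_state; verify the field names on your cluster.
import json
import subprocess
import time

def degraded_pg_count():
    out = subprocess.check_output(["ceph", "-s", "--format", "json"])
    status = json.loads(out)
    total = 0
    for entry in status.get("pgmap", {}).get("pgs_by_state", []):
        if "degraded" in entry.get("state_name", ""):
            total += entry.get("count", 0)
    return total

if __name__ == "__main__":
    while True:
        count = degraded_pg_count()
        if count:
            print("%d degraded PGs at %s" % (count, time.strftime("%Y-%m-%d %H:%M:%S")))
        time.sleep(30)

Timestamps from this loop lined up against the ceph.log entries should make it
clear whether the degraded PGs follow the osds being marked down.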